kubernetes-retired / heapster

[EOL] Compute Resource Usage Analysis and Monitoring of Container Clusters
Apache License 2.0
2.63k stars 1.25k forks

Grafana only shows containers of the first node, and there is residual data #681

Closed JunejaTung closed 6 years ago

JunejaTung commented 8 years ago

Deployed environment

There are 2 Ready nodes: the first is 172.27.8.211, the second is 172.27.8.214. All of the heapster pods are running on the second node.

[root@wlan-cloudserver31 influxdb-test]# kubectl get nodes
NAME           LABELS                                STATUS
172.27.8.211   kubernetes.io/hostname=172.27.8.211   Ready
172.27.8.212   kubernetes.io/hostname=172.27.8.212   NotReady
172.27.8.214   kubernetes.io/hostname=172.27.8.214   Ready
[root@wlan-cloudserver31 influxdb-test]# kubectl get pods --all-namespaces -o wide
NAMESPACE     NAME                     READY     STATUS    RESTARTS   AGE       NODE
default       redis-master-rze58       1/1       Running   2          21h       172.27.8.214
kube-system   heapster-zr8zh           1/1       Running   0          2h        172.27.8.214
kube-system   influxdb-grafana-h9rnx   2/2       Running   0          2h        172.27.8.214
[root@wlan-cloudserver31 influxdb-test]# kubectl get service --all-namespaces -o wide
NAMESPACE     NAME                  LABELS                                                                           SELECTOR             IP(S)            PORT(S)
default       kubernetes            component=apiserver,provider=kubernetes                                          <none>               10.254.0.1       443/TCP
kube-system   heapster              kubernetes.io/cluster-service=true,kubernetes.io/name=Heapster                   k8s-app=heapster     10.254.189.105   8082/TCP
kube-system   kube-dns              k8s-app=kube-dns,kubernetes.io/cluster-service=true,kubernetes.io/name=KubeDNS   k8s-app=kube-dns     10.254.237.18    53/UDP
                                                                                                                                                           53/TCP
kube-system   monitoring-grafana    kubernetes.io/cluster-service=true,kubernetes.io/name=monitoring-grafana         name=influxGrafana   10.254.225.111   3000/TCP
kube-system   monitoring-influxdb   kubernetes.io/cluster-service=true,kubernetes.io/name=monitoring-influxdb        name=influxGrafana   10.254.10.95     8083/TCP
                                                                                                                                                           8086/TCP
[root@wlan-cloudserver31 influxdb-test]# 

Containers on the second node (172.27.8.214):

[root@wlan-cloudserver34 wlanuser]# docker ps -a
CONTAINER ID        IMAGE                                  COMMAND                CREATED             STATUS              PORTS               NAMES
e42663a65e5c        kubernetes/heapster:canary             "/heapster --vmodule   2 hours ago         Up 2 hours                              k8s_heapster.3a2d03b7_heapster-zr8zh_kube-system_0cccc30c-81de-11e5-a8b4-fa163e77e286_cc4296d5           
6a5a1c29a6d0        gcr.io/google_containers/pause:0.8.0   "/pause"               2 hours ago         Up 2 hours                              k8s_POD.e4cc795_heapster-zr8zh_kube-system_0cccc30c-81de-11e5-a8b4-fa163e77e286_d9525a22                 
68642b91b498        kubernetes/heapster_grafana:v2.1.0     "/bin/sh -c /run.sh"   2 hours ago         Up 2 hours                              k8s_grafana.c8bbb6fb_influxdb-grafana-h9rnx_kube-system_7801146f-81d8-11e5-a8b4-fa163e77e286_f74c6aad    
a1512b5069a1        kubernetes/heapster_influxdb:v0.5      "influxd --config /e   2 hours ago         Up 2 hours                              k8s_influxdb.5fc13a9e_influxdb-grafana-h9rnx_kube-system_7801146f-81d8-11e5-a8b4-fa163e77e286_85de60d3   
01a718cb0884        gcr.io/google_containers/pause:0.8.0   "/pause"               2 hours ago         Up 2 hours                              k8s_POD.ac4f2d56_influxdb-grafana-h9rnx_kube-system_7801146f-81d8-11e5-a8b4-fa163e77e286_695acc80        
16046af3c28b        redis                                  "/entrypoint.sh redi   3 hours ago         Up 3 hours                              k8s_master.1681ebfb_redis-master-rze58_default_92a91592-8138-11e5-a8b4-fa163e77e286_f487fb2d             
eee93caeeebb        gcr.io/google_containers/pause:0.8.0   "/pause"               3 hours ago         Up 3 hours                              k8s_POD.49eee8c2_redis-master-rze58_default_92a91592-8138-11e5-a8b4-fa163e77e286_559570d3            

Grafana dashboards

The Containers dashboard only shows containers of the first node, and it contains residual (stale) entries: (screenshot: cont)

Also, the Kubernetes Cluster dashboard only shows the first node (172.27.8.211): (screenshot: clut)

Using show series in InfluxDB, I can find container data for the second node (172.27.8.214), so why does Grafana only show the first node? I can also find some residual data in the underlying results. (screenshot: image)

vishh commented 8 years ago

cc @thucatebay

thucatebay commented 8 years ago

Here's the query that Grafana uses to fetch the list of nodes: select distinct(hostname) from "memory/limit_bytes_gauge" where time > now() - 10m

for namespaces: select distinct(pod_namespace) from "uptime_ms_cumulative" where time > now() - 1m

for pods: select distinct(pod_name) from "uptime_ms_cumulative" where pod_namespace =~ /$namespace/ and time > now() - 1m

and for containers: select distinct(container_name) from "uptime_ms_cumulative" where pod_name =~ /$pod/ and "pod_namespace" =~ /$namespace/ and time > now() - 1m

You can change the time filter to go back further and pick up more data. These queries are defined under the "Templating" section of the dashboard settings menu.

I'll look into whether it's possible to use the selected time filter instead of hardcoding it.

JunejaTung commented 8 years ago

@thucatebay So far I haven't managed to configure the Grafana query well. I made some changes to node 172.27.8.212 to bring it back to Ready, and then found something interesting. I suspect there may be a bug when the cluster has a NotReady node, or a node with no running containers: in that scenario, heapster can't get metrics for the Ready nodes that come after the NotReady one.

[root@wlan-cloudserver31 influxdb-test]# 
[root@wlan-cloudserver31 influxdb-test]# kubectl get nodes
NAME                 LABELS                                      STATUS
172.27.8.211         kubernetes.io/hostname=172.27.8.211         Ready
172.27.8.212         kubernetes.io/hostname=172.27.8.212         Ready
172.27.8.214         kubernetes.io/hostname=172.27.8.214         Ready
[root@wlan-cloudserver31 influxdb-test]# kubectl get pods --all-namespaces -o wide
NAMESPACE     NAME                     READY     STATUS    RESTARTS   AGE       NODE
default       busybox                  1/1       Running   2          4d        172.27.8.211
default       redis-master-rze58       1/1       Running   2          6d        172.27.8.214
kube-system   heapster-zcd76           1/1       Running   0          4d        172.27.8.214
kube-system   influxdb-grafana-h9rnx   2/2       Running   0          6d        172.27.8.214
kube-system   kube-dns-v9-u07x6        4/4       Running   0          2d        172.27.8.211
[root@wlan-cloudserver31 influxdb-test]# 

(screenshots: co, cl)

The logs from the heapster container:

I1104 11:53:30.007492       1 kubelet.go:99] url: "http://172.27.8.211:10255/stats/default/busybox/d22d309b-82c4-11e5-a8b4-fa163e77e286/busybox", body: "{\"num_stats\":60,\"start\":\"2015-11-04T11:53:25Z\",\"end\":\"2015-11-04T11:53:30Z\"}", data: {ContainerReference:{Name:/system.slice/docker-e1a52f29c1bd116ecf599527f6c0d9e08b625b6c4edc8f69d3d1568d5299e382.scope Aliases:[k8s_busybox.d1c8ce40_busybox_default_d22d309b-82c4-11e5-a8b4-fa163e77e286_13b42095 e1a52f29c1bd116ecf599527f6c0d9e08b625b6c4edc8f69d3d1568d5299e382] Namespace:docker} Subcontainers:[] Spec:{CreationTime:2015-11-04 11:23:30.664343161 +0000 UTC Labels:map[io.kubernetes.pod.name:default/busybox] HasCpu:true Cpu:{Limit:2 MaxLimit:0 Mask:0-7} HasMemory:true Memory:{Limit:18446744073709551615 Reservation:0 SwapLimit:18446744073709551615} HasNetwork:true HasFilesystem:false HasDiskIo:true HasCustomMetrics:false CustomMetrics:[]} Stats:[]}
I1104 11:53:30.054508       1 manager.go:175] completed scraping data from sources. Errors: []
I1104 11:53:35.000745       1 manager.go:162] starting to scrape data from sources start: 2015-11-04 11:53:30 +0000 UTC end: 2015-11-04 11:53:35 +0000 UTC
I1104 11:53:35.000915       1 manager.go:103] attempting to get data from source "Kube Node Metrics Source"
I1104 11:53:35.000940       1 manager.go:103] attempting to get data from source "Kube Events Source"
I1104 11:53:35.001114       1 kube.go:79] Only have PublicIP 172.27.8.214 for node 172.27.8.214, so using it for InternalIP
I1104 11:53:35.001104       1 kube_events.go:216] Fetched list of events from the master
I1104 11:53:35.001188       1 kube_events.go:217] []
I1104 11:53:35.001279       1 kube.go:79] Only have PublicIP 172.27.8.211 for node 172.27.8.211, so using it for InternalIP
I1104 11:53:35.001307       1 kube.go:79] Only have PublicIP 172.27.8.212 for node 172.27.8.212, so using it for InternalIP
I1104 11:53:35.001275       1 manager.go:103] attempting to get data from source "Kube Pods Source"
I1104 11:53:35.001323       1 kube_nodes.go:126] Fetched list of nodes from the master
I1104 11:53:35.001367       1 kube.go:79] Only have PublicIP 172.27.8.214 for node 172.27.8.214, so using it for InternalIP
I1104 11:53:35.001409       1 kube.go:79] Only have PublicIP 172.27.8.211 for node 172.27.8.211, so using it for InternalIP
I1104 11:53:35.001450       1 kube.go:79] Only have PublicIP 172.27.8.212 for node 172.27.8.212, so using it for InternalIP
I1104 11:53:35.001546       1 pods.go:152] selected pods from api server [{pod:0xc20830e5d0 nodeInfo:0xc2082bc140 namespace:0xc2081400e8} {pod:0xc20830e7c0 nodeInfo:0xc2082bc180 namespace:0xc2081400e8} {pod:0xc20830e000 nodeInfo:0xc2082bc1c0 namespace:0xc208140000} {pod:0xc20830e1f0 nodeInfo:0xc2082bc200 namespace:0xc208140000} {pod:0xc20830e3e0 nodeInfo:0xc2082bc240 namespace:0xc2081400e8}]
I1104 11:53:35.001927       1 kubelet.go:110] about to query kubelet using url: "http://172.27.8.214:10255/stats/kube-system/influxdb-grafana-h9rnx/7801146f-81d8-11e5-a8b4-fa163e77e286/influxdb"
I1104 11:53:35.002068       1 kubelet.go:110] about to query kubelet using url: "http://172.27.8.214:10255/stats/kube-system/kube-dns-v9-oaep5/06e1de3d-82c2-11e5-a8b4-fa163e77e286/etcd"
I1104 11:53:35.002145       1 kubelet.go:110] about to query kubelet using url: "http://172.27.8.211:10255/stats/default/busybox/d22d309b-82c4-11e5-a8b4-fa163e77e286/busybox"
I1104 11:53:35.002244       1 kubelet.go:110] about to query kubelet using url: "http://172.27.8.214:10255/stats/default/redis-master-rze58/92a91592-8138-11e5-a8b4-fa163e77e286/master"
I1104 11:53:35.002346       1 kubelet.go:110] about to query kubelet using url: "http://172.27.8.214:10255/stats/kube-system/heapster-zcd76/3a99af72-82ea-11e5-a8b4-fa163e77e286/heapster"
I1104 11:53:35.003034       1 kubelet.go:96] failed to get stats from kubelet url: http://172.27.8.214:10255/stats/kube-system/influxdb-grafana-h9rnx/7801146f-81d8-11e5-a8b4-fa163e77e286/influxdb - Get http://172.27.8.214:10255/stats/kube-system/influxdb-grafana-h9rnx/7801146f-81d8-11e5-a8b4-fa163e77e286/influxdb: dial tcp 172.27.8.214:10255: connection refused
I1104 11:53:35.003081       1 kube_pods.go:110] failed to get stats for container "influxdb" in pod "kube-system"/"influxdb-grafana-h9rnx"
I1104 11:53:35.003091       1 kube_nodes.go:59] Failed to get container stats from Kubelet on node "172.27.8.214"
I1104 11:53:35.003139       1 kubelet.go:110] about to query kubelet using url: "http://172.27.8.214:10255/stats/kube-system/influxdb-grafana-h9rnx/7801146f-81d8-11e5-a8b4-fa163e77e286/grafana"
I1104 11:53:35.003165       1 kubelet.go:96] failed to get stats from kubelet url: http://172.27.8.214:10255/stats/kube-system/kube-dns-v9-oaep5/06e1de3d-82c2-11e5-a8b4-fa163e77e286/etcd - Get http://172.27.8.214:10255/stats/kube-system/kube-dns-v9-oaep5/06e1de3d-82c2-11e5-a8b4-fa163e77e286/etcd: dial tcp 172.27.8.214:10255: connection refused
I1104 11:53:35.003200       1 kube_pods.go:110] failed to get stats for container "etcd" in pod "kube-system"/"kube-dns-v9-oaep5"
I1104 11:53:35.003221       1 kubelet.go:110] about to query kubelet using url: "http://172.27.8.214:10255/stats/kube-system/kube-dns-v9-oaep5/06e1de3d-82c2-11e5-a8b4-fa163e77e286/kube2sky"
I1104 11:53:35.003350       1 kubelet.go:96] failed to get stats from kubelet url: http://172.27.8.214:10255/stats/kube-system/influxdb-grafana-h9rnx/7801146f-81d8-11e5-a8b4-fa163e77e286/grafana - Get http://172.27.8.214:10255/stats/kube-system/influxdb-grafana-h9rnx/7801146f-81d8-11e5-a8b4-fa163e77e286/grafana: dial tcp 172.27.8.214:10255: connection refused
I1104 11:53:35.003416       1 kube_pods.go:110] failed to get stats for container "grafana" in pod "kube-system"/"influxdb-grafana-h9rnx"
I1104 11:53:35.003476       1 kubelet.go:96] failed to get stats from kubelet url: http://172.27.8.214:10255/stats/kube-system/kube-dns-v9-oaep5/06e1de3d-82c2-11e5-a8b4-fa163e77e286/kube2sky - Get http://172.27.8.214:10255/stats/kube-system/kube-dns-v9-oaep5/06e1de3d-82c2-11e5-a8b4-fa163e77e286/kube2sky: dial tcp 172.27.8.214:10255: connection refused
I1104 11:53:35.003515       1 kube_pods.go:110] failed to get stats for container "kube2sky" in pod "kube-system"/"kube-dns-v9-oaep5"
I1104 11:53:35.003532       1 kubelet.go:110] about to query kubelet using url: "http://172.27.8.214:10255/stats/kube-system/kube-dns-v9-oaep5/06e1de3d-82c2-11e5-a8b4-fa163e77e286/skydns"
I1104 11:53:35.003700       1 kubelet.go:96] failed to get stats from kubelet url: http://172.27.8.214:10255/stats/default/redis-master-rze58/92a91592-8138-11e5-a8b4-fa163e77e286/master - Get http://172.27.8.214:10255/stats/default/redis-master-rze58/92a91592-8138-11e5-a8b4-fa163e77e286/master: dial tcp 172.27.8.214:10255: connection refused
I1104 11:53:35.003767       1 kube_pods.go:110] failed to get stats for container "master" in pod "default"/"redis-master-rze58"
I1104 11:53:35.003774       1 kubelet.go:96] failed to get stats from kubelet url: http://172.27.8.214:10255/stats/kube-system/kube-dns-v9-oaep5/06e1de3d-82c2-11e5-a8b4-fa163e77e286/skydns - Get http://172.27.8.214:10255/stats/kube-system/kube-dns-v9-oaep5/06e1de3d-82c2-11e5-a8b4-fa163e77e286/skydns: dial tcp 172.27.8.214:10255: connection refused
I1104 11:53:35.003844       1 kube_pods.go:110] failed to get stats for container "skydns" in pod "kube-system"/"kube-dns-v9-oaep5"
I1104 11:53:35.003871       1 kubelet.go:110] about to query kubelet using url: "http://172.27.8.214:10255/stats/kube-system/kube-dns-v9-oaep5/06e1de3d-82c2-11e5-a8b4-fa163e77e286/healthz"
I1104 11:53:35.003986       1 kubelet.go:96] failed to get stats from kubelet url: http://172.27.8.214:10255/stats/kube-system/heapster-zcd76/3a99af72-82ea-11e5-a8b4-fa163e77e286/heapster - Get http://172.27.8.214:10255/stats/kube-system/heapster-zcd76/3a99af72-82ea-11e5-a8b4-fa163e77e286/heapster: dial tcp 172.27.8.214:10255: connection refused
I1104 11:53:35.004023       1 kube_pods.go:110] failed to get stats for container "heapster" in pod "kube-system"/"heapster-zcd76"
I1104 11:53:35.004200       1 kubelet.go:96] failed to get stats from kubelet url: http://172.27.8.214:10255/stats/kube-system/kube-dns-v9-oaep5/06e1de3d-82c2-11e5-a8b4-fa163e77e286/healthz - Get http://172.27.8.214:10255/stats/kube-system/kube-dns-v9-oaep5/06e1de3d-82c2-11e5-a8b4-fa163e77e286/healthz: dial tcp 172.27.8.214:10255: connection refused
I1104 11:53:35.004250       1 kube_pods.go:110] failed to get stats for container "healthz" in pod "kube-system"/"kube-dns-v9-oaep5"
I1104 11:53:35.007198       1 kube_nodes.go:59] Failed to get container stats from Kubelet on node "172.27.8.212"
I1104 11:53:35.010137       1 kubelet.go:99] url: "http://172.27.8.211:10255/stats/default/busybox/d22d309b-82c4-11e5-a8b4-fa163e77e286/busybox", body: "{\"num_stats\":60,\"start\":\"2015-11-04T11:53:30Z\",\"end\":\"2015-11-04T11:53:35Z\"}", data: {ContainerReference:{Name:/system.slice/docker-e1a52f29c1bd116ecf599527f6c0d9e08b625b6c4edc8f69d3d1568d5299e382.scope Aliases:[k8s_busybox.d1c8ce40_busybox_default_d22d309b-82c4-11e5-a8b4-fa163e77e286_13b42095 e1a52f29c1bd116ecf599527f6c0d9e08b625b6c4edc8f69d3d1568d5299e382] Namespace:docker} Subcontainers:[] Spec:{CreationTime:2015-11-04 11:23:30.664343161 +0000 UTC Labels:map[io.kubernetes.pod.name:default/busybox] HasCpu:true Cpu:{Limit:2 MaxLimit:0 Mask:0-7} HasMemory:true Memory:{Limit:18446744073709551615 Reservation:0 SwapLimit:18446744073709551615} HasNetwork:true HasFilesystem:false HasDiskIo:true HasCustomMetrics:false CustomMetrics:[]} Stats:[]}

I also tried running heapster with the same config on another Kubernetes cluster where all the nodes are Ready and have containers; there, Grafana shows all of the containers and nodes.

thucatebay commented 8 years ago

I've also seen this from time to time: heapster isn't able to get metrics from a node (one that is not ready, for example) and just gets stuck there. A connection timeout is needed when heapster connects to the kubelet.

vishh commented 8 years ago

cc @piosz @mwielgus

fejta-bot commented 6 years ago

Issues go stale after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Freeze the issue for 90d with /lifecycle frozen. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta. /lifecycle stale

DirectXMan12 commented 6 years ago

This is quite old. Please re-open if you are still experiencing the issue.