rancher / rancher


[BUG] Migration to cri-dockerd breaks kubelet metrics endpoint on v1.23.6 #39820

Open Heiko-san opened 1 year ago

Heiko-san commented 1 year ago

Rancher Server Setup

Information about the Cluster

User Information

Describe the bug

This only happens on v1.23.6 and seems to be fixed with v1.24.4 (on the same machine OS).

After enabling the "cri-dockerd" switch for downstream clusters, kubelet's /metrics endpoint takes about 45s to respond, causing the metrics-server call to time out. As a result, metrics-server never fully starts up/becomes ready, and of course the related metrics aren't available.

However, the other scrape endpoints (/metrics/cadvisor, /metrics/probes) respond quickly and work fine.
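For anyone wanting to compare the endpoints directly, something like this works (just a sketch; $TOKEN is any token allowed to talk to the kubelet API, e.g. the metrics-server service account token, and 1.2.3.4 is a placeholder node IP):

# time each kubelet scrape endpoint on port 10250
for ep in /metrics /metrics/cadvisor /metrics/probes /metrics/resource; do
  printf '%s: ' "$ep"
  curl -sk -o /dev/null -w '%{time_total}s\n' \
    -H "Authorization: Bearer $TOKEN" \
    "https://1.2.3.4:10250$ep"
done

The slow one here is /metrics (and, per the logs below, the /metrics/resource scrape); the others return almost immediately.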

...
E1208 14:29:32.945822       1 scraper.go:140] "Failed to scrape node" err="Get \"https://1.2.3.4:10250/metrics/resource\": context deadline exceeded" node="mynode1"
E1208 14:29:32.945825       1 scraper.go:140] "Failed to scrape node" err="Get \"https://1.2.3.4:10250/metrics/resource\": context deadline exceeded" node="mynode2"
I1208 14:29:37.149194       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"

kubectl -n kube-system get pod
NAME                                      READY   STATUS    RESTARTS   AGE
calico-kube-controllers-857854d74-fn2bd   1/1     Running   0          25h
canal-dz4dj                               2/2     Running   0          25h
canal-ls7sp                               2/2     Running   0          25h
canal-nfmmq                               2/2     Running   0          25h
canal-pdzz7                               2/2     Running   0          25h
canal-pqc24                               2/2     Running   0          25h
canal-sqbfb                               2/2     Running   0          25h
canal-tz79d                               2/2     Running   0          26h
coredns-548ff45b67-df2sj                  1/1     Running   0          25h
coredns-548ff45b67-t6mt5                  1/1     Running   0          25h
coredns-autoscaler-d5944f655-86bh4        1/1     Running   0          25h
metrics-server-5456dc796f-k66hg           0/1     Running   0          20h

To Reproduce

Enable cri-dockerd in an RKE1 downstream cluster with K8s v1.23.6 and have a look at the metrics-server logs (or try calling kubelet's /metrics endpoint with the service account token from metrics-server).
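The manual check could look roughly like this (a sketch; metrics-server is assumed to be the service account name, and on kubectl older than v1.24 the token has to be read from the service account's secret instead of using kubectl create token):

# get a token for the metrics-server service account and call kubelet directly
TOKEN=$(kubectl -n kube-system create token metrics-server)
curl -sk -H "Authorization: Bearer $TOKEN" \
  -o /dev/null -w 'HTTP %{http_code} after %{time_total}s\n' \
  https://1.2.3.4:10250/metrics

On an affected node this takes around 45s to return.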

Heiko-san commented 1 year ago

Actually, this issue also appears in v1.24.4 and v1.24.9. It seems it just took a while to show up, and it also doesn't appear on all clusters. We don't have a clue what triggers it yet, but restarting kubelet remediates it for a while.
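For anyone else hitting this, the temporary workaround looks roughly like this (a sketch assuming an RKE1 node, where kubelet runs as a Docker container named kubelet; the k8s-app=metrics-server label is the upstream default and may differ in your setup):

# on the affected node, restart the kubelet container
docker restart kubelet
# then watch metrics-server become ready again
kubectl -n kube-system get pod -l k8s-app=metrics-server -w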