kubernetes / kubeadm

Aggregator for issues filed against kubeadm
Apache License 2.0

1 minute slower since kubernetes version 1.20.3 - [kubelet-check] Initial timeout of 40s passed. #2395

Closed medyagh closed 3 years ago

medyagh commented 3 years ago

Hello, on minikube we updated our default kubernetes to v1.20.4, and we noticed that bumping from v1.20.2 to v1.20.3 causes a 1 minute slowdown on minikube. The logs show the time is spent in kubeadm init: https://github.com/kubernetes/minikube/issues/10545


    ▪ Booting up control plane ...
I0220 15:44:24.522119   82905 command_runner.go:123] > [control-plane] Creating static Pod manifest for "kube-apiserver"
I0220 15:44:24.522344   82905 command_runner.go:123] > [control-plane] Creating static Pod manifest for "kube-controller-manager"
I0220 15:44:24.522442   82905 command_runner.go:123] > [control-plane] Creating static Pod manifest for "kube-scheduler"
I0220 15:44:24.522659   82905 command_runner.go:123] > [etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
I0220 15:44:24.522894   82905 command_runner.go:123] > [wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
I0220 15:45:04.513403   82905 command_runner.go:123] > [kubelet-check] Initial timeout of 40s passed.
I0220 15:45:28.017602   82905 command_runner.go:123] > [apiclient] All control plane components are healthy after 63.506128 seconds
I0220 15:45:28.017909   82905 command_runner.go:123] > [upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
I0220 15:45:28.031480   82905 command_runner.go:123] > [kubelet] Creating a ConfigMap "kubelet-config-1.20" in namespace kube-system with the configuration for the kubelets in the cluster
I0220 15:45:28.557112   82905 command_runner.go:123] > [upload-certs] Skipping phase. Please see --upload-certs
I0220 15:45:28.557337   82905 command_runner.go:123] > [mark-control-plane] Marking the node minikube as control-plane by adding the labels "node-role.kubernetes.io/master=''" and "node-role.kubernetes.io/control-plane='' (deprecated)"
I0220 15:45:29.070246   82905 command_runner.go:123] > [bootstrap-token] Using token: ckmqq4.dov5m97q5ko44fpg
I0220 15:45:29.159638   82905 out.go:140]     ▪ Configuring RBAC rules ..

is there a new config or something that we are missing in v1.20.3 that adds an additional check and causes this 1 minute tax?

screenshot of our performance dashboard

neolit123 commented 3 years ago

hi, @medyagh

I0220 15:45:04.513403 82905 command_runner.go:123] > [kubelet-check] Initial timeout of 40s passed

this indicates the kubelet takes a while to become ready.

i don't think we have anything on the kubeadm side that can cause it, so it might be a core issue. i will investigate.

neolit123 commented 3 years ago

i can confirm the problem in our CI (kinder-based, running containerd-in-docker in a prow pod):

this is using a 1.20.4 kubeadm with a 1.20.4 kubelet.

I0220 17:01:12.877602     169 request.go:943] Got a Retry-After 1s response for attempt 8 to https://172.17.0.7:6443/healthz?timeout=10s
I0220 17:01:13.878699     169 round_trippers.go:445] GET https://172.17.0.7:6443/healthz?timeout=10s in 0 milliseconds
I0220 17:01:13.878760     169 request.go:943] Got a Retry-After 1s response for attempt 9 to https://172.17.0.7:6443/healthz?timeout=10s
[kubelet-check] Initial timeout of 40s passed.

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kinder-1-20/1363170590768238592/build-log.txt

this is the whole diff of k/k between 1.20.2 and 1.20.3: https://github.com/kubernetes/kubernetes/compare/v1.20.2...v1.20.3

there are kubeadm commits that change the default image repository for CI images, which doesn't seem related. i do see a few kubelet changes:

i will now try a 1.20.3 kubeadm with a 1.20.2 kubelet and bisect.

neolit123 commented 3 years ago

this is also a problem in the tip of 1.21 and 1.19:

[kubelet-check] Initial timeout of 40s passed.

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kinder-latest/1363270239839588352/build-log.txt

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kinder-1-19/1363254137684234240/build-log.txt

so we must have backported something that causes this.

looking at logs and diffs this started happening on the 13th of Jan. my current suspect commit is: https://github.com/kubernetes/kubernetes/commit/88ee6d777fa343bfeec0cf5b7b0e83ef0bd8321d

dims commented 3 years ago

@SergeyKanzhelev @ehashman please see a possible regression

neolit123 commented 3 years ago

before https://github.com/kubernetes/kubernetes/commit/88ee6d777fa343bfeec0cf5b7b0e83ef0bd8321d

[apiclient] All control plane components are healthy after 26.046012 seconds

after:

[apiclient] All control plane components are healthy after 69.034837 seconds

the PR that did this is: https://github.com/kubernetes/kubernetes/pull/94087

so i can see this wait making sense, but it should bail out if this is the first node and there is no API server yet.
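The suggested bail-out can be sketched as a small admission gate (hypothetical names; the real logic lives in the kubelet's config sources and pod admission path, changed by k/k PR 94087):

```go
package main

import "fmt"

// shouldWaitForInformerSync models the gate added in k/k PR 94087 plus the
// bail-out suggested above: a kubelet with no API client (standalone, or
// bootstrapping the API server it will later register against) has nothing
// to sync from, so it should admit static pods immediately.
func shouldWaitForInformerSync(hasAPIClient, sourcesSynced bool) bool {
	if !hasAPIClient {
		// No API server to sync from: admit static pods right away.
		return false
	}
	// Otherwise hold pod admission until all config sources have synced,
	// avoiding admitting pods against stale state.
	return !sourcesSynced
}

func main() {
	fmt.Println(shouldWaitForInformerSync(false, false)) // standalone: no wait
	fmt.Println(shouldWaitForInformerSync(true, false))  // still syncing: wait
	fmt.Println(shouldWaitForInformerSync(true, true))   // synced: no wait
}
```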

neolit123 commented 3 years ago

cc @BenTheElder for kind

neolit123 commented 3 years ago

i have commented on the original PR.

liggitt commented 3 years ago

for a self-hosting kubelet, informer sync will fail until the API server is available. Does the delay mean that it takes 40 seconds to start the self-hosted API server? or does it mean that GetNode() is called 4 times and times out in the process of starting static pods?

neolit123 commented 3 years ago

or does it mean that GetNode() is called 4 times and times out in the process of starting static pods?

i suspect that something like that is happening.

the base of the time comparison is a check in kubeadm where it asks both the kubelet and the apiserver to report 200 on their /healthz. normally the kubelet readies up within seconds and then it starts the apiserver pod. before the change, this completes in ~26 seconds.

after this change the kubelet /healthz takes more than 40 seconds to return 200, as it waits on the apiserver for the informer sync.

brandond commented 3 years ago

So just to be clear, the apiserver still starts up just as quickly, it's just that the kubelet now does not return 200 on healthz until the apiserver is also up?

neolit123 commented 3 years ago

So just to be clear, the apiserver still starts up just as quickly, it's just that the kubelet now does not return 200 on healthz until the apiserver is also up?

seems that both the kubelet and api-server healthz are delayed. yesterday i did not add sufficient logging in the kubelet to figure out what's going on.

logs from kubeadm:

[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
I0220 17:00:34.356861     169 round_trippers.go:445] GET https://172.17.0.7:6443/healthz?timeout=10s  in 1 milliseconds
...
[kubelet-check] Initial timeout of 40s passed.
...
I0220 17:01:43.390484     169 round_trippers.go:445] GET https://172.17.0.7:6443/healthz?timeout=10s 200 OK in 13 milliseconds
I0220 17:01:43.390615     169 uploadconfig.go:108] [upload-config] Uploading the kubeadm ClusterConfiguration to a ConfigMap
[apiclient] All control plane components are healthy after 69.035888 seconds
neolit123 commented 3 years ago

i have a WIP proposal fix for the kubelet. will try to send later.

dims commented 3 years ago

cc @adisky

ehashman commented 3 years ago

I haven't noticed any substantial delays in cluster creation on Kubernetes since this patch went in. Is it possible this is minikube-specific and should be fixed in minikube bootstrap?

The patch in https://github.com/kubernetes/kubernetes/pull/94087 is a correctness fix to avoid nodes admitting pods before they have fully synced pod state. We've seen many issues with pods getting stuck in NodeAffinity status after cold-starting clusters without this patch, because the kubelet starts admitting pods before it is ready.

so i can see this wait making sense, but it should bail out if this is the first node and there is no API server yet.

I don't think the kubelet has any way of knowing it's the first node? Are you suggesting a change in k8s or kubeadm?

Edit: oops I thought I hit send on this comment 3h ago but apparently not :sweat_smile:

neolit123 commented 3 years ago

I haven't noticed any substantial delays in cluster creation on Kubernetes since this patch went in

what cluster creation tooling / e2e test jobs are you checking this against?

is it possible this is minikube-specific and should be fixed on minikube bootstrap?

this patch definitely creates a problem whenever the api-server is managed as a static pod, so it's not minikube-specific; it affects kubeadm and anything that uses kubeadm, such as minikube.

The patch in kubernetes/kubernetes#94087 is a correctness fix to avoid nodes admitting pods before they have fully synced pod state.

i do not disagree with the intent, i just think the implementation can be improved.

neolit123 commented 3 years ago

fix attempt at: https://github.com/kubernetes/kubernetes/pull/99336

ehashman commented 3 years ago

What I'm confused by is why this would affect bootstrap at all. For a standalone kubelet, the checks are always bypassed: https://github.com/kubernetes/kubernetes/pull/94087/files#diff-67dd9e8f3ee257072765326cb4f242852554a2c0753563fa51e292c0a63a7b94R455-R460

what cluster creation tooling / e2e test jobs are you checking this against?

OpenShift and upstream Kubernetes. I don't have specific jobs to point you to though, idk if there's an easy way to gather stats on cluster start time from testgrid.

liggitt commented 3 years ago

"self hosted" is not always standalone. kubeadm sets up a kubelet which runs the apiserver and then registers itself as a node to the API server it started

BenTheElder commented 3 years ago

In particular for development clusters like kind/minikube etc. using kubeadm it's common to have a single node where the kubelet both runs the control plane as static pods and registers as a worker.

But as Jordan mentioned even kubeadm clusters with N nodes have static pod apiserver on certain nodes + registered kubelets for all nodes. This will slow down bootstrap for pretty much any kubeadm cluster since most work is blocked on healthy API server / first node.

idk if there's an easy way to gather stats on cluster start time from testgrid.

There is data of sorts, just not what you want probably. Jobs running under kubetest (GCE) record the Up step (cluster bringup) time, but that includes more than just the core cluster bringup. IIRC kind jobs log time for bootstrap and more in the logs, but not to junit currently (pending rewriting how we run those tests).

Local cluster jobs (local-up-cluster.sh, kind) are currently impacted by CI/prow.k8s.io I/O throttling issues impacting pod startup, build times, etc. badly in general and wouldn't make useful performance data until that's resolved.

brandond commented 3 years ago

"self hosted" is not always standalone. kubeadm sets up a kubelet which runs the apiserver and then registers itself as a node to the API server it started

Our RKE2 project does the same thing. Runs etcd plus apiserver and sundry as static pods; the kubelet registers to the local apiserver once it's up.

fabriziopandini commented 3 years ago

FYI, got pinged about this issue by some Cluster API users...

neolit123 commented 3 years ago

update

PR in master (1.22) merged: https://github.com/kubernetes/kubernetes/pull/99336

the change is backported all the way back to 1.18 (inclusive); backport PRs are pending. https://github.com/kubernetes/kubernetes/pull/99336#issuecomment-824441152

neolit123 commented 3 years ago

PRs merged.

neolit123 commented 3 years ago

(i guess the 1.21 PR has not merged yet, but is about to.)

you can expect new releases after the 12th of May.