kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

kubeadm 1.6.0 (only 1.6.0!!) is broken due to unconfigured CNI making kubelet NotReady #43815

Closed. jbeda closed this issue 7 years ago.

jbeda commented 7 years ago

Initial report in https://github.com/kubernetes/kubeadm/issues/212.

I suspect that this was introduced in https://github.com/kubernetes/kubernetes/pull/43474.

What is going on (all on single master):

  1. kubeadm starts and configures a kubelet, and uses static pods to bring up a control plane
  2. kubeadm creates the node object and waits for the kubelet to join and become ready
  3. the kubelet never becomes ready, so kubeadm waits forever

In the conditions list for the node:

  Ready         False   Wed, 29 Mar 2017 15:54:04 +0000     Wed, 29 Mar 2017 15:32:33 +0000     KubeletNotReady         runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Previous behavior was for the kubelet to join the cluster even with unconfigured CNI. The user will then typically run a DaemonSet with host networking to bootstrap CNI on all nodes. The fact that the node never joins means that, fundamentally, DaemonSets cannot be used to bootstrap CNI.
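
For anyone without the kubeadm context: the bootstrap referred to here is a single kubectl apply of the CNI provider's manifest, which creates a host-networked DaemonSet. A minimal sketch, using the Weave manifest URL quoted later in this thread purely as an example (any provider's manifest works the same way):

# After kubeadm init, install a CNI provider; its DaemonSet runs with hostNetwork
# and configures CNI on every node it lands on.
kubectl apply -f https://git.io/weave-kube-1.6
kubectl -n kube-system get daemonsets   # the CNI DaemonSet should show up here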

Edit by @mikedanese: please test the patched debian amd64 kubeadm with the fix: https://github.com/kubernetes/kubernetes/issues/43815#issuecomment-290616036

jbeda commented 7 years ago

/cc @lukemarsden @luxas @mikedanese

jbeda commented 7 years ago

If we revert #43474 completely, we are in a situation again where we break 0.2.0 CNI plugins (see https://github.com/kubernetes/kubernetes/issues/43014)

Should we consider doing something like https://github.com/kubernetes/kubernetes/pull/43284?

Also /cc @thockin

jbeda commented 7 years ago

/cc @kubernetes/sig-network-bugs

yujuhong commented 7 years ago

Ref: https://github.com/kubernetes/kubernetes/issues/43397#issuecomment-289978351

dcbw commented 7 years ago

@jbeda can I get some kubelet logs with --loglevel=5?

jbeda commented 7 years ago

@yujuhong -- you mention that you think that this is working as intended. Regardless, kubeadm was depending on this behavior. We introduced a breaking change with #43474. We can talk about the right way to fix this for 1.7 but, for now, we need to get kubeadm working again.

jbeda commented 7 years ago

Slack discussion ongoing now -- https://kubernetes.slack.com/archives/C09QYUH5W/p1490803144368246

jbeda commented 7 years ago

It looks like DaemonSets will still get scheduled even if the node is not ready. This is really, in this case, kubeadm being a little too paranoid.

The current plan that we are going to test out is to have kubeadm no longer wait for the master node to be ready but instead just have it be registered. This should be good enough to let a CNI DaemonSet be scheduled to set up CNI.

@kensimon is testing this out.
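
A quick way to sanity-check that plan, sketched with standard kubectl commands (node and pod names will of course differ per cluster):

# The master should be registered (even while NotReady), and the CNI DaemonSet
# pod should still get scheduled onto it once its manifest is applied.
kubectl get nodes
kubectl -n kube-system get pods -o wide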

dcbw commented 7 years ago

@jbeda yeah, looks like the DaemonSet controller will still enqueue them mainly because it's completely ignorant of network-iness. We should really fix this more generally. Is there anything immediate to do in kube or is it all in kubeadm for now?

luhkevin commented 7 years ago

I'm trying to install kubernetes with kubeadm on Ubuntu 16.04. Is there a quick fix for this?

stevenbower commented 7 years ago

@jbeda if you have a patched version, I'm happy to test it.

kensimon commented 7 years ago

I have kubeadm getting past the node's NotReady status, but the dummy deployment it creates isn't working due to the node.alpha.kubernetes.io/notReady taint preventing it from running. Adding tolerations doesn't seem to help; I'm not exactly sure how to proceed at this point. Can anybody shed some light on how to deploy a pod that tolerates the notReady taint?

I'm exploring some other options like not marking the node as notReady, but it's not clear that's what we want to do.
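
For anyone poking at the same taint, this is the kind of inspection I mean; the node name is a placeholder, and the second command strips every taint with that key (only as a throwaway experiment to confirm the taint is what blocks the deployment, not a real fix):

# <master-node> is a placeholder; the trailing '-' removes all taints with that key.
kubectl describe node <master-node> | grep -i -A2 taints
kubectl taint nodes <master-node> node.alpha.kubernetes.io/notReady-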

sbezverk commented 7 years ago

We worked around it by removing KUBELET_NETWORK_ARGS from the kubelet command line. After that, kubeadm init worked fine and we were able to install the Canal CNI plugin.

overip commented 7 years ago

@sbezverk would you please describe how to do that?

coeki commented 7 years ago

Can confirm @sbezverk's findings (good find :)): adjusting /etc/systemd/system/10-kubeadm.conf and removing KUBELET_NETWORK_ARGS will make it run on CentOS. Tested with Weave.

sbezverk commented 7 years ago

@overip you need to edit /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_EXTRA_ARGS

remove $KUBELET_NETWORK_ARGS

and then restart kubelet. After that, kubeadm init should work.
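
Condensed into commands (the sed invocation is my own shorthand for the manual edit; double-check the drop-in before restarting):

# Strip the variable from the ExecStart line shown above, reload, restart, re-init.
sed -i 's/\$KUBELET_NETWORK_ARGS //' /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
systemctl daemon-reload
systemctl restart kubelet
kubeadm init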

jp557198 commented 7 years ago

This is what I did:

kubeadm reset

remove ENV entries from:

/etc/systemd/system/kubelet.service.d/10-kubeadm.conf

reload systemd and kube services

systemctl daemon-reload
systemctl restart kubelet.service

re-run init

kubeadm init

coeki commented 7 years ago

All correct, and while we're at it

If you see this: kubelet: error: failed to run Kubelet: failed to create kubelet: misconfiguration: kubelet cgroup driver: "cgroupfs" is different from docker cgroup driver: "systemd"

you have to edit your /etc/systemd/system/kubelet.service.d/10-kubeadm.conf and add the flag --cgroup-driver="systemd"

and then run the same sequence as above:

kubeadm reset
systemctl daemon-reload
systemctl restart kubelet.service
kubeadm init
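
A minimal sketch of that edit in command form. It relies on the fact that $KUBELET_EXTRA_ARGS is already referenced at the end of the ExecStart line shown earlier, so an extra Environment entry is one way to inject the flag; if you already set KUBELET_EXTRA_ARGS, merge the flag into that line instead:

# Append a drop-in Environment entry carrying the cgroup driver flag, then reload.
cat >> /etc/systemd/system/kubelet.service.d/10-kubeadm.conf <<'EOF'
Environment="KUBELET_EXTRA_ARGS=--cgroup-driver=systemd"
EOF
systemctl daemon-reload
systemctl restart kubelet.service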

kensimon commented 7 years ago

I'd be careful removing --network-plugin=cni from the kubelet CLI flags; this causes kubelet to use the no_op plugin by default... I would be surprised if common plugins like Calico or Weave would even work in this case (but then again, my understanding of how these plugins operate underneath is a bit limited).

sbezverk commented 7 years ago

@kensimon Hm, I have not seen any issues on my setup; I deployed the Canal CNI plugin and it worked fine.

resouer commented 7 years ago

@sbezverk Is cross host networking also working well?

sbezverk commented 7 years ago

@resouer I cannot confirm; I only have 1.6.0 as an all-in-one.

coeki commented 7 years ago

@resouer @sbezverk I successfully joined a machine.

 [root@deploy-01 x86_64]# kubectl get nodes
 NAME        STATUS    AGE       VERSION
 deploy-01   Ready     51m       v1.6.0
 master-01   Ready     4m        v1.6.0

 NAME                                    READY     STATUS    RESTARTS   AGE
 etcd-deploy-01                          1/1       Running   0          50m
 kube-apiserver-deploy-01                1/1       Running   0          51m
 kube-controller-manager-deploy-01       1/1       Running   0          50m
 kube-dns-3913472980-6plgh               3/3       Running   0          51m
 kube-proxy-mbvdh                        1/1       Running   0          4m
 kube-proxy-rmp36                        1/1       Running   0          51m
 kube-scheduler-deploy-01                1/1       Running   0          50m
 kubernetes-dashboard-2396447444-fm8cz   1/1       Running   0          24m
 weave-net-3t487                         2/2       Running   0          44m
 weave-net-hhcqp                         2/2       Running   0          4m

stevenbower commented 7 years ago

The workaround works, but I can't get flannel going...

sbezverk commented 7 years ago

@stevenbower Worst-case scenario, you can put this setting back and restart kubelet once you are done with the kubeadm business.

webwurst commented 7 years ago

I got a three node cluster with weave working. Not sure how stable this might be with this hack, but thanks anyway! :smiley:

coeki commented 7 years ago

On a side note, you can put $KUBELET_NETWORK_ARGS back after the init on the master passes. I actually did not remove it on the machine I joined, only the cgroup-driver flag; otherwise kubelet and docker won't work together.

But you don't have to kubeadm reset; just change /etc/systemd/system/kubelet.service.d/10-kubeadm.conf and do the systemctl dance:

systemctl daemon-reload
systemctl restart kubelet.service

kensimon commented 7 years ago

To those of you who are dropping KUBELET_NETWORK_ARGS and reporting it works for you:

For me, I have 2 clusters up: one where I applied the patch from https://github.com/kubernetes/kubernetes/pull/43824 and let kubeadm proceed normally on initialization, and one with KUBELET_NETWORK_ARGS deleted. On the cluster with KUBELET_NETWORK_ARGS removed, any traffic between pods fails.

On a cluster with KUBELET_NETWORK_ARGS removed:

$ kubectl run nginx --image=nginx
deployment "nginx" created
$ kubectl expose deployment nginx --port 80
service "nginx" exposed
$ kubectl run --rm -i -t ephemeral --image=busybox -- /bin/sh -l
If you don't see a command prompt, try pressing enter.
/ # wget nginx.default.svc.cluster.local
wget: bad address 'nginx.default.svc.cluster.local'

On a cluster with normal KUBELET_NETWORK_ARGS but with a patched kubeadm:

$ kubectl run nginx --image=nginx          
deployment "nginx" created
$ kubectl expose deployment nginx --port 80
service "nginx" exposed
$ kubectl run --rm -i -t ephemeral --image=busybox -- /bin/sh -l
If you don't see a command prompt, try pressing enter.
/ # wget nginx.default.svc.cluster.local
Connecting to nginx.default.svc.cluster.local (10.109.159.41:80)
index.html           100% |********************************************************************************************|   612   0:00:00 ETA

If you're one of those who disabled KUBELET_NETWORK_ARGS, check if the above works for you.

mikedanese commented 7 years ago

I suggest that we drop both the node ready and the dummy deployment check altogether for 1.6 and move them to a validation phase for 1.7.

LaurentDumont commented 7 years ago

Anyone else running Ubuntu 16.04? I've removed KUBELET_NETWORK_ARGS from the systemd service and reloaded the systemd daemon. I can bring up a master node but cannot join a node; it fails with the error "The requested resource is unavailable".

yujuhong commented 7 years ago

KUBELET_NETWORK_ARGS removed, any traffic between pods fails.

I wouldn't be surprised, since KUBELET_NETWORK_ARGS tells the kubelet which plugin to use and where to look for the configuration and binaries. You NEED them.
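
For reference, the kubeadm drop-in typically defines it roughly like this (exact paths can vary by distro and packaging, so treat this as illustrative):

Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"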

mikedanese commented 7 years ago

I recommend #43835 (and the 1.6 cherry pick #43837) as the fix we make for 1.6. I tested both and they work. I've assigned @jbeda and @luxas to review when they wake up.

jbeda commented 7 years ago

Both of those PRs look reasonable. But I think we should look at going with https://github.com/kubernetes/kubernetes/pull/43824 instead. While it is a bit more complicated, it does preserve that code path so that users who preconfigure CNI outside of a DaemonSet (I do this in https://github.com/jbeda/kubeadm-gce-tf, though I haven't updated it to 1.6) still wait for nodes to be ready.

As a bonus, this is @kensimon's first PR to Kubernetes and he has pulled out all the stops to test this stuff out. But, to be honest, both are workable and I really want to see this fixed. :)

mikedanese commented 7 years ago

Sorry I missed https://github.com/kubernetes/kubernetes/pull/43824. I'm also happy with either if they both work.

WIZARD-CXY commented 7 years ago

I'm also happy with either if they both work too

webwurst commented 7 years ago

@kensimon It works for me when I only disable KUBELET_NETWORK_ARGS during kubeadm init. Thanks to your instructions I could verify that.

coeki commented 7 years ago

Confirming @webwurst's result: it works when you only disable KUBELET_NETWORK_ARGS during kubeadm init. I had to restart kube-dns for it to pick up the change, though. The check from @kensimon works; DNS resolves.

I agree, though, that this is a terrible hack, and too horrible for most people to follow, judging by the Slack channels.

A better solution is presented by the patches from @kensimon or @mikedanese.

patte commented 7 years ago

@coeki how exactly did you restart kube-dns? I tried kubectl delete pod kube-dns-3913472980-l771x --namespace=kube-system and now kube-dns stays on Pending:

kube-system   kube-dns-3913472980-7jrwm   0/3   Pending   0   4m

I did exactly as described: remove KUBELET_NETWORK_ARGS, run sudo systemctl daemon-reload && sudo systemctl restart kubelet.service, run kubeadm init, add KUBELET_NETWORK_ARGS back, and again run sudo systemctl daemon-reload && sudo systemctl restart kubelet.service. But then my master stays in NotReady. In describe I get:

Conditions:
  Type          Status  LastHeartbeatTime           LastTransitionTime          Reason              Message
  ----          ------  -----------------           ------------------          ------              -------
...
KubeletNotReady         runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

I tried the kube-dns restart as described above, but no success. Any ideas? I'm dying on this, trying to get our cluster running again after a failed 1.6.0 upgrade this morning :(

coeki commented 7 years ago

@patte I just run kubectl delete pods kube-dns-3913472980-3ljfx -n kube-system, and then kube-dns comes up again.
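
If you don't want to hunt down the generated pod name, deleting by label should do the same thing (label taken from the standard kube-dns manifest; adjust if yours differs):

# Delete the kube-dns pod(s) by label so the Deployment recreates them.
kubectl -n kube-system delete pod -l k8s-app=kube-dns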

Did you, after kubeadm init, adding KUBELET_NETWORK_ARGS back, and again running sudo systemctl daemon-reload && sudo systemctl restart kubelet.service, install a pod network like Weave or Calico? Add that first; you should then be able to get it to work.

I tried and tested this on CentOS 7 and just did it on ubuntu/xenial, so it should work.

To recap what I did:

remove KUBELET_NETWORK_ARGS, then:
sudo systemctl daemon-reload && sudo systemctl restart kubelet.service

kubeadm init --token=$TOKEN --apiserver-advertise-address=$(your apiserver ip address)

add KUBELET_NETWORK_ARGS back, then again:
sudo systemctl daemon-reload && sudo systemctl restart kubelet.service

kubectl apply -f https://git.io/weave-kube-1.6

Joined a machine; you typically have to add a static route to 10.96.0.0 (the cluster IP) since I am on Vagrant. Use the $TOKEN when joining, but this step is extra.

Then:

kubectl delete pods kube-dns-3913472980-3ljfx -n kube-system

Wait for it, then:

kubectl run nginx --image=nginx
kubectl expose deployment nginx --port 80
kubectl run --rm -i -t ephemeral --image=busybox -- /bin/sh -l
/ # wget nginx.default.svc.cluster.local
Connecting to nginx.default.svc.cluster.local (10.101.169.36:80)
index.html           100% |***********************************************************************|   612   0:00:00 ETA

Works for me, although it's a horrible hack ;)

sbezverk commented 7 years ago

I am really surprised that the Kubernetes development community has not provided any ETA for an official fix. I mean, this is a horrible bug that should have easily been caught during code testing. Since it wasn't, at the very least 1.6.1 should be pushed ASAP with the fix so people can stop hacking their clusters and start doing productive things ;). Am I wrong here?

thockin commented 7 years ago

Hey all,

I was a little distracted this week, and this is a long thread full of kubeadm stuff I don't know that well. Can someone summarize for me? I think I get the gist of the bug, but what are the proposed solutions and what makes them horrible?

mikedanese commented 7 years ago

A change in kubelet (#43474) caused kubelet to start correctly reporting network not ready before the cni plugin was initialized. This broke some ordering that we were depending on and caused a deadlock in kubeadm master initialization. We didn't catch it because kubeadm e2e tests had been broken for a few days leading up to this change.

Current proposed fixes are #43824 and #43835.

thockin commented 7 years ago

OK, that was what I understood. The interlock between the network plugin coming up and node readiness is a little awful right now.

mikedanese commented 7 years ago

I still prefer #43835. It's a simpler change, I don't think the e2e checks should be done where they are, and there are reports of #43824 not working still. I'm going to push to get this resolved today.

sbezverk commented 7 years ago

+1 to resolving this today, as a lot of effort is being wasted dealing with collateral from the workaround.

Dieken commented 7 years ago

I can't believe nobody really tried kubeadm 1.6.0 before 1.6.0 was released....

And kubelet 1.5.6 + kubeadm 1.5.6 are also broken: /etc/systemd/system/kubelet.service.d/10-kubeadm.conf references /etc/kubernetes/pki/ca.crt, but kubeadm doesn't generate ca.crt; there is a ca.pem, though.
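
Purely as an illustration of that mismatch (untested, and not an endorsed fix), a local symlink would satisfy the path the drop-in references:

# Point the expected ca.crt path at the ca.pem that kubeadm actually generated.
ln -s /etc/kubernetes/pki/ca.pem /etc/kubernetes/pki/ca.crt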

Currently 1.6.0 and 1.5.6 are the only releases left in the k8s apt repository...

Dieken commented 7 years ago

"broken out of the box", words learned today.

yujuhong commented 7 years ago

I still prefer #43835. It's a simpler change, I don't think the e2e checks should be done where they are, and there are reports of #43824 not working still. I'm going to push to get this resolved today.

Agree with Mike on this one. #43835 is the simpler change, and validation (if needed) can be done in a separate phase.

dcbw commented 7 years ago

@thockin we really need finer-grained conditions and status for networking, especially with hostNetwork:true. Nodes can be ready for some pods, but not others.

We can't use NodeNetworkUnavailable, because that's specific to cloud providers. We probably need another condition, or a way for the scheduler to allow hostNetwork:true pods on nodes with NetworkReady:false, or to make the taints work for unready nodes. And working kubeadm e2e tests :(

thockin commented 7 years ago

Agreed. I have been delaying the problem because I had no great answers, but we need to get this in 1.7.
