kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

New nodes added to cluster are not able to serve pods #2923

Closed doronoffir closed 6 years ago

doronoffir commented 7 years ago

Hi Guys,

We have a k8s 1.6.2 cluster running on AWS, created with kops 1.6.2.

We have several node pools for the different application roles, such as DBs, workers, and proxies. We also use spot instances for the workers, managed by Spotinst with their customized k8s autoscaling pod. We tested on staging clusters running k8s 1.5.6 and 1.6.0 and did not encounter the issue I'll describe. The issue presented itself in the production cluster (k8s 1.6.2). Due to some performance issues early on, we scaled out the cluster to almost 120 nodes; that's when the problems started.

Expected Behavior

The workers node pool consists of spot instances managed by Spotinst; a customized autoscaler pod sends scaling requests to Spotinst. When a new node is added to the pool, it should start serving the cluster's resource needs.

Current Behavior

About 40% of newly added nodes seem to have a Flannel issue. The node joins the cluster and is reported healthy, but pods scheduled to it get stuck in "ContainerCreating". Examining the pod events shows the following error:

Warning FailedSync Error syncing pod, skipping: failed to "CreatePodSandbox" for "analyzer-315055607-kdn0x_default(ac80c569-670c-11e7-bcbc-0a7044fda3e6)" with CreatePodSandboxError: "CreatePodSandbox for pod \"analyzer-315055607-kdn0x_default(ac80c569-670c-11e7-bcbc-0a7044fda3e6)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"analyzer-315055607-kdn0x_default\" network: open /run/flannel/subnet.env: no such file or directory"

Restarting/deleting the pod did not solve this.
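For reference, this is roughly how we spot affected nodes; it is only a diagnostic sketch, and the pod name below is just the one from the error message above (substitute your own):

```
# List pods stuck in ContainerCreating, along with the node they were scheduled to
kubectl get pods --all-namespaces -o wide | grep ContainerCreating

# Show recent events for one stuck pod; the CNI error appears at the bottom
kubectl describe pod analyzer-315055607-kdn0x -n default | tail -n 20

# On the suspect node (via SSH), check whether flannel ever wrote its subnet file;
# on affected nodes this file is missing, matching the error in the pod events
cat /run/flannel/subnet.env
```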

Possible Solution

Our initial mitigation was to reboot the problematic node. That helped for the first few nodes, but not for most of the failed ones; we had added a large number of nodes at once, about 50, and about 20 of them failed. Our next step was terminating the failed nodes, which had a similar effect: it helped for some, but in most cases the replacement node had the same issue. To cut a long story short, we ended up with the workaround of restarting the Flannel pod on the failed nodes, and this solved the problem.
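For anyone hitting the same thing, a minimal sketch of that workaround; the node name is a placeholder and the flannel pod name will vary per cluster:

```
NODE=ip-10-0-1-23.ec2.internal   # placeholder: name of the affected node

# Find the flannel pod running on that node
kubectl -n kube-system get pods -o wide | grep flannel | grep "$NODE"

# Delete it; the DaemonSet controller recreates it, and on restart it writes
# /run/flannel/subnet.env so CNI setup on that node starts working again
kubectl -n kube-system delete pod <flannel-pod-on-that-node>

# Pods stuck in ContainerCreating on that node should then start; if not,
# delete them so they are recreated by their controllers
```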

Steps to Reproduce (for bugs)

For us, adding a batch of 30 servers to the cluster reproduces it; at least 10 of them will present this issue.

Context

Since we need to monitor node creation, I cannot really trust the autoscaler or any other "healing" procedure that requires a node restart or replacement.

Your Environment

A k8s cluster version 1.6.2, created by KOPS 1.6.2, running on AWS.

Thank you!

erez-rabih commented 7 years ago

This issue is probably because of this flannel issue: https://github.com/coreos/flannel/issues/719

It was resolved in flannel v0.8.0 with this PR https://github.com/coreos/flannel/pull/729

These are the steps that helped us resolve this issue:

1. Delete the flannel daemon set
2. Take https://github.com/kubernetes/kops/blob/1.7.0/upup/models/cloudup/resources/addons/networking.flannel/k8s-1.6.yaml and replace the image version there
3. Create the new daemon set
4. Terminate and bring up all cluster nodes
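Roughly, those steps look like the following. This is a sketch under assumptions: the DaemonSet name and image tag come from the stock kops flannel addon and may differ in your cluster, so check your own manifest first.

```
# 1. Delete the existing flannel DaemonSet (name is an assumption; verify with
#    `kubectl -n kube-system get daemonsets`)
kubectl -n kube-system delete daemonset kube-flannel-ds

# 2. Fetch the kops 1.7.0 flannel addon manifest and bump the image to a
#    release that contains the fix (>= v0.8.0)
curl -LO https://raw.githubusercontent.com/kubernetes/kops/1.7.0/upup/models/cloudup/resources/addons/networking.flannel/k8s-1.6.yaml
sed -i 's|flannel:v0\.[0-9.]*|flannel:v0.8.0|' k8s-1.6.yaml   # adjust to the actual image line

# 3. Recreate the DaemonSet from the edited manifest
kubectl apply -f k8s-1.6.yaml

# 4. Terminate and bring up all cluster nodes (e.g. via a rolling replacement)
```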

fejta-bot commented 6 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta. /lifecycle stale

fejta-bot commented 6 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten /remove-lifecycle stale

fejta-bot commented 6 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close