chrisz100 opened 5 years ago
/area networking
We are facing an issue with burst loads of pods spawning (for example with Airflow): nodes spawned by the autoscaler can end up not running kube-proxy when the max-pods limit on the node has already been reached.
This happens in combination with aws-vpc-cni.
To my understanding, kube-proxy is defined in a static manifest file and its pod spec carries the scheduler.alpha.kubernetes.io/critical-pod annotation. The kubelet uses this annotation to check whether the pod is critical and, if so, invokes the CriticalPodAdmissionHandler to try to make room for the pod when the node's resources are exhausted. If HandleAdmissionFailure fails for any reason, the kube-proxy pod will never be started on the node, because there is no retry: pods defined in a manifest file run outside of any reconcile loop that would eventually retry them. Is my understanding correct?
I'm asking because we're investigating an issue on a node where kube-proxy failed to start due to a lack of CPU/memory resources: HandleAdmissionFailure was triggered, but it failed to kill one of the pods, blocking kube-proxy pod admission forever.
I've also opened an issue on Kubernetes to gather some more feedback on the topic. I would like to better understand how it works under the hood, in order to be able to actively work on a tentative solution: https://github.com/kubernetes/kubernetes/issues/78405
@justinsb @mikesplain Do you have any thought or experience on this?
We are facing the same issue when we have burst loads of pods spawning. In our case it happens in combination with Cilium, which depends on the kube-proxy pod being initialized, so we lose the entire node with this error:
Failed to admit pod kube-proxy-ip-172-16-123-131.ec2.internal_kube-system(031b2b73a3f2f24dc9e9188a18e86330) - Unexpected error while attempting to recover from admission failure: preemption: pod filebeat-nczgs_filebeat(0a23fbea-87c3-11e9-9f55-0e2b991d93be) failed to evict failed to "KillPodSandbox" for "0a23fbea-87c3-11e9-9f55-0e2b991d93be" with KillPodSandboxError: "rpc error: code = Unknown desc = NetworkPlugin cni failed to teardown pod \"filebeat-nczgs_filebeat\" network: failed to find plugin \"cilium-cni\" in path [/opt/cni/bin/]"
I'm following PR kubernetes/kubernetes#78493, which has already been approved, but we really need a stopgap in the meantime. Do you see any temporary fix?
@luanguimaraesla I'm not aware of reliable workarounds to avoid it, but if you find any I would be very interested as well. Looking at the kubelet code, once a static pod fails to get admitted it will never be tried again, so a very hacky workaround could be to avoid that condition in the first place (i.e. make the node unschedulable until kube-proxy is ready, then make it schedulable again; see the sketch below), but I haven't tried it and I'm not sure it would work.
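For what it's worth, a minimal sketch of that idea, assuming a node bootstrap hook that can patch the Node object: "kubectl cordon" is nothing more than setting spec.unschedulable, so a hook could apply this patch and revert it once the node's kube-proxy pod reports Ready. The node name is just the example from the error log above, and this is untested, as noted.

```yaml
# Hypothetical sketch of the cordon-until-kube-proxy-is-ready workaround.
# "kubectl cordon <node>" is equivalent to setting spec.unschedulable;
# a bootstrap hook could apply this and flip it back to false once the
# kube-proxy pod on the node is Ready. Untested.
apiVersion: v1
kind: Node
metadata:
  name: ip-172-16-123-131.ec2.internal   # example node name, taken from the error log above
spec:
  unschedulable: true   # what "kubectl cordon" sets; revert once kube-proxy is Ready
```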
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle rotten /lifecycle frozen
Some CNI plugins, like Cilium and Calico, are adding eBPF support. Deploying kube-proxy as a DaemonSet would make it possible to disable it and enable eBPF support at the same time, without the need for a rolling update or downtime.
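As a sketch of what that could look like on the kops side, the cluster spec has a kubeProxy section where it can be turned off; the field layout below is as I understand the kops API and is worth double-checking against the kops docs:

```yaml
# Sketch: disabling kube-proxy in the kops cluster spec so a CNI's
# eBPF-based replacement can take over. Field layout as I understand
# the kops API; verify against the kops documentation.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: example.k8s.local   # hypothetical cluster name
spec:
  kubeProxy:
    enabled: false   # only safe when the CNI provides a kube-proxy replacement
```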
Looks like this issue can be closed now: kube-proxy is deployed as a DaemonSet via addons.
1. Describe IN DETAIL the feature/behavior/change you would like to see. Kubernetes best practices state that kube-proxy is best deployed as a DaemonSet. As kops currently places the kube-proxy manifest into /etc/kubernetes/manifests as a static file, this goes against that best practice.
2. Feel free to provide a design supporting your feature request. We could deploy kube-proxy using the addon mechanisms; a rough sketch follows.
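A rough sketch of the shape such an addon manifest might take; the values are illustrative, not the actual kops addon:

```yaml
# Illustrative sketch only; the actual kops addon manifest will differ.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-proxy
  namespace: kube-system
  labels:
    k8s-app: kube-proxy
spec:
  selector:
    matchLabels:
      k8s-app: kube-proxy
  template:
    metadata:
      labels:
        k8s-app: kube-proxy
    spec:
      priorityClassName: system-node-critical   # replaces the old critical-pod annotation
      hostNetwork: true
      tolerations:
      - operator: Exists   # schedule on every node, whatever its taints
      containers:
      - name: kube-proxy
        image: k8s.gcr.io/kube-proxy:v1.16.0   # hypothetical version
        command:
        - /usr/local/bin/kube-proxy
        - --config=/var/lib/kube-proxy/config.yaml
        securityContext:
          privileged: true
```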