sendaie opened this issue 2 years ago
Thanks for this report @sendaie.
Ironically enough, we used to use Envoy's admin UI to do the shutdown over the network, which avoids this problem, but it led us to our first CVE (https://github.com/projectcontour/contour/security/advisories/GHSA-mjp8-x484-pm3r), which the EmptyDir solution was added to solve.
I think we'll need to spend some time investigating how best to handle this tricky set of requirements.
What we have done to work around this is create a container image using Envoy as the base that also includes the contour binary, which is run both on startup (for the initContainer case) and on shutdown. The run happens via a custom entrypoint and a preStop hook that terminates the pod after a few seconds of sleep, allowing for zero-downtime upgrades.
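For readers who want to try something similar, a preStop hook of this shape might help. This is a minimal sketch: the image name, sleep duration, and container layout are illustrative assumptions, and the real setup described above also runs the contour binary on shutdown rather than just sleeping.

```yaml
# Pod spec fragment (sketch only -- image name and sleep duration are
# assumptions, not the exact manifest described in this comment).
containers:
  - name: envoy
    image: my-registry/envoy-with-contour:latest  # hypothetical combined image
    lifecycle:
      preStop:
        exec:
          # Keep the container alive briefly so the load balancer stops
          # sending traffic and in-flight requests drain before SIGTERM.
          command: ["/bin/sh", "-c", "sleep 5"]
```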
For clusters where the Envoy pods autoscale, we have set the externalTrafficPolicy to Cluster to avoid packet loss when nodes running Envoy are removed. This enhancement, when fully merged, would remove that requirement.
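As a sketch, the Service change mentioned above looks like this (field names follow the standard Kubernetes Service spec; the metadata, selector, and ports here are illustrative, not taken from this thread):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: envoy                # illustrative name
  namespace: projectcontour  # illustrative namespace
spec:
  type: LoadBalancer
  # "Cluster" lets any node forward traffic to a ready Envoy pod, so
  # removing a node that runs Envoy does not drop packets; the trade-off
  # versus "Local" is an extra hop and loss of client source-IP
  # preservation.
  externalTrafficPolicy: Cluster
  selector:
    app: envoy
  ports:
    - name: http
      port: 80
      targetPort: 8080
```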
The daemonset is what is blocking you on the volume bits. Have a look to see if the deployment model would work. There's an example in the examples dir. It will allow for a clean termination when draining a node.
@stevesloka We moved from a Daemonset to a Deployment to address the issues around clean termination but ran into draining issues due to the emptyDir.
Moving everything into one container makes it almost like other ingress controller setups, with only one container and no emptyDir mounts.
I'm surprised you get the emptydir notice with a deployment, I'm not aware of that happening.
What version of k8s are you on?
We are using 1.21.
The default drain options do not allow evicting pods that have emptyDir mounts.
There is some discussion in https://github.com/kubernetes/kubernetes/issues/80228
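For reference, kubectl drain refuses by default to evict pods that use emptyDir volumes and must be told explicitly to delete that data. A rough CLI sketch (the node name is a placeholder; the flag is named --delete-emptydir-data in recent kubectl releases and was --delete-local-data in older ones):

```shell
# By default, drain fails on pods with emptyDir volumes:
kubectl drain <node-name> --ignore-daemonsets
# error: cannot delete Pods with local storage ...

# Opting in to deleting the emptyDir contents lets the drain proceed:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```

Managed node-removal flows (such as GKE upgrades) may not pass this opt-in, which is consistent with the behavior described in this issue.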
The Contour project currently lacks enough contributors to adequately respond to all Issues.
This bot triages Issues according to the following rules:
You can:
Please send feedback to the #contour channel in the Kubernetes Slack
Hello Contour community!
We run Contour in GKE behind an L4 Internal TCP Load Balancer (brought up by GKE via our Envoy Service of Type: LoadBalancer with the annotation "cloud.google.com/load-balancer-type: Internal"). Our issue is that the Envoy pods in our Contour setup are not getting evacuated/deleted during a drain, either because they run as a daemonset, or, when we switch to a deployment, because the pod is using local storage:
The local storage in this case is the emptyDir definition for the "/config" directory, which is shared between Envoy, the initContainer, and the shutdown-manager.
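The shared volume in question looks roughly like this in the deployment manifests (a sketch; the volume name and mount details here are illustrative and may differ from the actual Contour examples):

```yaml
# Pod spec fragment: an emptyDir shared by envoy, the initContainer, and
# shutdown-manager -- this is the "local storage" that blocks the drain.
volumes:
  - name: envoy-config     # illustrative name
    emptyDir: {}
# Each container then mounts it, e.g.:
#   volumeMounts:
#     - name: envoy-config
#       mountPath: /config
```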
What we experience is that whenever an automatic node removal takes place (cluster upgrade, node pool rollover, autoscaler scale-down), our L4 Internal Load Balancer keeps forwarding traffic to this pod/node until the LB's health check fails (which takes up to 3 x 8 = 24 seconds, and this is out of our control).
We assume that if the Pod deletion succeeded, that could signal to the control plane that this Pod is gone, which would make the GKE L4 LB stop forwarding traffic to it.
We can reproduce the problem and are wondering whether anyone else has experienced this, and whether there is a setup where Contour doesn't use local storage.
Thank you, sendai