Open LuiGuest opened 1 month ago
/sig node
@LuiGuest
it might be useful to adjust the behavior of the endpoints controller to immediately mark the endpoint as serving=false
when any container in the pod exits during the termination process.
This change could help in reflecting the pod's status more accurately and promptly.
/sig network
As I recall, terminating endpoints should handle service requests. But I may be wrong @aojea, @danwinship ?
I agree with that, but after a container exits, especially the request-handler container, we can be sure that it is not serving anymore.
I don't think things outside of kubelet are really expected to notice individual containers within a pod; the Pod is an abstraction barrier that other controllers don't normally peek inside...
Why is one container exiting and another not? Is one of them staying behind to do cleanup or something like that? I can see use cases for this but I'm not sure the APIs are intended to support things like this currently.
I think if a container that's handling requests exits, it's important that the endpoint stops serving traffic quickly.
While Kubernetes currently focuses on the pod as a whole, it might be worth exploring a feature that allows quicker updates to the endpoint status when key containers exit.
Kubernetes treats the pod as a single unit, so it doesn't normally check the status of individual containers. But in cases where containers have different roles, it might be useful to explore adding more control over when an endpoint should stop serving traffic based on specific container behavior.
"I don't think things outside of kubelet are really expected to notice individual containers within a pod; the Pod is an abstraction barrier that other controllers don't normally peek inside..."
I think a terminating POD is different. For example, Kubernetes already knows not to restart a container that exited in a terminating POD. I think it could similarly remove the POD from the serving endpoints list, because it is probable that the POD is not capable of serving traffic normally in that state. At least there should be a way of telling Kubernetes which containers are absolutely needed to handle requests.
"Why is one container exiting and another not? Is one of them staying behind to do cleanup or something like that? I can see use cases for this but I'm not sure the APIs are intended to support things like this currently."
Because all the other containers are sidecars to the payload-handler container, and we wanted to make sure that no container exits before the main one.
For example kubernetes already knows not to restart a container that exited in a terminating POD.
Right, but that's kubelet that's doing the (not) restarting. I was saying, kubelet is aware of individual containers within a pod, but right now, other components generally aren't.
I'm leaving this as needs-triage so it will get discussed at the SIG Network meeting on Thursday...
@thockin any thoughts on what we could or should do here? (I had wanted to discuss this in bug triage but you weren't here this week...)
/assign @thockin
/cc @aojea
My initial reaction was the same as @danwinship, so I was heartened to see him agree. There are a lot of problems with this proposal.
First, a pod can have multiple ports serving, so the exit of one container could cause another to stop serving.
Second, Kubelet (and everyone else) may not even be aware of which container is serving a given port. The use of Pod.spec.containers[].ports is generally informational and, in my experience, seldom used. Even if it were widely used...
Third, the EndpointSlice API is a "matrix" of endpoints on one axis and ports on the other. There's no way to indicate that port X of endpoint A is serving while port Y is not, short of breaking it into 2 slices.
One thing we could do, maybe, is use the container-exit + pod-terminating as a trigger to run the readiness probe "right now". If the pod has a threshold greater than 1, though, we still have to wait. Unless we change the probing logic.
I'm not against a proposal in this direction, but I will say that it seems somewhat lower priority, given that the pod is going down imminently anyway.
/triage accepted
What happened?
When a POD which is part of a Service starts graceful termination and the main container exits, the related endpoint in the EndpointSlice does not go to serving=false. Only after the readiness probe fails does the endpoint enter the serving=false state, which doesn't make sense in my opinion.
I also tried watching for POD changes and determining that the container exited based on the container_statuses list, but there I hit this issue: https://github.com/kubernetes/kubernetes/issues/106896
Then I tried removing the label by which the Service selects the POD as an endpoint, but then I saw a ~5s delay until Kubernetes noticed that the endpoint had disappeared.
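As a back-of-the-envelope sketch of the delay described above: if serving only flips to false once the readiness probe has failed failureThreshold consecutive times, the worst-case staleness is roughly the probe period times the threshold. The function name and the probe values (taken from the reproduction steps below: period 5s, failureThreshold 3) are illustrative, not from any Kubernetes API.

```python
def worst_case_serving_delay(period_seconds: int, failure_threshold: int) -> int:
    """Rough upper bound (in seconds) on how long an endpoint can keep
    serving=true after the main container exits, if the flip only happens
    once failureThreshold consecutive readiness probes have failed."""
    return period_seconds * failure_threshold

# With the probe settings from the reproduction steps (5s period, threshold 3):
print(worst_case_serving_delay(5, 3))  # up to 15 seconds of stale serving=true
```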
What did you expect to happen?
The endpoint should go to serving=false when any container exits in a POD that is under graceful termination. The alternatives I tried should also have worked, I think.
How can we reproduce it (as minimally and precisely as possible)?
Create a POD with 2 containers: one which exits immediately after receiving SIGTERM, and another which doesn't, so it is killed by SIGKILL after the termination grace period. The first container should also have a readiness probe that doesn't fail quickly, e.g. period 5s, failureThreshold 3. Make the POD part of a Service. Delete the POD. Check that the endpoint in the EndpointSlice only goes to serving=false after the readiness probe fails.
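A minimal manifest along the lines of the reproduction steps might look like this (all names and images are placeholders, not from the original report):

```yaml
# Hypothetical reproduction sketch: names and images are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: demo-svc
spec:
  selector:
    app: demo
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  labels:
    app: demo
spec:
  terminationGracePeriodSeconds: 30
  containers:
  - name: main                  # exits promptly on SIGTERM
    image: example/handler      # placeholder image
    ports:
    - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 3
  - name: sidecar               # ignores SIGTERM, killed by SIGKILL at grace-period end
    image: example/sidecar      # placeholder image
```

After deleting demo-pod, watching the slice with `kubectl get endpointslices -w -o yaml` should show the endpoint's serving condition staying true until the probe has failed failureThreshold times.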
Anything else we need to know?
No response
Kubernetes version
kubectl version
Client Version: v1.28.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.1
Cloud provider
N/A
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)