kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Impossible to notice without major delay when a container exits while the POD is in terminating state #126546

Open LuiGuest opened 1 month ago

LuiGuest commented 1 month ago

What happened?

When a Pod that is part of a Service starts graceful termination and the main container exits, the related endpoint in the EndpointSlice does not go to serving=false. Only after the readiness probe fails does the endpoint enter the serving=false state, which doesn't make sense in my opinion.

I also tried watching for Pod changes and determining that the container had exited based on the containerStatuses list, but there I hit this issue: https://github.com/kubernetes/kubernetes/issues/106896

Then I tried removing the label that selects the Pod as an endpoint of the Service, but then I experienced a ~5s delay until Kubernetes noticed that the endpoint had disappeared.

What did you expect to happen?

The endpoint should go to serving=false when any container exits in a Pod that is under graceful termination. Also, I think the alternatives I tried should have worked.

How can we reproduce it (as minimally and precisely as possible)?

Create a Pod with two containers: one that exits immediately after receiving SIGTERM, and another that doesn't, so it is killed with SIGKILL after the termination grace period. The first container should also have a readiness probe that doesn't fail quickly, e.g. periodSeconds: 5, failureThreshold: 3. Make the Pod part of a Service. Delete the Pod. Observe that the endpoint in the EndpointSlice only goes to serving=false after the readiness probe fails.
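A minimal sketch of the reproduction described above; the image names, label, probe path, and ports are placeholders, not taken from the original report:

```yaml
# Illustrative repro manifest; images, labels, and probe path are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: term-repro
  labels:
    app: term-repro
spec:
  terminationGracePeriodSeconds: 30
  containers:
  - name: main                      # exits promptly on SIGTERM
    image: example/handler:latest
    readinessProbe:                 # slow to fail: ~15s at 5s * threshold 3
      httpGet: {path: /healthz, port: 8080}
      periodSeconds: 5
      failureThreshold: 3
  - name: sidecar                   # ignores SIGTERM, killed by SIGKILL
    image: example/sidecar:latest
---
apiVersion: v1
kind: Service
metadata:
  name: term-repro
spec:
  selector:
    app: term-repro
  ports:
  - port: 80
    targetPort: 8080
```

After `kubectl delete pod term-repro`, watching the matching EndpointSlice should show the endpoint keeping serving=true until the readiness probe's failure threshold is reached.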

Anything else we need to know?

No response

Kubernetes version

```console
$ kubectl version
Client Version: v1.28.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.1
```

Cloud provider

N/A

OS version

```console
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
```

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

LuiGuest commented 1 month ago

/sig node

pranav-pandey0804 commented 1 month ago

@LuiGuest it might be useful to adjust the behavior of the endpoints controller to immediately mark the endpoint as serving=false when any container in the pod exits during the termination process.
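The rule suggested above could be sketched as a small decision function. This is a hypothetical illustration of the proposed endpoint-controller behavior, not actual controller code; all field names follow Pod status conventions but the function itself is invented:

```python
# Sketch of the proposed rule: while a pod is terminating (deletionTimestamp
# set), any terminated container makes its endpoint non-serving immediately,
# instead of waiting for the readiness probe to fail.
def endpoint_serving(pod: dict) -> bool:
    """Return the proposed 'serving' condition for a pod's endpoint."""
    terminating = pod.get("deletionTimestamp") is not None
    any_exited = any(
        "terminated" in (cs.get("state") or {})
        for cs in pod.get("containerStatuses", [])
    )
    if terminating and any_exited:
        return False  # proposal: stop serving as soon as a container exits
    return pod.get("ready", False)  # otherwise fall back to readiness

# Example: terminating pod whose main container has already exited,
# but whose readiness probe has not failed yet.
pod = {
    "deletionTimestamp": "2024-08-01T00:00:00Z",
    "ready": True,
    "containerStatuses": [
        {"name": "main", "state": {"terminated": {"exitCode": 0}}},
        {"name": "sidecar", "state": {"running": {}}},
    ],
}
print(endpoint_serving(pod))  # False under the proposed rule
```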

pranav-pandey0804 commented 1 month ago

This change could help in reflecting the pod's status more accurately and promptly.

uablrek commented 1 month ago

/sig network

As I recall, terminating endpoints should handle service requests. But I may be wrong @aojea, @danwinship ?

LuiGuest commented 1 month ago

> /sig network
>
> As I recall, terminating endpoints should handle service requests. But I may be wrong @aojea, @danwinship ?

I agree with that, but after a container exits, especially the request-handler container, we can be sure that it is not serving anymore.

danwinship commented 1 month ago

I don't think things outside of kubelet are really expected to notice individual containers within a pod; the Pod is an abstraction barrier that other controllers don't normally peek inside...

Why is one container exiting and another not? Is one of them staying behind to do cleanup or something like that? I can see use cases for this but I'm not sure the APIs are intended to support things like this currently.

pranav-pandey0804 commented 1 month ago

i think if a container that's handling requests exits, it's important that the endpoint stops serving traffic quickly.

pranav-pandey0804 commented 1 month ago

while Kubernetes currently focuses on the pod as a whole, it might be worth exploring a feature that allows for quicker updates to the endpoint status when key containers exit.

pranav-pandey0804 commented 1 month ago

kubernetes treats the pod as a single unit, so it doesn't normally check the status of individual containers. But in cases where containers have different roles, it might be useful to explore adding more control over when an endpoint should stop serving traffic based on specific container behavior.

LuiGuest commented 1 month ago

> I don't think things outside of kubelet are really expected to notice individual containers within a pod; the Pod is an abstraction barrier that other controllers don't normally peek inside...

I think a terminating Pod is different. For example, Kubernetes already knows not to restart a container that exited in a terminating Pod. I think it could similarly remove the Pod from the serving endpoints list, because it is likely that the Pod is not capable of serving traffic normally in that state. At the very least, there should be a way of telling Kubernetes which containers are absolutely needed to handle requests.

> Why is one container exiting and another not? Is one of them staying behind to do cleanup or something like that? I can see use cases for this but I'm not sure the APIs are intended to support things like this currently.

Because all other containers are sidecars to the payload-handler container, and we wanted to make sure that no container exits before the main one.
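The "containers which are absolutely needed to handle requests" idea could be sketched with a hypothetical annotation naming the essential containers. The annotation key and every name here are invented for illustration; nothing like this exists in the API today:

```python
# Hypothetical: an annotation lists containers whose exit should stop serving.
ESSENTIAL_KEY = "example.com/serving-essential-containers"  # invented key

def serving_after_exit(pod: dict) -> bool:
    """Proposed rule: serving=false once any essential container terminates
    while the pod is terminating; non-essential (sidecar) exits are ignored."""
    if pod.get("deletionTimestamp") is None:
        return True
    essential = set(
        pod.get("annotations", {}).get(ESSENTIAL_KEY, "").split(",")
    )
    for cs in pod.get("containerStatuses", []):
        if cs["name"] in essential and "terminated" in cs.get("state", {}):
            return False
    return True

# Terminating pod whose essential "main" container has exited:
pod = {
    "deletionTimestamp": "2024-08-01T00:00:00Z",
    "annotations": {ESSENTIAL_KEY: "main"},
    "containerStatuses": [
        {"name": "main", "state": {"terminated": {"exitCode": 0}}},
        {"name": "sidecar", "state": {"running": {}}},
    ],
}
print(serving_after_exit(pod))  # False: the essential container has exited
```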

danwinship commented 1 month ago

> For example kubernetes already knows not to restart a container that exited in a terminating POD.

Right, but that's kubelet that's doing the (not) restarting. I was saying, kubelet is aware of individual containers within a pod, but right now, other components generally aren't.

I'm leaving this as needs-triage so it will get discussed at the SIG Network meeting on Thursday...

danwinship commented 1 month ago

@thockin any thoughts on what we could or should do here? (I had wanted to discuss this in bug triage but you weren't here this week...)

danwinship commented 1 month ago

/assign @thockin

MikeZappa87 commented 1 month ago

/cc @aojea

thockin commented 1 month ago

My initial reaction was the same as @danwinship's, so I was heartened to see him agree. There are a lot of problems with this proposal.

First, a pod can have multiple ports serving, so the exit of one container could now cause the others to stop serving.

Second, kubelet (and everyone else) may not even be aware of which container is serving a given port. The use of Pod.spec.containers[].ports is generally informational and, in my experience, seldom used. Even if it were widely used...

Third, the EndpointSlice API is a "matrix" of endpoints on one axis and ports on the other. There's no way to indicate that port X of endpoint A is serving while port Y is not, short of breaking it into 2 slices.
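The "matrix" shape is visible in the API itself: ports are declared once per slice and apply to every endpoint, while the serving/ready/terminating conditions are per endpoint, not per port. A trimmed illustrative slice (names and addresses are placeholders):

```yaml
# Illustrative EndpointSlice; addresses and names are placeholders.
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: term-repro-abc12
  labels:
    kubernetes.io/service-name: term-repro
addressType: IPv4
ports:                    # one axis: applies to every endpoint in the slice
- name: http
  port: 8080
- name: metrics
  port: 9090
endpoints:                # other axis: conditions are per endpoint, not per port
- addresses: ["10.0.0.5"]
  conditions:
    ready: true
    serving: true         # cannot express "serving on http but not on metrics"
    terminating: false
```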

One thing we could do, maybe, is use container-exit plus pod-terminating as a trigger to run the readiness probe "right now". If the probe has a failureThreshold greater than 1, though, we still have to wait, unless we change the probing logic.
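The timing consequence of that idea can be sketched with simple arithmetic. This is an illustrative model of probe timing under the stated assumption that every probe fails after the container exits; the function name is invented:

```python
# Even with an immediate extra probe on container-exit, readiness only flips
# after failureThreshold consecutive failures, so with threshold > 1 there is
# still a wait (unless the probing logic itself changes).
def probes_until_not_ready(failure_threshold: int, period_seconds: int,
                           immediate_probe: bool) -> int:
    """Seconds until readiness flips, assuming every probe from now on fails."""
    first_probe = 0 if immediate_probe else period_seconds
    # after the first failing probe, (threshold - 1) more periods must elapse
    return first_probe + (failure_threshold - 1) * period_seconds

# With the reproduction's settings (periodSeconds: 5, failureThreshold: 3):
print(probes_until_not_ready(3, 5, immediate_probe=False))  # 15
print(probes_until_not_ready(3, 5, immediate_probe=True))   # 10
```

So the immediate probe shaves off one period but still leaves a multi-second window whenever the threshold is greater than 1.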

I'm not against a proposal in this direction, but I will say that it seems somewhat lower priority, given that the pod is going down imminently anyway.

kannon92 commented 1 month ago

/triage accepted