kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

A mechanism to fail a Pod that is stuck due to an invalid Image #122300

Open · alculquicondor opened this issue 11 months ago

alculquicondor commented 11 months ago

What would you like to be added?

A mechanism to set a Pod to phase=Failed when the image pull has failed a certain number of times (perhaps configurable). Currently, the Pod just stays in phase=Pending.
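As a very rough sketch of the shape such a knob could take (the field below is purely hypothetical; whether it belongs in the PodSpec, the kubelet configuration, or somewhere else is exactly what a KEP would have to decide):

```go
// Hypothetical sketch only; no such field exists in the Kubernetes API today.
// One possible shape, modeled loosely on a Job's backoffLimit: a per-pod cap
// on failed image pull attempts, after which the kubelet would set
// phase=Failed instead of leaving the pod Pending in ImagePullBackOff.
type PodSpec struct {
	// ... existing PodSpec fields ...

	// ImagePullBackoffLimit, if set, is the maximum number of failed image
	// pull attempts per container before the pod is marked Failed.
	// (Hypothetical: the name, scope, and default are all open questions.)
	ImagePullBackoffLimit *int32 `json:"imagePullBackoffLimit,omitempty"`
}
```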

Why is this needed?

This is especially problematic for Jobs submitted through a queueing system.

In a queued environment, the time when the job starts running (pods created) might be hours or even days after the Job is created. Then, the user that submitted the job might not realize their mistake until it's too late. Since these Pods block resources in the cluster, it might cause other pending Jobs not to start.

If the Pods stay in phase=Pending, the job controller cannot do anything about them, as it only does "failure handling" once the Pods actually terminate with a phase=Failed.
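For illustration, if such Pods did eventually reach phase=Failed and carried a dedicated condition, a Job's existing podFailurePolicy could stop retrying them. A sketch, where the "ImagePullFailed" condition type is hypothetical:

```go
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// Sketch: if pods stuck on image pulls were eventually marked Failed and
// carried a dedicated condition (the "ImagePullFailed" type below is
// hypothetical), a Job could opt out of recreating them via the existing
// podFailurePolicy API.
func examplePodFailurePolicy() *batchv1.PodFailurePolicy {
	return &batchv1.PodFailurePolicy{
		Rules: []batchv1.PodFailurePolicyRule{
			{
				// Fail the whole Job instead of recreating pods that will
				// never be able to pull their image.
				Action: batchv1.PodFailurePolicyActionFailJob,
				OnPodConditions: []batchv1.PodFailurePolicyOnPodConditionsPattern{
					{
						Type:   corev1.PodConditionType("ImagePullFailed"), // hypothetical condition type
						Status: corev1.ConditionTrue,
					},
				},
			},
		},
	}
}
```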

alculquicondor commented 11 months ago

/wg batch
/sig node

k8s-ci-robot commented 11 months ago

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
alculquicondor commented 11 months ago

cc @kannon92

alculquicondor commented 11 months ago

Prior work was in https://github.com/kubernetes/enhancements/pull/3816

kannon92 commented 11 months ago

More general issue was here

BenTheElder commented 11 months ago

xref: https://github.com/kubernetes/test-infra/blob/954bd25d267d71e210848cd2783410c4297e7ce2/prow/apis/prowjobs/v1/types.go#L532-L540

Prow sees incorrect images fairly often and catch-all podPendingTimeout has been helpful to mitigate this. We have a controller-wide default with per-ProwJob override.
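(For anyone not familiar with Prow's setup, the catch-all is conceptually just "if the pod has been Pending longer than X, give up on it". Below is a minimal out-of-tree sketch of that check with client-go; Prow's actual implementation lives behind the xref above and fails the owning ProwJob rather than merely deleting the pod.)

```go
package example

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Rough sketch of a catch-all "pending timeout" mitigation that any
// controller can apply today: if a pod has sat in phase=Pending longer than
// the timeout, stop waiting for it. Here "give up" is just deletion.
func reapStuckPendingPods(ctx context.Context, c kubernetes.Interface, ns string, timeout time.Duration) error {
	pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		if p.Status.Phase != corev1.PodPending {
			continue
		}
		if time.Since(p.CreationTimestamp.Time) > timeout {
			if err := c.CoreV1().Pods(ns).Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
				return err
			}
		}
	}
	return nil
}
```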

kannon92 commented 11 months ago

The phrase "invalid image" seems misleading here.

You essentially want a limit on the number of image pull attempts, after which the pod is declared failed.

"Invalid image" bothers me because there is validation of the image reference itself (InvalidImageName), and there are other kinds of failures where the image pull just fails. I don't know whether all of these cases end up in image pull backoff.

You also have cases where the pull secrets are invalid, so the pod sits in the ErrImagePull state.
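All of these surface today as waiting reasons on the container statuses, so a controller can already observe them; the open question is which of them should ever be treated as terminal. A small sketch of what that observation looks like:

```go
package example

import corev1 "k8s.io/api/core/v1"

// Collect the image-pull-related waiting reasons currently reported on a
// pod's containers. The reason strings are the ones the kubelet sets today;
// nothing here decides whether they are recoverable.
func imagePullWaitingReasons(pod *corev1.Pod) []string {
	statuses := make([]corev1.ContainerStatus, 0,
		len(pod.Status.InitContainerStatuses)+len(pod.Status.ContainerStatuses))
	statuses = append(statuses, pod.Status.InitContainerStatuses...)
	statuses = append(statuses, pod.Status.ContainerStatuses...)

	var reasons []string
	for _, cs := range statuses {
		if cs.State.Waiting == nil {
			continue
		}
		switch cs.State.Waiting.Reason {
		case "InvalidImageName", // reference failed validation; retrying cannot help
			"ErrImagePull",     // last pull attempt failed (bad reference, bad secret, network, ...)
			"ImagePullBackOff": // kubelet is backing off between pull attempts
			reasons = append(reasons, cs.State.Waiting.Reason)
		}
	}
	return reasons
}
```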

kannon92 commented 11 months ago

> xref: https://github.com/kubernetes/test-infra/blob/954bd25d267d71e210848cd2783410c4297e7ce2/prow/apis/prowjobs/v1/types.go#L532-L540
>
> Prow sees incorrect images fairly often and catch-all podPendingTimeout has been helpful to mitigate this. We have a controller-wide default with per-ProwJob override.

That is an interesting idea, but I worry that ML workloads may experience much slower image pull times. Do you have to do a lot of tuning of these pull timeouts in Prow?

BenTheElder commented 11 months ago

We haven't needed to change that timeout much; it's usually set to something fairly generous and isn't exclusively covering image pulls. I could imagine certain environments with extremely slow pulls of large images over a slow network, but I bet those already hit other timeouts that need tuning.

And yes, good point on the phrasing.

These are potentially valid image references, and it would be a layering violation to check whether the image exists (plus there are auth concerns, etc.). Even if we could check, the image could come into existence later, so we have to assume a reference is valid as long as its format is. So "invalid" isn't the right word.

I think the point remains: a job that refers to an image that doesn't exist, and will never become pullable, can get stuck forever and waste resources. Maybe call it "non-existent images".

AxeZhan commented 11 months ago

/cc

sftim commented 11 months ago

BTW, there may already be community solutions for this; for example, I think Kyverno can spot this and fail the Pod.

kannon92 commented 11 months ago

Reposting my thoughts here:

> cc @kannon92 you thought about this problem before. Maybe it's time to pursue it.

Yes, this was the crux of https://github.com/kubernetes/enhancements/pull/3816. If anyone wants to pick that up, please feel free. I don't think I will have the bandwidth to lead that anytime soon.

I decided to focus on PodReadyToStartContainers since it was an already existing feature covering issues in pod sandbox creation. It was promoted to beta in 1.29, so at least a condition is surfaced for sandbox issues.

My 2c on this is that there are two lifecycle areas that are not well covered by pod conditions/errors: image issues and pod lifecycle (i.e. pod hooks or config setup).

If anything goes wrong during image pull (validation, secrets, etc.), the pod can end up in ImagePullBackOff. The PR I linked was mostly about adding conditions for these cases.

So my thinking is that there may be a need for two conditions (image lifecycle and pod lifecycle). I don't know whether I should split that KEP into two, one per condition.
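To make the image side of that concrete, such a condition could look something like the sketch below; the condition type and reason here are placeholders, not anything agreed on in a KEP:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical sketch of an image-lifecycle condition, alongside the existing
// PodReadyToStartContainers condition: it would record that image pulls are
// failing and why. The condition type and reason below are placeholders.
func exampleImagePullCondition() corev1.PodCondition {
	return corev1.PodCondition{
		Type:               corev1.PodConditionType("PodImagePullFailed"), // placeholder name
		Status:             corev1.ConditionTrue,
		Reason:             "ImagePullBackOff",
		Message:            "back-off pulling image after repeated failures",
		LastTransitionTime: metav1.Now(),
	}
}
```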

charles-chenzz commented 11 months ago

@kannon92 I might have the bandwidth to take on KEP-3815. Can I pick it up and see if we can push it forward?

AxeZhan commented 11 months ago

I looked into https://github.com/kubernetes/enhancements/pull/3816, and it seems that KEP aims to add a condition to the affected pods, whereas this issue also aims to change the pod's phase?

alculquicondor commented 11 months ago

Both would be useful. It can be the same KEP, though.

kannon92 commented 11 months ago

So, since @alculquicondor has a use case in mind for this, I would prioritize this feature over mine. I brought it up because the main issue you will run into if you take this on is that there is no stable condition or field for these cases. Most of them are probably not as important operationally as image pulls, though.

Maybe as part of this KEP, we could add a condition for image errors and also add this functionality.

kannon92 commented 11 months ago

Also remember to compare different container runtimes, as I'm not sure whether my findings in that KEP transfer to CRI-O.

alculquicondor commented 11 months ago

@kannon92 I don't know if you already had agreement from sig-node that this feature is feasible. Otherwise, I would recommend attending a SIG Node meeting before wasting too much time on a KEP.

kannon92 commented 11 months ago

> @kannon92 I don't know if you already had agreement from sig-node that this feature is feasible. Otherwise, I would recommend attending a SIG Node meeting before wasting too much time on a KEP.

Which feature? I never got much attention from sig-node on my original feature. For whoever takes this, I agree that they should bring it to sig-node and see if it is viable.

tenzen-y commented 9 months ago

The potential concrete use-case is here: https://github.com/kubeflow/training-operator/issues/2001

kannon92 commented 8 months ago

I am going to take this to sig-node on Mar 5th and see if it's a viable option for someone to pursue.

edit: pushed to Mar 12th.

dchen1107 commented 8 months ago

+1 on adding the upper limit.

kannon92 commented 8 months ago

In sig-node, @mrunalp brought up that maybe we should consider both ImagePullBackOff and CrashLoopBackOff for pod lifecycle. One is related to images and the other is related to failures in the pod lifecycle.

It may be worth considering both in this feature.

k8s-triage-robot commented 5 months ago

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged. Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

BenTheElder commented 5 months ago

cc @lauralorenz xref https://github.com/kubernetes/enhancements/issues/4603

SergeyKanzhelev commented 5 months ago

I think it falls nicely under the "unscheduling of pods" category as well: https://docs.google.com/document/d/1fOx4PTN61sDv6u9C-t-WpOH8hi2IpenJWPc-Za8UAQE/edit#heading=h.1tvyczqnfmzb. The biggest problem, as mentioned above, is distinguishing recoverable from non-recoverable image pull errors, and deciding what value to set the retry limit to.
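To illustrate why that distinction is hard: the only machine-readable signal today is the waiting reason, and anything finer-grained means matching runtime-specific message text, which is fragile and differs between containerd and CRI-O. A heuristic sketch (the substrings are illustrative, not a contract):

```go
package example

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// Heuristic sketch only. The waiting reason distinguishes just a few cases;
// finer-grained classification falls back to message text, which is not a
// stable interface and varies across container runtimes.
func looksNonRecoverable(w *corev1.ContainerStateWaiting) bool {
	if w == nil {
		return false
	}
	if w.Reason == "InvalidImageName" {
		return true // failed client-side validation; retrying cannot help
	}
	msg := strings.ToLower(w.Message)
	// "manifest unknown" / "not found" suggest the tag or repository does not
	// exist right now, though the image could still be pushed later.
	return strings.Contains(msg, "manifest unknown") || strings.Contains(msg, "not found")
}
```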

lauralorenz commented 2 months ago

Hello! This is interesting. I do think this can be worked on independently of CrashLoopBackoff. Addressing the ImagePull backoff is explicitly out of scope for what I'm working on related to that in https://github.com/kubernetes/enhancements/issues/4603 (technically they share some configuration /today/ but as part of my work I will be separating that so that I don't affect image pull backoff with my changes).

I think it /should/ be worked on separately because, from conversations I've had around this space, image pull failures are expected to be easier for Kubernetes to introspect and act on than the application crashes controlled by CrashLoopBackOff (e.g. if it's a 404 or a 413, it's probably not going to get fixed any time soon, so you could fail early or at least back off differently). So we should research those use cases separately, and they should have their own backoff logic.

So while I think both should be worked on, if I'm working on CrashLoopBackOff and you are working on ImagePullBackOff, I think that works great! And we can at least trade ideas, since we're hanging out in the same code area.
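To make "fail early or at least back off differently" concrete, here is a purely hypothetical sketch of a per-error-class retry policy; none of this exists in the kubelet, it only illustrates how image pull backoff could diverge from CrashLoopBackOff configuration:

```go
package example

import "time"

// Hypothetical policy mapping a coarse classification of the registry error
// to a retry decision. The error classes and thresholds are made up for
// illustration only.
type pullErrorClass int

const (
	pullErrRetryable pullErrorClass = iota // network blips, 5xx, throttling
	pullErrNotFound                        // 404-style: manifest or tag unknown
	pullErrTooLarge                        // 413-style: rejected by the registry
)

type pullRetryDecision struct {
	FailPod bool          // give up and surface a terminal failure
	Backoff time.Duration // otherwise, wait this long before the next attempt
}

func decide(class pullErrorClass, attempts int) pullRetryDecision {
	switch class {
	case pullErrNotFound, pullErrTooLarge:
		// Unlikely to resolve soon: fail early after a handful of tries.
		if attempts >= 3 {
			return pullRetryDecision{FailPod: true}
		}
		return pullRetryDecision{Backoff: 30 * time.Second}
	default:
		// Retryable classes keep an exponential backoff, capped.
		if attempts > 5 {
			attempts = 5
		}
		return pullRetryDecision{Backoff: time.Duration(1<<uint(attempts)) * 10 * time.Second}
	}
}
```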

kannon92 commented 2 months ago

Sounds good. I'm not sure I will get this into 1.32 at the moment, but I wanted to see if this is a use case we could address. If not, then we can target ImagePullBackOff separately.

tallclair commented 2 months ago

This is somewhat tangential to this issue, but we need a better feedback mechanism to controllers (particularly replicaset) for pods failing with non-recoverable configuration errors. Otherwise we can end up in a hot-looping situation where the controller repeatedly recreates a pod that's immediately marked as failed.

I don't think that would block this issue since the initial image pull backoff would prevent a hot-loop situation.

kannon92 commented 2 months ago

Agree. I sorta started that discussion when I joined the k8s community. https://github.com/kubernetes/enhancements/pull/3816

It was the first KEP I tried to tackle, so I got a bit overwhelmed.