Open alculquicondor opened 11 months ago
/wg batch /sig node
This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
cc @kannon92
Prior work was in https://github.com/kubernetes/enhancements/pull/3816
Prow sees incorrect images fairly often, and a catch-all podPendingTimeout has been helpful in mitigating this. We have a controller-wide default with a per-ProwJob override.
The phrase "invalid image" seems misleading here.
You essentially want a limit on the number of image pull attempts, after which the pod is declared failed.
"Invalid image" bothers me because there is validation on image references (InvalidImageName) and other kinds of validation failures where the image pull fails. I don't know if all of these cases end up in image pull backoff.
You also have cases where the pull secrets are invalid, so the pod sits in the ErrImagePull state.
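For context, a minimal sketch of how these states surface today: they show up as waiting reasons on the container statuses, so anything acting on them has to string-match reasons like the ones below (the set and the helper are illustrative, not exhaustive or a stable API):

```go
package podutil

import corev1 "k8s.io/api/core/v1"

// imagePullWaitingReasons lists waiting reasons the kubelet sets for
// image-related problems; illustrative, not exhaustive.
var imagePullWaitingReasons = map[string]bool{
	"InvalidImageName": true, // image reference failed validation
	"ErrImagePull":     true, // pull attempt failed (auth, missing tag, network, ...)
	"ImagePullBackOff": true, // kubelet is backing off between pull attempts
}

// HasImagePullProblem reports whether any container in the Pod is currently
// waiting because of an image-related error.
func HasImagePullProblem(pod *corev1.Pod) bool {
	statuses := append([]corev1.ContainerStatus{}, pod.Status.InitContainerStatuses...)
	statuses = append(statuses, pod.Status.ContainerStatuses...)
	for _, cs := range statuses {
		if cs.State.Waiting != nil && imagePullWaitingReasons[cs.State.Waiting.Reason] {
			return true
		}
	}
	return false
}
```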
Prow sees incorrect images fairly often, and a catch-all podPendingTimeout has been helpful in mitigating this. We have a controller-wide default with a per-ProwJob override.
That is an interesting idea, but I worry that ML workloads may experience much slower image pull times. Do you have to do a lot of tuning for these pull times in Prow?
We haven't needed to change that timeout much. It's usually set to something fairly generous and isn't exclusively covering image pulls. I could imagine environments with extremely slow pulls of large images over a slow network, but I bet those already hit other timeouts that need tuning.
And yes, good point on the phrasing.
These are potentially valid image references, and it would be a layering violation to check whether the image exists (plus auth concerns, etc.). Even if we could check, the image could come into existence later, so we have to assume the reference is valid as long as its format is. So "invalid" isn't the right word.
The point remains, though, that a job referring to an image that doesn't exist and will never become pullable can get stuck forever and waste resources. Maybe call it "non-existent images".
/cc
BTW, there may already be community solutions for this; for example, I think Kyverno can spot this and fail the Pod.
Reposting my thoughts here:
cc @kannon92 you thought about this problem before. Maybe it's time to pursue it.
Yes, this was the crux of https://github.com/kubernetes/enhancements/pull/3816. If anyone wants to pick that up, please feel free. I don't think I will have the bandwidth to lead that anytime soon.
I decided to focus on PodReadyToStartContainers, as that was already an existing feature covering issues in pod sandbox creation. It was promoted to beta in 1.29, so at least a condition is now surfaced for sandbox issues.
My 2c on this is that there are two lifecycle areas that are not well covered by pod conditions/errors: image issues and pod lifecycle (i.e. pod hooks or config setup).
If there is any problem with the image pull (validation, secrets, etc.), the pod can end up in ImagePullBackOff. The PR I linked was mostly about adding conditions for these cases.
So my thinking is that maybe there is a need for two conditions (image lifecycle and pod lifecycle). I don't know whether I should split that KEP into two separate ones, one per condition.
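To make the two-condition idea concrete, a purely hypothetical sketch; only PodReadyToStartContainers exists today, and the other two names are invented here for illustration, not proposed API:

```go
// Condition types illustrating the split discussed above.
const (
	// Existing condition (beta in 1.29): pod sandbox creation succeeded.
	PodReadyToStartContainers = "PodReadyToStartContainers"

	// Hypothetical: would surface image lifecycle problems
	// (reference validation, pull auth, ImagePullBackOff) as a Pod condition.
	PodImagePullHealthy = "PodImagePullHealthy"

	// Hypothetical: would surface pod lifecycle problems (hooks, config setup).
	PodLifecycleHealthy = "PodLifecycleHealthy"
)
```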
@kannon92 I might have bandwidth to take on KEP-3815; can I pick it up and see if we can push it forward?
I looked into https://github.com/kubernetes/enhancements/pull/3816. It seems that KEP aims to add a condition to the affected pods, whereas this issue aims to also change the pod's phase?
Both would be useful. It can be the same KEP, though.
Since @alculquicondor has a use case in mind for this, I would prioritize this feature over mine. I brought it up because the main issue you will run into if you take this on is that there is no stable condition or field for these cases. Most of them are probably not as important for operators as image pulls, though.
Maybe as part of this KEP, we could add a condition for image errors and also add this functionality.
Also remember to compare different container runtimes, as I'm not sure whether my findings in that KEP transfer to CRI-O.
@kannon92 I don't know if you already had agreement from sig-node that this feature is feasible. Otherwise, I would recommend attending a SIG Node meeting before wasting too much time on a KEP.
Which feature? I never got much attention from sig-node on my original feature. Whoever takes this on, I agree they should bring it to sig-node and see if it is viable.
The potential concrete use-case is here: https://github.com/kubeflow/training-operator/issues/2001
I am going to take this to sig-node on Mar 5th and see if it's a viable option for someone to pursue.
edit: pushed to Mar 12th.
+1 on adding the upper limit.
In sig-node, @mrunalp brought up that maybe we should consider both ImagePullBackOff and CrashLoopBackOff for pod lifecycle. One is related to images and the other is related to failures in the pod lifecycle.
It may be worth considering both in this feature.
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
/triage accepted (org members only)
/priority important-longterm or /priority backlog
/close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
cc @lauralorenz xref https://github.com/kubernetes/enhancements/issues/4603
I think it also falls nicely under the unscheduling-of-pods category: https://docs.google.com/document/d/1fOx4PTN61sDv6u9C-t-WpOH8hi2IpenJWPc-Za8UAQE/edit#heading=h.1tvyczqnfmzb. The biggest problem, as mentioned above, is distinguishing recoverable from non-recoverable image pull errors, and deciding what value to set the retry limit to.
Hello! This is interesting. I do think this can be worked on independently of CrashLoopBackoff. Addressing the ImagePull backoff is explicitly out of scope for what I'm working on related to that in https://github.com/kubernetes/enhancements/issues/4603 (technically they share some configuration /today/ but as part of my work I will be separating that so that I don't affect image pull backoff with my changes).
I think it /should/ be worked on separately because, from conversations I've had around this space, image pull failures are expected to be easier for Kubernetes to introspect and act on than application crashes controlled by CrashLoopBackOff (for example, a 404 or a 413 is probably not going to get fixed anytime soon, so you could fail early or at least back off differently). So we should research those use cases separately, and they should have their own backoff logic.
So while I think both should be worked on, if I'm working on CrashLoopBackOff and you are working on ImagePullBackOff, I think that works great! And we can at least trade ideas since we're hanging out in the same code area.
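For the 404/413 point above, a rough illustrative sketch of how pull errors could be classified if the registry's HTTP status were surfaced to whatever decides to fail early; the function name and the exact classification are assumptions for the sake of example:

```go
package imagepull

import "net/http"

// Retryable is a rough, illustrative classification of registry responses:
// client errors like 404 (unknown manifest) or 401/403 (bad credentials) are
// unlikely to fix themselves, so a controller could fail early, while rate
// limiting and server errors are worth retrying with backoff.
func Retryable(statusCode int) bool {
	switch statusCode {
	case http.StatusNotFound, http.StatusUnauthorized, http.StatusForbidden,
		http.StatusRequestEntityTooLarge:
		return false // likely permanent: wrong image, bad auth, or oversized layer
	case http.StatusTooManyRequests:
		return true // rate limited: back off and retry
	default:
		return statusCode >= 500 // registry-side trouble is usually transient
	}
}
```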
Sounds good. I'm not sure if I will get this into 1.32 at the moment, but I wanted to see if this is a use case we could address. If not, then we can target ImagePullBackOff separately.
This is somewhat tangential to this issue, but we need a better feedback mechanism to controllers (particularly replicaset) for pods failing with non-recoverable configuration errors. Otherwise we can end up in a hot-looping situation where the controller repeatedly recreates a pod that's immediately marked as failed.
I don't think that would block this issue since the initial image pull backoff would prevent a hot-loop situation.
Agree. I sorta started that discussion when I joined the k8s community. https://github.com/kubernetes/enhancements/pull/3816
It was the first KEP I tried to tackle, so I got a bit overwhelmed.
What would you like to be added?
A mechanism to set a Pod to phase=Failed when the image pull has failed a number of times (perhaps configurable). Currently, the Pod just stays in phase=Pending.
Why is this needed?
This is especially problematic for Jobs submitted through a queueing system.
In a queued environment, the time when the job starts running (pods created) might be hours or even days after the Job is created. By then, the user who submitted the job might not realize their mistake until it's too late. Since these Pods block resources in the cluster, they might prevent other pending Jobs from starting.
If the Pods stay in phase=Pending, the job controller cannot do anything about them, as it only does "failure handling" once the Pods actually terminate with phase=Failed.
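Until something like this exists in-tree, a rough out-of-tree approximation is possible. The sketch below is only an illustration under several assumptions (a simple polling loop, a 30-minute threshold, and Pod deletion as a crude stand-in for setting phase=Failed): it removes Pods that have been Pending in ImagePullBackOff for too long, in the spirit of Prow's catch-all podPendingTimeout.

```go
package main

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// pendingTimeout is an illustrative threshold; environments with slow pulls
// of large images would need something more generous.
const pendingTimeout = 30 * time.Minute

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	for {
		pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
			FieldSelector: "status.phase=Pending",
		})
		if err == nil {
			for _, pod := range pods.Items {
				if stuckOnImagePull(&pod) && time.Since(pod.CreationTimestamp.Time) > pendingTimeout {
					// Deleting the stuck Pod is a crude stand-in: an in-tree
					// mechanism would instead set phase=Failed so owning
					// controllers (e.g. the Job controller) can react.
					_ = client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
				}
			}
		}
		time.Sleep(time.Minute)
	}
}

// stuckOnImagePull reports whether any container is waiting in ImagePullBackOff.
func stuckOnImagePull(pod *corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == "ImagePullBackOff" {
			return true
		}
	}
	return false
}
```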