Retriable and non-retriable Pod failures for Jobs

alculquicondor commented 2 years ago

Enhancement Description

One-line enhancement description (can be used as a release note): An API to influence retries based on exit codes and/or pod deletion reasons.
Kubernetes Enhancement Proposal: https://git.k8s.io/enhancements/keps/sig-apps/3329-retriable-and-non-retriable-failures
Discussion Link: https://github.com/kubernetes/kubernetes/issues/17244
Primary contact (assignee): @alculquicondor
Responsible SIGs: apps, api-machinery, scheduling
Enhancement target (which target equals to which milestone):
- Alpha release target (x.y): 1.25
- Beta release target (x.y): 1.26
- Stable release target (x.y): 1.31
[x] Alpha
- [x] KEP (k/enhancements) update PR(s):
- 3374
- 3438
- 3447
- 3452
- [x] Code (k/k) update PR(s):
- kubernetes/kubernetes#111070
- kubernetes/kubernetes#111084
- kubernetes/kubernetes#111091
- kubernetes/kubernetes#110959
- kubernetes/kubernetes#111113
- kubernetes/kubernetes#111475
- [x] Docs (k/website) update PR(s): kubernetes/website#35219
[x] Beta
- [x] KEP (k/enhancements) update PR(s):
- 3463
- 3646
- 3769
- 3757
- v1.28
  - https://github.com/kubernetes/enhancements/pull/3965
  - https://github.com/kubernetes/enhancements/pull/3940
- v1.30
  - https://github.com/kubernetes/enhancements/pull/4442
- [x] Code (k/k) update PR(s):
- https://github.com/kubernetes/kubernetes/pull/112360
- https://github.com/kubernetes/kubernetes/pull/113324
- https://github.com/kubernetes/kubernetes/pull/113304
- https://github.com/kubernetes/kubernetes/pull/113360
- https://github.com/kubernetes/kubernetes/pull/113812
- https://github.com/kubernetes/kubernetes/pull/113580
- https://github.com/kubernetes/kubernetes/pull/113856
- https://github.com/kubernetes/kubernetes/pull/113860
- https://github.com/kubernetes/kubernetes/pull/113927
- kubernetes/kubernetes#114770
- kubernetes/kubernetes#114914
- kubernetes/kubernetes#115056
- https://github.com/kubernetes/kubernetes/pull/115331
- https://github.com/kubernetes/kubernetes/pull/116554
- https://github.com/kubernetes/kubernetes/pull/117586
- https://github.com/kubernetes/kubernetes/pull/117015
- https://github.com/kubernetes/kubernetes/pull/121103
- [x] Docs (k/website) update(s):
- https://github.com/kubernetes/website/pull/37242
- https://github.com/kubernetes/website/pull/38040
- https://github.com/kubernetes/website/pull/38042
- https://github.com/kubernetes/website/pull/39809
- https://github.com/kubernetes/website/pull/41745
[x] Stable
- [x] KEP (k/enhancements) update PR(s): #4661
- [x] Code (k/k) update PR(s):
- https://github.com/kubernetes/kubernetes/pull/125533
- https://github.com/kubernetes/kubernetes/pull/125442
- https://github.com/kubernetes/kubernetes/pull/125461
- https://github.com/kubernetes/kubernetes/pull/125482
- https://github.com/kubernetes/kubernetes/pull/126046
- https://github.com/kubernetes/kubernetes/pull/126644
- https://github.com/kubernetes/kubernetes/pull/125994
- https://github.com/kubernetes/kubernetes/pull/126102
- [x] Docs (k/website) update(s):
- https://github.com/kubernetes/website/pull/46807
- https://github.com/kubernetes/website/pull/47378
- https://github.com/kubernetes/website/pull/46896

alculquicondor commented 2 years ago

/sig apps /wg batch

alculquicondor commented 2 years ago

/assign

mimowo commented 2 years ago

/assign

alculquicondor commented 2 years ago

/sig scheduling /sig api-machinery

Priyankasaggu11929 commented 2 years ago

Hello @alculquicondor 👋, 1.25 Enhancements team here.

Just checking in as we approach enhancements freeze on 18:00 PT on Thursday June 23, 2022, which is just over 2 days from now.

For note, This enhancement is targeting for stage alpha for 1.25 (correct me, if otherwise)

Here's where this enhancement currently stands:

[ ] KEP file using the latest template has been merged into the k/enhancements repo.
[ ] KEP status is marked as implementable
[ ] KEP has a updated detailed test plan section filled out
[ ] KEP has up to date graduation criteria
[ ] KEP has a production readiness review that has been completed and merged into k/enhancements.

The open PR https://github.com/kubernetes/enhancements/pull/3374 is addressing all the listed criteria above. We would just require getting it merged by the Enhancements Freeze.

For note, the status of this enhancement is marked as at risk. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

Priyankasaggu11929 commented 2 years ago

With KEP PR #3374 merged, the enhancement is ready for the 1.25 Enhancements Freeze.

For note, the status is now marked as tracked. Thank you so much!

kcmartin commented 2 years ago

Hello @alculquicondor 👋, 1.25 Release Docs Lead here. This enhancement is marked as ‘Needs Docs’ for 1.25 release.

Please follow the steps detailed in the documentation to open a PR against dev-1.25 branch in the k/website repo. This PR can be just a placeholder at this time, and must be created by August 4.  Also, take a look at Documenting for a release to familiarize yourself with the docs requirement for the release.  Thank you!

Atharva-Shinde commented 2 years ago

Hi @alculquicondor @mimowo, Enhancements team here again 👋

Checking in as we approach Code Freeze at 01:00 UTC on Wednesday, 3rd August 2022.

Please ensure that the following items are completed before the code-freeze:

[X] All PRs to the Kubernetes repo that are related to your enhancement are linked in the above issue description (for tracking purposes). https://github.com/kubernetes/kubernetes/pull/111070 https://github.com/kubernetes/kubernetes/pull/111084 https://github.com/kubernetes/kubernetes/pull/111091 https://github.com/kubernetes/kubernetes/pull/110959 https://github.com/kubernetes/kubernetes/pull/111113 https://github.com/kubernetes/kubernetes/pull/111475
[ ] All PRs are fully merged by the code freeze deadline. https://github.com/kubernetes/kubernetes/pull/110959 https://github.com/kubernetes/kubernetes/pull/111113 https://github.com/kubernetes/kubernetes/pull/111475

Let me know if there are any additional k/k PRs besides the ones listed above

Currently, the status of the enhancement is marked as at-risk

Thanks :)

mimowo commented 2 years ago

@Atharva-Shinde @alculquicondor there is one more PR that should be included before the code freeze: https://github.com/kubernetes/kubernetes/pull/111475

Atharva-Shinde commented 2 years ago

thank you @mimowo, I have updated my comment with the PR and have also tagged you for future reference :)

Atharva-Shinde commented 2 years ago

Hey @alculquicondor @mimowo, reaching out again as we approach Code Freeze at 01:00 UTC on this Wednesday i.e 3rd August 2022. Try to get all the open Code(k/k) PRs mentioned in the Issue Description merged before the code-freeze :) https://github.com/kubernetes/kubernetes/pull/110959 https://github.com/kubernetes/kubernetes/pull/111113 https://github.com/kubernetes/kubernetes/pull/111475

The status of the enhancement is still marked as at-risk

Priyankasaggu11929 commented 2 years ago

Hello :wave:, 1.25 Enhancements Lead here.

Following discussion in Slack the release team is APPROVING this exception request. Your updated deadline to make any changes to your PR is 9:30 AM PST Monday 8th August 2022

https://groups.google.com/u/0/g/kubernetes-sig-release/c/EBdL_-Jhv_s

Thank you!

alculquicondor commented 2 years ago

All the implementation PRs are merged :smiley:

IIUC, we are still making it to the beta release?

Priyankasaggu11929 commented 2 years ago

Yes, the enhancement is marked as tracked for 1.25 release cycle. Thank you so much! 🙂

alculquicondor commented 1 year ago

/sig node

soltysh commented 1 year ago

/milestone v1.26 /stage beta /label lead-opted-in

rhockenbury commented 1 year ago

/label tracked/yes /remove-label tracked/no

Atharva-Shinde commented 1 year ago

Hey @alculquicondor @mimowo 👋, 1.26 Enhancements team here!

Just checking in as we approach Enhancements Freeze on 18:00 PDT on Thursday 6th October 2022.

This enhancement is targeting for stage beta for 1.26

Here's where this enhancement currently stands:

[X] KEP file using the latest template has been merged into the k/enhancements repo.
[X] KEP status is marked as implementable
[X] KEP has an updated detailed test plan section filled out
[X] KEP has up to date graduation criteria
[ ] KEP has a production readiness review that has been completed and merged into k/enhancements.

For this KEP, we would need to:

Update the kep.yaml to reflect the current milestone information
Update the production readiness review with latest stage information
Include the new updated PR of this KEP in the Issue Description and get it merged before Enhancements Freeze to make this enhancement eligible for 1.26 release.

The status of this enhancement is marked as at risk. Please keep the issue description up-to-date with appropriate stages as well. Thank you :)

mimowo commented 1 year ago

@Atharva-Shinde the enhancement is targeting Beta for 1.26. This is the KEP update which is currently under review: https://github.com/kubernetes/enhancements/pull/3463.

Atharva-Shinde commented 1 year ago

Thanks @mimowo I've updated my comment :)

derekwaynecarr commented 1 year ago

/milestone v1.26 /label lead-opted-in

(For sig-node, we see this is not attempting to derive any intelligence from kubelet/runtime initiated conditions)

Atharva-Shinde commented 1 year ago

Hello @alculquicondor @mimowo 👋, just a quick check-in again, as we approach the 1.26 Enhancements freeze.

Please plan to get the action items mentioned in my comment above done before Enhancements freeze on 18:00 PDT on Thursday 6th October 2022 i.e tomorrow

For note, the current status of the enhancement is marked at-risk :)

soltysh commented 1 year ago

@Atharva-Shinde the PRR has been approved at the correct beta level in https://github.com/kubernetes/enhancements/pull/3463/files so not quite sure what else do you expect?

Atharva-Shinde commented 1 year ago

Thanks @soltysh for bringing this to my notice (not sure how I missed this sorry for the error), everything is up-to-date! I've updated the KEP status to tracked for 1.26 release cycle :)

parul5sahoo commented 1 year ago

Hi @alculquicondor and @mimowo 👋,

Checking in once more as we approach 1.26 code freeze at 17:00 PDT on Tuesday 8th November 2022.

Please ensure the following items are completed:

[ ] All PRs to the Kubernetes repo that are related to your enhancement are linked in the above issue description (for tracking purposes).
[ ] All PRs are fully merged by the code freeze deadline.

For this enhancement, please plan to get PRs out for all k/k code so it can be merged up by code freeze. If you do have k/k PRs open, please link them to this issue. Let me know if there aren't any further PRs that need to be created or merged for this enhancements, so that I can mark it as tracked for code freeze.

As always, we are here to help should questions come up. Thanks!

cathchu commented 1 year ago

Hello @alculquicondor and @mimowo 👋 1.26 Release Docs shadow here!

This enhancement is marked as ‘Needs Docs’ for 1.26 release. Please follow the steps detailed in the documentation to open a PR against dev-1.26 branch in the k/website repo. This PR can be just a placeholder at this time, and must be created by November 9. Also, take a look at Documenting for a release to familiarize yourself with the docs requirement for the release.

Thank you!

mimowo commented 1 year ago

The placeholder PR is prepared: https://github.com/kubernetes/website/pull/37242. @alculquicondor please reference it in the Issue description.

parul5sahoo commented 1 year ago

Hey @mimowo and @alculquicondor ,

As the Code freeze is just a day away, just wanted to confirm that there are no open PRs in the K/K repo or any repo in general for this enhancement other than the ones outlined in the issue description? Please get the open PRs merged before the code freeze, so that the enhancement can be marked tracked.

mimowo commented 1 year ago

There is one more k/e PR with a purpose to align the KEP with the decisions taken during the implementation phase. Not sure if it should be blocking for the Code Freeze. Anyway, could you @alculquicondor please add the KEP update to the list of PRs and review / approve.

Hey @mimowo and @alculquicondor ,

As the Code freeze is just a day away, just wanted to confirm that there are no open PRs in the K/K repo or any repo in general for this enhancement other than the ones outlined in the issue description? Please get the open PRs merged before the code freeze, so that the enhancement can be marked tracked.

rhockenbury commented 1 year ago

We have this marked as tracked for code freeze.

marosset commented 1 year ago

/remove-label lead-opted-in /remove-label tracked/yes /label tracked/no /milestone clear

thockin commented 1 year ago

Hi, I was reading over KEPs and found this one. I don't fundamentally object, but I didn't really understand why NOT put at least the restart logic in Pod? It seems like you could define restartable exit codes and only restartOnFailure for those codes.

Also, do you lump signals into exit codes? Should that not be called out separately? E.g. OOM -> SIGKILL. That happens to be exit 136, but that isn't the most natural expression.

mimowo commented 1 year ago

Hi Tim, thank you for looking at the KEP and the questions.

Hi, I was reading over KEPs and found this one. I don't fundamentally object, but I didn't really understand why NOT put at least the restart logic in Pod? It seems like you could define restartable exit codes and only restartOnFailure for those codes.

We have considered placing the API at the pod level, and discuss it here: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures#job-level-vs-pod-level-spec. In summary, these are the arguments for making it Job-level:

We want to be able to terminate the entire Job in case of software bugs (which can be matched by exit codes). In the KEP this can be expressed by FailJob action. We wanted to avoid extending the Pod API to know about Jobs. In the future we consider also extending the list of actions by FailIndex action.
This is a natural extension of the currently present API in Job of controlling Pod restarts, by .spec.backoffLimit.
We support pod disruptions (such as preemption, API triggered eviction, node graceful shutdown, etc., scenarios are documented here: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-conditions). In these cases we do not want to restart the pod (and its containers). In that case we want to give users the ability to configure handling of pod failures to: 1. do not count the pod failure towards the backoffLimit in Job; 2. delete the pod and a replacement pod to be created (for example in case of node graceful shutdown on a new node).

Finally, we consider restarting a pod as a potential future work and we believe that the mechanisms could coexist.

Also, do you lump signals into exit codes? Should that not be called out separately? E.g. OOM -> SIGKILL. That happens to be exit 136, but that isn't the most natural expression.

We support expressing pod failure policies both based on exit codes and the reasons for failure (signals). For this purpose we annotate disrupted pods with a dedicated DisruptionTarget condition. Currently, we don’t support adding a dedicated pod failure condition for OOM kill due to issues:

It is not standardized, we’ve started some effort on the standardization though https://github.com/kubernetes/kubernetes/pull/112977
It is not currently working on CRI-O with cgroupv2 the same way as for containerd or CRI-O with cgroupv1.

However, handling OOM failures is considered as a potential future work, which fits into the framework of pod failure conditions. In particular, we considered introducing a new pod condition, called ResourceExhausted to indicate the reason. We discuss it more here: https://github.com/kubernetes/enhancements/blob/9aa33b83c3fbaa73e4fa2b96335b5c7e613a2b51/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md#resource-limits-exceeded.

thockin commented 1 year ago

We wanted to avoid extending the Pod API to know about Jobs.

Oh, I didn't mean that. I meant that you can configure this in a Job and have that populate a pod spec stanza. For example, the KEP shows a case of FailJob if exitcode notIn 40-42. You could inform the kubelet to the effect of RestartOnFailure but only if exit code is 40-42.

It seems to me that it would be unfortunate to set all Job Pods to never restart. Maybe I am barking up the wrong tree.

we annotate disrupted pods with a dedicated DisruptionTarget condition

Can I suggest "DisruptionVictim" or even just "Disrupted" ? "Target" sounds like an objective, rather than a result.

We support expressing pod failure policies both based on exit codes and the reasons for failure (signals)

The word "signal" appears in the KEP one time, and not for this. What I meant is that you have an API to say "Fail this job except case of exit code X" but not "fail this job except signal Y". This forces a user who wants to handle a signal to encode the signal number and do math (which may not be obvious to them) rather than express it naturally. Why not have an OnExitSignal or:

onExit:
    operator: NotIn
        statuses: [40,41,42]
        signals: [SIGTERM]

Meaning "if I die of SIGTERM, don't fail"

This is what I meant about OOM - if I get OOM killed (SIGKILL) it's not a software bug, per se. The KEP says "All remaining exit codes indicate a software bug" - which just isn't true.

Regarding the API in this KEP:

One of OnExitCodes and onPodConditions, but not both, can be used in each rule

You should structure one-of blocks in their own struct. This means something like:

type PodFailurePolicyRule struct {
    Action PodFailurePolicyAction

        When PodFailurePolicyRuleWhen
}

type PodFailurePolicyRuleWhen struct {
    OnExitCodes *PodFailurePolicyOnExitCodesRequirement

    OnPodConditions []PodFailurePolicyOnPodConditionsPattern
}

mimowo commented 1 year ago

I meant that you can configure this in a Job and have that populate a pod spec stanza. For example, the KEP shows a case of FailJob if exitcode notIn 40-42. You could inform the kubelet to the effect of RestartOnFailure but only if exit code is 40-42.

I see, but continuing on the example, if the exit code was 43, then Kubelet would not restart the pod. However, this would not fail the entire Job (when FailJob action is used), as it typically uses backoffLimit (6 by default), so the pod would be recreated by the Job controller.

A similar consideration is when adding the FailIndex action as it was requested in the community: https://github.com/kubernetes/kubernetes/issues/109712#issuecomment-1387204137.

It seems to me that it would be unfortunate to set all Job Pods to never restart. Maybe I am barking up the wrong tree.

In some cases users may want to fail a Job as soon as a pod fails due to a reason indicating a software bug (to save costs). For example, the TensorFlow framework defines the list of exit codes indicating a permanent error (see, https://www.kubeflow.org/docs/components/training/tftraining/ under “The following exit codes indicate a permanent error and the container will not be restarted”).

Can I suggest "DisruptionVictim" or even just "Disrupted" ? "Target" sounds like an objective, rather than a result.

The Disrupted was proposed and discussed here: https://github.com/kubernetes/enhancements/pull/3463/files#r968840656. The idea for adding Target is to emphasize that the pod may not have actually been disrupted (as the condition is added before the actual Delete request).

In some situations, the controller which added the condition may crash before performing the actual deletion, see: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures#failing-delete-after-a-condition-is-added. After restart. it may not re-attempt the deletion, leaving the condition stale, which we discuss here: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures#stale-disruptiontarget-condition-which-is-not-cleaned-up.

The word "signal" appears in the KEP one time, and not for this. What I meant is that you have an API to say "Fail this job except case of exit code X" but not "fail this job except signal Y". This forces a user who wants to handle a signal to encode the signal number and do math (which may not be obvious to them) rather than express it naturally.

From the user / API point of view exit codes can be inspected in the pod status, so it is easier to debug them. Exit codes are also supported on Windows, which makes them more general.

If we go for a restart policy for signals, then the alternatives I see to consider:

Surface the received signal like exit codes, in status for terminated containers (could be matched by a new onSignals)
Surface the received signal in a pod condition (could be matched by existing onPodCondiitons)
Move the responsibility to Kubelet (IIUC this is similar to what you suggest)

If we go for 3, then, I think we could have a mechanism that is based on signals and exit codes restarts the pod. Here, if Kubelet restarts the pod without transitioning the pod to Failed phase, then the interim restarts could be transparent to the job controller. I think such a mechanism could coexist with what is proposed in KEP. I consider it more as a possible future extension (we may relax restartPolicy to OnFailure), rather than a replacement.

This is what I meant about OOM - if I get OOM killed (SIGKILL) it's not a software bug, per se. The KEP says "All remaining exit codes indicate a software bug" - which just isn't true.

Yes, the sentence comes for a user story which simplifies the real world, probably too much. We could make it more realistic if we followed the convention of TensorFlow.

One of OnExitCodes and onPodConditions, but not both, can be used in each rule

You should structure one-of blocks in their own struct.

I see, we started by allowing to set them both, connected with the AND semantics. Then, the outer API of rules could play a role of OR. However, for now, we dropped the idea for simplicity as there was no clear use case for the AND semantics, but it is something we may consider to relax in the future.

Currently the API is in Beta, so I guess we fall under: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md#incompatible-api-changes.

jensentanlo commented 1 year ago

However, handling OOM failures is considered as a potential future work, which fits into the framework of pod failure conditions. In particular, we considered introducing a new pod condition, called ResourceExhausted to indicate the reason. We discuss it more here: https://github.com/kubernetes/enhancements/blob/9aa33b83c3fbaa73e4fa2b96335b5c7e613a2b51/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md#resource-limits-exceeded.

+1 for handling OOM failures in a direct way. First, let me say that the current feature already works great, especially for situations when the team that is running the job also controls the container (since we can programmatically return certain exit codes). However, for the situation where they are different (say you have a separate infra team), it would be great to be able to specify to not retry any OOM failures.

alculquicondor commented 1 year ago

We choose DisruptionTarget instead of Disrupted because the condition should describe the current state https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties

At the time we add the condition, the Pod has been chosen for disruption, but it's not disrupted yet.

You should structure one-of blocks in their own struct.

The current conventions don't state a mandatory way of doing it. Should it be updated? Still, the API went through several rounds of API reviews. https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#unions

but I didn't really understand why NOT put at least the restart logic in Pod?

It's partially technical debt and partially consistency.

The job controller already handles a backoffLimit that the kubelet doesn't know about.
The failure policy needs to consider pod conditions which are not added by kubelet. There are cases when a pod fails, but there is no exit code (for example, when a node fully disappears, or the pod was not scheduled or when the kubelet didn't have a chance to even start the containers). So we needed a level of control in the job API.

We could potentially add exit code control in the PodSpec and kubelet in the future, but at the time the usefulness of it outside of jobs was not clear. If we ever decide to do it, we could have some validation rules for the Job API that you can only set the failure policy in the PodSpec or the higher-level Job field, but not both.

However, for the situation where they are different (say you have a separate infra team), it would be great to be able to specify to not retry any OOM failures.

The infra team can do it through a webhook that adds the rule for every job, using the exit code 137.

But in general, to better handle OOM in kubernetes, we need to standardize the behavior in the container runtimes, and we are not there yet https://github.com/kubernetes/kubernetes/issues/112910

This is what I meant about OOM - if I get OOM killed (SIGKILL) it's not a software bug, per se. The KEP says "All remaining exit codes indicate a software bug" - which just isn't true.

That was just one story. The API gives users the flexibility to react to 137 as they wish. The problem with OOM is that it can be initiated by node pressure (not a software bug) or it could be because of the memory limit was exceeded (a software bug). We cannot distinguish these two scenarios today.

alculquicondor commented 1 year ago

Related to the discussion about RestartPolicy=OnFailure. It's already problematic https://github.com/kubernetes/kubernetes/issues/109870

thockin commented 1 year ago

I meant that you can configure this in a Job and have that populate a pod spec stanza. For example, the KEP shows a case of FailJob if exitcode notIn 40-42. You could inform the kubelet to the effect of RestartOnFailure but only if exit code is 40-42.

I see, but continuing on the example, if the exit code was 43, then Kubelet would not restart the pod. However, this would not fail the entire Job (when FailJob action is used), as it typically uses backoffLimit (6 by default), so the pod would be recreated by the Job controller.

I see, so you REALLY do want to take full control of restarts, rather than telling kubelet to do it when possible. Is that the correct take-away?

What if you could tell kubelet "restart it on this error codes, but only up to N times" ?

From the user / API point of view exit codes can be inspected in the pod status, so it is easier to debug them.

Blech, ok. I concede that, but I'm not sure it erases the clearer expression of intent.

Exit codes are also supported on Windows

Is there no concept of external termination status on Windows?

If we go for a restart policy for signals, then the alternatives I see to consider:

If we want to add this to the intention (spec), we probably want to add it to pod status, too. Not instead of exit code, but in addition.

I consider it more as a possible future extension

I'm OK with that - maybe just make a note in the KEP, and let's make sure the API has room for it, if we want to add it.

You should structure one-of blocks in their own struct.

I see, we started by allowing to set them both, connected with the AND semantics. Then, the outer API of rules could play a role of OR. However, for now, we dropped the idea for simplicity as there was no clear use case for the AND semantics, but it is something we may consider to relax in the future.

What we should be doing for this is a set of single-choice cases, where one case is "just X", and another is "just Y" and a third is "X and Y". E.g.

type Whatever struct {
  // other stuff...

  WhichMode WhateverMode
}

// Exactly one of these should be set.
type WhateverMode struct {
  X *ModeX
  Y *ModeY
  XY *ModeXY
}

type ModeX struct {
  // Selects X
  Selector metav1.LabelSelector
}

type ModeY struct {
  // Selects Y
  Selector metav1.LabelSelector
}

type ModeX struct {
  XSelector metav1.LabelSelector
  YSelector metav1.LabelSelector
}

I know it is tedious, but it avoids future pain where multiple fields have weird relationships to each other.

You can choose to do a second beta and handle the case of both "beta1" and "beta2" (I know they are not actually versioned) fields manually. I don't know if it is really worthwhile at this stage, but all my warning alarms are going off without it. For example, NetworkPolicy did like you have here, and we regret it.

I won't block progress on this one - I got most of what I wanted, which was to understand it :)

soltysh commented 1 year ago

/milestone v1.27 /label lead-opted-in

alculquicondor commented 1 year ago

To clarify, we are not going for GA in 1.27, just doing one improvement while remaining beta.

alculquicondor commented 1 year ago

@liggitt as api-reviewer for Alpha/Beta. What do you think about the API changed proposed in https://github.com/kubernetes/enhancements/issues/3329#issuecomment-1409552875

Is it worth the effort going through a field deprecation process?

mimowo commented 1 year ago

I see, so you REALLY do want to take full control of restarts, rather than telling kubelet to do it when possible. Is that the correct take-away?

For now, yes. We leave it for the future to revisit the approach and extend it for restartPolicy=OnFailure (see more below).

What if you could tell kubelet "restart it on this error codes, but only up to N times" ?

Useful in the long term. Then we could delegate the interim restarts to Kubelet and only surface the failures when the entire pod failed. This clearly separates the concerns for restarts, the only complication I see is that this means that there might be more restarts than the backoffLimit for a Job as a total, so we may need relax it. There are issues with the backoffLimit when restartPolicy=OnFailure, as pointed out by Aldo above. Once we solve the issues we could enable podFailurePolicy for restartPolicy=OnFailure. I think it deserves its own KEP.

Blech, ok. I concede that, but I'm not sure it erases the clearer expression of intent.

Well, the signals by themselves are also not very good at expressing the intent. For example, SIGKILL (translated to 137) might be sent in many scenarios, including OOM. In order to give users more control, we planned to standardize how different container runtimes handle OOM and surface this information as a pod condition ResourceExhausted (to also put exceeded disk limits under the same one). We see usefulness of that and it seems that the remaining fixes to CRI-O are on the way, once standardized, it seems that adding the condition will be done in either a separate KEP, or in a later Beta of this KEP, though I think the next follow up KEP will be better to avoid prolonging graduation of this one.

I'm OK with that - maybe just make a note in the KEP, and let's make sure the API has room for it, if we want to add it.

I will try to add it as a note in a pending KEP update or a follow up.

Is there no concept of external termination status on Windows?

I guess, so, but I’m not an expert on that. Apparently, underneath messages such as WM_CLOSE are used: https://learn.microsoft.com/en-us/windows/win32/winmsg/wm-close

It also seems that windows implements some layer of compatibility with the signals, but the does not use them on its own: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/signal?view=msvc-170 (“The SIGILL and SIGTERM signals aren't generated under Windows. They're included for ANSI compatibility”).

You can choose to do a second beta and handle the case of both "beta1" and "beta2" (I know they are not actually versioned) fields manually. I don't know if it is really worthwhile at this stage, but all my warning alarms are going off without it. For example, NetworkPolicy did like you have here, and we regret it.

My inclination is to keep it as is. First, it has passed API review, and the need for combining exitCodes with podConditions hasn’t surfaced when discussing different use-cases. We discussed allowing onPodConditions with onExitCodes, combined with logical AND, but the use case wasn’t clear. In the future we could either go this path, or add another dedicated field which allows us to specify the various modes of combining the matchers, similar to what you suggest.

I won't block progress on this one - I got most of what I wanted, which was to understand it :)

Thank you for looking at the KEP :).

liggitt commented 1 year ago

skimming the discussion, my takeaways are:

there's information that could be useful (signal information, etc) that isn't currently available from container runtimes or the pod API today
- if that information was made available in the future, it could be added as one of the things the job API considers to make it more natural to express some of the scenarios motivating this feature
- however, even if that information was made available in the future, we would still need the ability to consider exit codes
- starting with expressing job retry policy in terms of the information we currently have from container runtimes seems sensible
there's two levels of retry that could happen... node-local (defined in the pod, informing kubelet behavior), and job-level (defined in the job, informing job controller behavior)
- node-local retry could gain sophistication in the future
- even if node-local retry did get expanded/enhanced, job-level retry would still be necessary (nodes go away, or become disqualified from running a particular workload via taint or resource availability or preemption or ...)
- starting with expressing job retry policy to inform what the job controller is already doing seems sensible
for the structure of the failure policy rules, I don't see those as union types, but as policy matching requirements
- I think it is more powerful and more succinct to keep the requirements distinct and let multiple be specified than to require a new grouping... Tim's comment ("What we should be doing for this is a set of single-choice cases, where one case is "just X", and another is "just Y" and a third is "X and Y".) doesn't seem like it scales well
- as a thought experiment, imagine signal info gets added in the future, so we could match on exit codes, conditions, or signals, and for some reason, it became important to match on combinations of these things. To allow matching combinations with the "single-choice" approach, you'd need 7 fields:
- onExitCodes
- onConditions
- onSignals
- onExitCodesAndSignals
- onExitCodesAndConditions
- onExitCodesAndConditionsAndSignals
- onConditionsAndSignals

thockin commented 1 year ago

WRT one-off lol-ness:. The alternative to explicit is implicit. Once you define combinations of sometimes-optional fields, you are effectively giving a meaning to "unspecified".

It might seem OK now, but it seemed ok on network policy too, and we really regretted it Adding a net-new criterion which doesn't relate to those other fields tickles the default meaning of the unspecified fields.

Again, it might be OK here, but my own crystal ball is ineffective, so I prefer to be explicit.

Probably not worth retooling at this point anyway.

On Wed, Feb 1, 2023, 6:48 AM Jordan Liggitt @.***> wrote:

skimming the discussion, my takeaways are:

there's information that could be useful (signal information, etc) that isn't currently available from container runtimes or the pod API today

if that information was made available in the future, it could be added as one of the things the job API considers to make it more natural to express some of the scenarios motivating this feature

however, even if that information was made available in the future, we would still need the ability to consider exit codes

starting with expressing job retry policy in terms of the information we currently have from container runtimes seems sensible

there's two levels of retry that could happen... node-local (defined in the pod), and job-level (defined in the job)

node-local retry could gain sophistication in the future

even if node-local retry did get expanded/enhanced, job-level retry would still be necessary (nodes go away, or become disqualified from running a particular workload via taint or resource availability or preemption or ...)

starting with expressing job retry policy to inform what the job controller is already doing seems sensible

for the structure of the failure policy rules, I don't see those as union types, but as policy matching requirements

I think it is more powerful and more succinct to keep the requirements distinct and let multiple be specified than to require a new grouping... Tim's comment ("What we should be doing for this is a set of single-choice cases, where one case is "just X", and another is "just Y" and a third is "X and Y".) doesn't seem like it scales well

as a thought experiment, imagine signal info gets added in the future, so we could match on exit codes, conditions, or signals, and for some reason, it became important to match on combinations of these things. To allow matching combinations with the "single-choice" approach, you'd need 7 fields:

onExitCodes

onConditions

onSignals

onExitCodesAndSignals

onExitCodesAndConditions

onExitCodesAndConditionsAndSignals

onConditionsAndSignals

— Reply to this email directly, view it on GitHub https://github.com/kubernetes/enhancements/issues/3329#issuecomment-1412180770, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABKWAVG6HXSGGUPT6CX7SNTWVJZTBANCNFSM5XSDYNYA . You are receiving this because you commented.Message ID: @.***>

marosset commented 1 year ago

Hello @mimowo 👋, Enhancements team here.

Just checking in as we approach enhancements freeze on 18:00 PDT Thursday 9th February 2023.

This enhancement is targeting for a second beta for v1.27 (correct me, if otherwise)

Here's where this enhancement currently stands:

[x] KEP readme using the latest template has been merged into the k/enhancements repo.
[ ] KEP status is marked as implementable for latest-milestone: v1.27
[x] KEP readme has a updated detailed test plan section filled out
[ ] KEP readme has up to date graduation criteria
[x] KEP has a production readiness review that has been completed and merged into k/enhancements.

For this enhancement, it looks like https://github.com/kubernetes/enhancements/pull/3757 will address the remaining requirements.

The status of this enhancement is marked as at risk. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

shatoboar commented 1 year ago

Hi @alculquicondor - With https://github.com/kubernetes/enhancements/pull/3757 merged this enhancement is now tracked for now inclusion in the v1.27 release.

Thanks!

shatoboar commented 1 year ago

Hi @mimowo and @alculquicondor 👋, Checking in as we approach 1.27 code freeze at 17:00 PDT on Tuesday 14th March 2023. Please ensure the following items are completed:

[X] All PRs to the Kubernetes repo that are related to your enhancement are linked in the above issue description (for tracking purposes).
[ ] All PRs are fully merged by the code freeze deadline.

For this enhancement, it looks like the following PRs are open and need to be merged before code freeze:

https://github.com/kubernetes/kubernetes/pull/115331

Please let me know what other PRs in k/k I should be tracking for this KEP. As always, we are here to help should questions come up. Thanks!

atiratree commented 1 year ago

hi @mimowo and @alculquicondor, I have found an issue with API-initiated eviction when used together with PodDisruptionConditions https://github.com/kubernetes/kubernetes/issues/116552, I am trying to fix it in https://github.com/kubernetes/kubernetes/pull/116554

SergeyKanzhelev commented 1 year ago

/milestone v1.28

since there will be work for 1.28

kubernetes / enhancements

Retriable and non-retriable Pod failures for Jobs #3329

Enhancement Description

3374

3438

3447

3452

3463

3646

3769

3757