kubernetes-sigs / jobset

JobSet: An API for managing a group of Jobs as a unit
https://jobset.sigs.k8s.io/
Apache License 2.0
113 stars 36 forks source link

Simpler top level status fields for total, succeeded and failed jobs #545

Open shrinandj opened 2 months ago

shrinandj commented 2 months ago

What would you like to be added: Currently, the status of a jobset object consists only of two arrays: conditions and replicatedJobsStatus. This is great for providing detailed status of individual conditions and replicated jobs.

However, it makes referencing the overall status of the jobset a little harder.

Along with the current conditions and replicatedJobsStatus fields, it will be useful to have at least 3 additional simple fields in the status indicating the overall status of the jobset, similar to K8s jobs:

status:
  totalJobs: 2
  succeededJobs: 2
  failedJobs: 0
  conditions:
  - lastTransitionTime: "2024-04-22T18:10:09Z"
....
  replicatedJobsStatus:
  - active: 0
...

Why is this needed: We were trying to run Jobsets via Argo-workflows as part of a distributed training setup. Argo-workflows can create Kubernetes resources. In order to do this correctly, Argo allows users to specify a successCondition and failureCondition. These conditions need to be simple key=value conditions. Referencing array elements, filtering for specific conditions, etc. is not possible.

If the new fields are available, adding them to Argo or even using any other K8s tools would become that much easier:

      successCondition: status.succeededJobs = status.totalJobs
      failureCondition: status.failedJobs > 0
kannon92 commented 2 months ago

Is it not possible to use Conditions for this purpose? We emit a conditon with the job is successful or failed and you could check that.

kannon92 commented 2 months ago

The reason we have an array of this is because replicated jobs can all finish at different times so we can't really flatten the array to report jobs without that information.

shrinandj commented 2 months ago

In some cases, it is not possible to use array notations and filters for specific conditions.

In the case of Argo workflows, they use a label selector to parse the success and failure conditions (which I see only works with key = value pattern).

Is it possible to have totalJobs in status populated when the jobset is created and then populate succeededJobs and failedJobs when the AllJobsCompleted condition is being populated?

kannon92 commented 2 months ago

The concern is that I think that would be insufficient for all use cases. If you had a lot of replicated jobs you may have some of the jobs finishing while others are not.

say JobSet with rep a, rep b, and rep c jobs starting in parallel. User really wants all of rep c to finish to declare the job successful but rep a and rep b finish and user uses a count based approach. This would end up with a case where you would declare the job successfully (if totalJobs is greater than value suggested). If one specified success_jobs = 2.

Other counter point is that we don't require certain replicated jobs to finish (say driver/worker where workers finish). In this case our job is successfully but the driver doesn't technically finish.

I think the condition is going to be the reliable approach.

kannon92 commented 2 months ago

@ahg-g @danielvegamyhre WDYT?

shrinandj commented 2 months ago

The goal wouldn't be to remove existing conditions or replicatedJobStatuses. They have to be there to present the complete picture. The top level numerical metrics provide (a) a summary of overall status and (b) usable for cases like the one mentioned above.

googs1025 commented 2 months ago

Maybe we can have a jobset's own status, so that argo workflow can also use this status to determine whether it is completed.

NAME             RESTARTS   COMPLETED   SUSPENDED   AGE.   Status
failure-policy   3                                  67m     Running
network-jobset                                      54m     Succeed
paralleljobs                True                    55m.    Failed
shrinandj commented 2 months ago

Sure.. Having a Jobset's own top level status field simply indicating whether it is Running, Succeeded, Failed (or Suspended) would work too.

kannon92 commented 2 months ago

I think a phase field in status would be useful. I was surprised we don't actually have one.

@ahg-g @danielvegamyhre

WDYT of adding a phase field in JobSet.Status that has the options of "Running, Succeeded, Failed or Suspended"?

danielvegamyhre commented 2 months ago

Yeah I agree. This would be useful for higher level tools which use JobSet as a building block, like XPK.

kannon92 commented 2 months ago

Nice idea @googs1025!

ahg-g commented 2 months ago

Conditions is the new "Phase". But if phase makes it easier to consume, I am fine with that.

shrinandj commented 2 months ago

If there is a jobset with multiple jobs and say some of them are running while the others have failed, what would the status of the jobset be: Running? Failed?

One benefit of simply having the numbers (totalJobs, succeeded, failed) is that it might slightly less ambiguous and it leaves the caller to decide how to interpret the numbers. And if they care specifically about which job succeeded or failed, they can look at conditions or replicatedJobStatuses.

danielvegamyhre commented 2 months ago

If there is a jobset with multiple jobs and say some of them are running while the others have failed, what would the status of the jobset be: Running? Failed?

One benefit of simply having the numbers (totalJobs, succeeded, failed) is that it might slightly less ambiguous and it leaves the caller to decide how to interpret the numbers. And if they care specifically about which job succeeded or failed, they can look at conditions or replicatedJobStatuses.

It would depend on the success policy and failure policy defined in the JobSet.

Defaults are:

These policies are configurable though.

ahg-g commented 2 months ago

basically phase will be set once the JobSet reaches a terminal state

kannon92 commented 2 months ago

@ahg-g is correct. "Conditions" are the reliable option for most kubernetes controllers. Job uses conditions to detect certain states and other workflows support that.

It would be ideal to work with argo-workflows on supporting conditions for extended resources. For JobSet I think a phase is fine but I can't say the same for other projects.

Happy to support the jobset change though.

shrinandj commented 2 months ago

Just to make this concrete, the new status field will be called phase and the status of the jobset will look like this, right?

status:
  phase: running # or suspended | succeeded | failed 
  conditions:
  - lastTransitionTime: "2024-04-22T18:10:09Z"
....
  replicatedJobsStatus:
  - active: 0
...
jessicaxiejw commented 1 month ago

@shrinandj Are you planning on merging the jobset integration with argo workflow upstream? I am looking into implementing something similar.

shrinandj commented 1 month ago

Are you planning on merging the jobset integration with argo workflow upstream? I am looking into implementing something similar.

I haven't been able to get to this yet. If you want to go ahead with the implementation, I can help review it and test it and provide any other feedback.

Just to ensure we're on the same page: Argo can create any K8s resources based on a yaml/json file. But it needs a way to easily read back the status of the created object. This issue is to make that possible by having a simple high level status.

So, by "jobset integration with argo workflow", you mean adding this status field in the jobset object, correct?

ahg-g commented 1 month ago

I am supportive of this change if someone would like to take on this task, and we can track it for 0.7.0

shrinandj commented 4 weeks ago

@jessicaxiejw Let me know when you have something ready to test. I can certainly take it out for a spin and provide feedback.

danielvegamyhre commented 3 weeks ago

@jessicaxiejw if you want to implement this that would be super helpful! Please go ahead and assign yourself if you plan on working on this, or let us know if not and we can find another owner.

jessicaxiejw commented 3 weeks ago

Sorry about the late response. I am not working on it right now 😅 @shrinandj feel free taking a stab at it.

We decided to go with a completely different direction.

googs1025 commented 3 weeks ago

/assign

googs1025 commented 3 weeks ago

I want to try it. This is my opinion on the current situation. @danielvegamyhre @kannon92

// the status of JobSet
const (
        // JobSetRunning means the job is running.
    JobSetRunning JobSetConditionType = "Running"
    // JobSetCompleted means the job has completed its execution.
    JobSetCompleted JobSetConditionType = "Completed"
    // JobSetFailed means the job has failed its execution.
    JobSetFailed JobSetConditionType = "Failed"
    // JobSetSuspended means the job is suspended.
    JobSetSuspended JobSetConditionType = "Suspended"
)