Open shrinandj opened 2 months ago
Is it not possible to use Conditions for this purpose? We emit a conditon with the job is successful or failed and you could check that.
The reason we have an array of this is because replicated jobs can all finish at different times so we can't really flatten the array to report jobs without that information.
In some cases, it is not possible to use array notations and filters for specific conditions.
In the case of Argo workflows, they use a label selector to parse the success and failure conditions (which I see only works with key
= value
pattern).
Is it possible to have totalJobs
in status populated when the jobset is created and then populate succeededJobs
and failedJobs
when the AllJobsCompleted
condition is being populated?
The concern is that I think that would be insufficient for all use cases. If you had a lot of replicated jobs you may have some of the jobs finishing while others are not.
say JobSet with rep a, rep b, and rep c jobs starting in parallel. User really wants all of rep c to finish to declare the job successful but rep a and rep b finish and user uses a count based approach. This would end up with a case where you would declare the job successfully (if totalJobs is greater than value suggested). If one specified success_jobs = 2.
Other counter point is that we don't require certain replicated jobs to finish (say driver/worker where workers finish). In this case our job is successfully but the driver doesn't technically finish.
I think the condition is going to be the reliable approach.
@ahg-g @danielvegamyhre WDYT?
The goal wouldn't be to remove existing conditions
or replicatedJobStatuses
. They have to be there to present the complete picture. The top level numerical metrics provide (a) a summary of overall status and (b) usable for cases like the one mentioned above.
Maybe we can have a jobset's own status, so that argo workflow can also use this status to determine whether it is completed.
NAME RESTARTS COMPLETED SUSPENDED AGE. Status
failure-policy 3 67m Running
network-jobset 54m Succeed
paralleljobs True 55m. Failed
Sure.. Having a Jobset's own top level status field simply indicating whether it is Running, Succeeded, Failed (or Suspended
) would work too.
I think a phase field in status would be useful. I was surprised we don't actually have one.
@ahg-g @danielvegamyhre
WDYT of adding a phase field in JobSet.Status that has the options of "Running, Succeeded, Failed or Suspended"?
Yeah I agree. This would be useful for higher level tools which use JobSet as a building block, like XPK.
Nice idea @googs1025!
Conditions is the new "Phase". But if phase makes it easier to consume, I am fine with that.
If there is a jobset with multiple jobs and say some of them are running while the others have failed, what would the status of the jobset be: Running? Failed?
One benefit of simply having the numbers (totalJobs, succeeded, failed) is that it might slightly less ambiguous and it leaves the caller to decide how to interpret the numbers. And if they care specifically about which job succeeded or failed, they can look at conditions
or replicatedJobStatuses
.
If there is a jobset with multiple jobs and say some of them are running while the others have failed, what would the status of the jobset be: Running? Failed?
One benefit of simply having the numbers (totalJobs, succeeded, failed) is that it might slightly less ambiguous and it leaves the caller to decide how to interpret the numbers. And if they care specifically about which job succeeded or failed, they can look at
conditions
orreplicatedJobStatuses
.
It would depend on the success policy and failure policy defined in the JobSet.
Defaults are:
These policies are configurable though.
basically phase will be set once the JobSet reaches a terminal state
@ahg-g is correct. "Conditions" are the reliable option for most kubernetes controllers. Job uses conditions to detect certain states and other workflows support that.
It would be ideal to work with argo-workflows on supporting conditions for extended resources. For JobSet I think a phase is fine but I can't say the same for other projects.
Happy to support the jobset change though.
Just to make this concrete, the new status field will be called phase
and the status of the jobset will look like this, right?
status:
phase: running # or suspended | succeeded | failed
conditions:
- lastTransitionTime: "2024-04-22T18:10:09Z"
....
replicatedJobsStatus:
- active: 0
...
@shrinandj Are you planning on merging the jobset integration with argo workflow upstream? I am looking into implementing something similar.
Are you planning on merging the jobset integration with argo workflow upstream? I am looking into implementing something similar.
I haven't been able to get to this yet. If you want to go ahead with the implementation, I can help review it and test it and provide any other feedback.
Just to ensure we're on the same page: Argo can create any K8s resources based on a yaml/json file. But it needs a way to easily read back the status of the created object. This issue is to make that possible by having a simple high level status.
So, by "jobset integration with argo workflow", you mean adding this status field in the jobset object, correct?
I am supportive of this change if someone would like to take on this task, and we can track it for 0.7.0
@jessicaxiejw Let me know when you have something ready to test. I can certainly take it out for a spin and provide feedback.
@jessicaxiejw if you want to implement this that would be super helpful! Please go ahead and assign yourself if you plan on working on this, or let us know if not and we can find another owner.
Sorry about the late response. I am not working on it right now 😅 @shrinandj feel free taking a stab at it.
We decided to go with a completely different direction.
/assign
I want to try it. This is my opinion on the current situation. @danielvegamyhre @kannon92
// the status of JobSet
const (
// JobSetRunning means the job is running.
JobSetRunning JobSetConditionType = "Running"
// JobSetCompleted means the job has completed its execution.
JobSetCompleted JobSetConditionType = "Completed"
// JobSetFailed means the job has failed its execution.
JobSetFailed JobSetConditionType = "Failed"
// JobSetSuspended means the job is suspended.
JobSetSuspended JobSetConditionType = "Suspended"
)
What would you like to be added: Currently, the
status
of a jobset object consists only of two arrays:conditions
andreplicatedJobsStatus
. This is great for providing detailed status of individual conditions and replicated jobs.However, it makes referencing the overall status of the jobset a little harder.
Along with the current
conditions
andreplicatedJobsStatus
fields, it will be useful to have at least 3 additional simple fields in the status indicating the overall status of the jobset, similar to K8s jobs:Why is this needed: We were trying to run Jobsets via Argo-workflows as part of a distributed training setup. Argo-workflows can create Kubernetes resources. In order to do this correctly, Argo allows users to specify a
successCondition
andfailureCondition
. These conditions need to be simplekey=value
conditions. Referencing array elements, filtering for specific conditions, etc. is not possible.If the new fields are available, adding them to Argo or even using any other K8s tools would become that much easier: