Open kerthcet opened 1 year ago
sgtm
I see in #1269 @kerthcet mentioned
but this involves contributions to volcano upstream, still WIP.
Does anyone have additional context? I'd be interested in seeing this land and would like to understand the current blockers and whether the community can help. I found https://kueue.sigs.k8s.io/docs/tasks/integrate_a_custom_job/ and other implementations, but it wasn't clear to me whether there are any missing dependencies or quirks in functionality (I see some called out for Ray).
Yes, our friend @GhangZh already has some experience with this, but it requires supporting Suspend in the volcano project. I don't have enough time right now, so we need volunteers.
By the way, can you describe your scenarios? That would help us better understand the feature. @alexeldeib
We use volcano plugins to do some automatic field injection today (pytorch, ssh for mpirun, similar stuff), we can probably do it the manual way with kueue + jobsets, although it's a nice convenience layer to write simpler manifests.
I see Suspend exists in Volcano APIs/CLI today? What's actually missing?
looks like it's an issue of the kueue <-> volcano APIs? volcano uses commands but kueue expected a spec field to edit with no return value to pass to client.Update() https://github.com/kubernetes-sigs/kueue/blob/3b37fbf0a06f7778d40bc656bbe312aeaabcc2e9/pkg/controller/jobframework/reconciler.go#L252-L253
You can also try to run PyTorch and MPI with kubeflow, for which the support is already there.
kueue expected a spec field to edit with no return value to pass to client.Update()
Kueue needs to be able to tell Volcano when it's time to create Pods (because the job was admitted) or when it's time to delete pods (because the job was preempted).
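A minimal sketch of that contract, written as a language-neutral Python model rather than actual kueue or volcano code. The `Job` class and `reconcile` function are hypothetical names used only for illustration; the one assumption is that the job exposes a boolean suspend flag that its own controller reacts to by creating or deleting Pods.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical stand-in for any job object that carries a suspend flag."""
    name: str
    suspend: bool = True  # created suspended, so no Pods exist yet


def reconcile(job: Job, admitted: bool) -> str:
    """Bring the job's suspend flag in line with the admission decision.

    Returns a short description of the action taken.
    """
    if admitted and job.suspend:
        job.suspend = False  # the job's own controller may now create Pods
        return "resumed"
    if not admitted and not job.suspend:
        job.suspend = True  # the job's own controller should now delete Pods
        return "suspended"
    return "no-op"
```

The queueing layer only flips the flag; creating and deleting Pods stays with the job's own controller, which is why the integration depends on the job type supporting some form of suspend.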
volcano uses commands
Not sure what you mean.
I was trying to follow the issue with Suspend. Maybe there's no issue :D
It looked to me like kueue expects a declarative way to do things like job.Suspend
volcano has some commands which create separate CRDs which a reconciler acts on to mutate Job state accordingly (something like a shared bus for communication?). I don't see how that can fit into the suspend model kueue expects.
I see volcano can do it declaratively too though, if you just set .status.state.phase = aborted (or similar)?
Honestly, I'm leaning kueue + jobset + a thin layer to replace volcano plugins to avoid human errors in yaml-ing. maybe kustomize functions or helm or something, not sure...
That is a very uncommon design 🤔
I would imagine that at some point the "command" translates into a change into a Job CRD. But I could be mistaken.
In any case, the job reconciler supports a few different interfaces that allow you to modify how several actions are performed and that don't necessarily assume a declarative API.
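As a rough illustration of how a command-driven job could still fit behind such an interface, here is a sketch in Python. It is a model only, not kueue's or volcano's real API: `Command`, `CommandDrivenJob`, `DeclarativeJob`, `process_commands`, and `stop_job` are hypothetical names, and the `AbortJob` action and `Aborting`/`Aborted` phases are assumed from the discussion in this thread.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Command:
    """Hypothetical imperative request, loosely modelled on a command object."""
    action: str
    target: str


@dataclass
class DeclarativeJob:
    """Job that is stopped by flipping a flag on the object itself."""
    name: str
    suspend: bool = False


@dataclass
class CommandDrivenJob:
    """Job that is stopped by issuing a command for another controller."""
    name: str
    phase: str = "Running"
    pending: List[Command] = field(default_factory=list)

    def is_suspended(self) -> bool:
        # Assumption: an aborting or aborted phase is treated as "suspended".
        return self.phase in ("Aborting", "Aborted")

    def stop(self) -> bool:
        """Custom stop hook: queue an abort command unless one is in flight."""
        if self.is_suspended() or self.pending:
            return False
        self.pending.append(Command(action="AbortJob", target=self.name))
        return True


def process_commands(job: CommandDrivenJob) -> None:
    """Stand-in for the controller that turns commands into a phase change."""
    for command in job.pending:
        if command.action == "AbortJob":
            job.phase = "Aborted"
    job.pending.clear()


def stop_job(job) -> bool:
    """Prefer a job's own stop hook; otherwise fall back to the suspend flag."""
    custom_stop = getattr(job, "stop", None)
    if callable(custom_stop):
        return custom_stop()
    if job.suspend:
        return False
    job.suspend = True
    return True
```

The caller only ever asks "stop this job" and "is it suspended?"; whether that turns into a field update or a command object is hidden inside the job adapter.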
for posterity for future aspirational users/issue solvers
I would imagine that at some point the "command" translates into a change into a Job CRD. But I could be mistaken.
yeah, it looks like the second set of links might work, and I think all the actions controller does is translate that into the status field like you said. I haven't tried manually setting aborted/aborting phase from volcano. frankly that design confused me too. It looks like it'd work as I described declaratively, but you're effectively driving desired state from status, which is weird. Maybe that's why it's a separate CRD? I'd personally do a spec field, but I see the semantics are weird -- something like "suspend" is a user-initiated spec/action change, while you want to track status of the job separately shrugs
I probably will not personally pursue this path further but sharing my thoughts for anyone who chooses to, and thanks @alculquicondor and @kerthcet for the quick replies/tips 🙂
(and I’ll check out kubeflow — sounds like maybe that’s along the lines of what I want)
I was trying to follow the issue with Suspend. Maybe there's no issue :D
I checked the volcano code; the aborted action is somewhat similar to Suspend, as you said. I think we can use that directly.
We use volcano plugins to do some automatic field injection today (pytorch, ssh for mpirun, similar stuff), we can probably do it the manual way with kueue + jobsets
Then, if I understood correctly, going this way you no longer need volcano at all. What we want here is for kueue to be able to manage vcjob as well.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
lifecycle/stale is applied
lifecycle/stale was applied, lifecycle/rotten is applied
lifecycle/rotten was applied, the issue is closed
You can:
/remove-lifecycle stale
/close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Will talk with @GhangZh offline to see how to push this forward. /remove-lifecycle stale
/remove-lifecycle stale
What would you like to be added:
Support volcano job (aka vcjob)
Why is this needed:
As we know, volcano also supports job queueing, but in a cluster with multiple schedulers or in a multi-cluster scenario, we would like to have a resource management component in front. AFAIK, volcano also supports suspend semantics.
Also cc @GhangZh, who has run some experiments on this.
This can be experimental support.
Completion requirements:
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.