kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0

Support vcjob in kueue #1204

Open kerthcet opened 10 months ago

kerthcet commented 10 months ago

What would you like to be added:

Support Volcano Job (a.k.a. vcjob).

Why is this needed:

Volcano already supports job queueing, but in a cluster with multiple schedulers, or in a multi-cluster scenario, we would like to have a resource management component in front. AFAIK, Volcano also supports suspend semantics.

Also cc @GhangZh, who has run some experiments on this.

This can be an experimental support.

Completion requirements:

This enhancement requires the following artifacts:

The artifacts should be linked in subsequent comments.

alculquicondor commented 10 months ago

sgtm

alexeldeib commented 7 months ago

I see in #1269 @kerthcet mentioned

but this involves contributions to volcano upstream, still WIP.

Does anyone have additional context? I'd be interested in seeing this land and would like to understand the current blockers and whether the community can help. I found https://kueue.sigs.k8s.io/docs/tasks/integrate_a_custom_job/ and other implementations, but it wasn't clear to me whether there are any missing dependencies or quirks in functionality (I see some called out for Ray).

kerthcet commented 7 months ago

Yes, our friend @GhangZh already has some practice with this, but it requires supporting Suspend in the Volcano project. I don't have enough time right now, so we need volunteers.

By the way, @alexeldeib, can you describe your scenarios? That would help us better understand the feature.

alexeldeib commented 7 months ago

We use volcano plugins to do some automatic field injection today (pytorch, ssh for mpirun, similar stuff), we can probably do it the manual way with kueue + jobsets, although it's a nice convenience layer to write simpler manifests.

I see Suspend exists in Volcano APIs/CLI today? What's actually missing?

alexeldeib commented 7 months ago

looks like it's an issue of the kueue <-> volcano APIs? volcano uses commands but kueue expected a spec field to edit with no return value to pass to client.Update() https://github.com/kubernetes-sigs/kueue/blob/3b37fbf0a06f7778d40bc656bbe312aeaabcc2e9/pkg/controller/jobframework/reconciler.go#L252-L253
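
A minimal sketch of the declarative pattern described above, written in Python for brevity (Kueue itself is Go). Every name here (`ToyJob`, `FakeClient`, `stop_job`) is hypothetical and is not part of the Kueue or Volcano APIs. The point it illustrates: `suspend()` only flips a spec field on the in-memory object and returns nothing, and the caller then persists the whole object with a single update call.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyJob:
    """Hypothetical job object with a declarative suspend flag in its spec."""
    name: str
    spec: dict = field(default_factory=lambda: {"suspend": False})

    def is_suspended(self) -> bool:
        return self.spec["suspend"]

    def suspend(self) -> None:
        # Mutates the in-memory object only; nothing is returned or sent anywhere.
        self.spec["suspend"] = True


class FakeClient:
    """Stand-in for an API client: stores a copy of whatever is passed to update()."""

    def __init__(self) -> None:
        self.stored = {}

    def update(self, job: ToyJob) -> None:
        self.stored[job.name] = copy.deepcopy(job)


def stop_job(client: FakeClient, job: ToyJob) -> bool:
    """Suspend the job if needed and persist it. Returns True if an update was sent."""
    if job.is_suspended():
        return False
    job.suspend()
    client.update(job)
    return True
```

A job type whose "suspend" is triggered by creating a separate object rather than editing its own spec does not fit this shape directly, which seems to be the mismatch raised here.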

alculquicondor commented 7 months ago

You can also try to run PyTorch and MPI with kubeflow, for which the support is already there.

kueue expected a spec field to edit with no return value to pass to client.Update()

Kueue needs to be able to tell Volcano when it's time to create Pods (because the job was admitted) or when it's time to delete pods (because the job was preempted).

volcano uses commands

Not sure what you mean.
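
A toy model of the contract described in this comment, assuming a single boolean suspend flag is the signal between the queue and the job controller. The names (`ToyJob`, `reconcile`, `admit`, `preempt`) are hypothetical and exist only for illustration. Admission clears the flag so the controller creates Pods, and preemption sets it so the controller deletes them.

```python
from dataclasses import dataclass, field


@dataclass
class ToyJob:
    name: str
    replicas: int
    suspend: bool = True  # jobs start suspended until the queue admits them
    pods: list = field(default_factory=list)


def reconcile(job: ToyJob) -> None:
    """Hypothetical job controller: make the pod list match the suspend flag."""
    if job.suspend:
        job.pods.clear()
        return
    # Create only the missing pods so repeated reconciles are idempotent.
    for i in range(len(job.pods), job.replicas):
        job.pods.append(f"{job.name}-{i}")


def admit(job: ToyJob) -> None:
    """Queue side: the job was admitted, so let it run."""
    job.suspend = False
    reconcile(job)


def preempt(job: ToyJob) -> None:
    """Queue side: the job was preempted, so stop it and free its pods."""
    job.suspend = True
    reconcile(job)
```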

ace-cohere commented 7 months ago

volcano uses commands

Not sure what you mean.

I was trying to follow the issue with Suspend. Maybe there's no issue :D

It looked to me like kueue expects a declarative way to do things like job.Suspend

volcano has some commands which create separate CRDs which a reconciler acts on to mutate Job state accordingly (something like a shared bus for communication?). I don't see how that can fit into the suspend model kueue expects.

I see volcano can do it declaratively too though, if you just set .status.state.phase = aborted? or similar

Honestly, I'm leaning toward kueue + jobset + a thin layer to replace the volcano plugins, to avoid human errors in yaml-ing. Maybe kustomize functions or helm or something, not sure...
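
A toy model of the command-style flow as it is described in this comment: a separate command object names a target job and an action, and a controller consumes it and records the result in the job's status. The class names, the action strings `"abort"` and `"resume"`, and the phase values are illustrative assumptions, not Volcano's actual API.

```python
from dataclasses import dataclass


@dataclass
class ToyJob:
    name: str
    phase: str = "Running"  # observed state, as it would appear in status


@dataclass
class Command:
    target: str  # name of the job the command applies to
    action: str  # "abort" or "resume" in this toy model


def process_commands(jobs: dict, commands: list) -> None:
    """Hypothetical controller loop: apply each pending command, then drop it."""
    transitions = {"abort": "Aborted", "resume": "Running"}
    while commands:
        cmd = commands.pop(0)
        job = jobs.get(cmd.target)
        if job is None or cmd.action not in transitions:
            continue  # unknown target or action: ignore the command
        job.phase = transitions[cmd.action]
```

In this model the caller never edits the job itself. It creates a command and waits for the controller to act, which is the part that does not map directly onto a "flip a spec field" suspend model.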

alculquicondor commented 7 months ago

That is a very uncommon design 🤔

I would imagine that at some point the "command" translates into a change into a Job CRD. But I could be mistaken.

In any case, the job reconciler supports a few different interfaces that let you customize how various actions are performed, and they don't necessarily assume a declarative API.
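
A rough sketch of that idea, assuming the reconciler checks for an optional custom stop hook before falling back to the declarative path. It is written in Python for brevity, and all names (`DeclarativeJob`, `CommandDrivenJob`, `stop_job`) are hypothetical; Kueue's real extension points are Go interfaces in its jobframework package and may differ in shape.

```python
class FakeClient:
    """Stand-in for an API client that records which jobs were updated."""

    def __init__(self):
        self.updates = []

    def update(self, job):
        self.updates.append(job.name)


class DeclarativeJob:
    """Job type that is stopped by flipping a field and updating the object."""

    def __init__(self, name):
        self.name = name
        self.suspended = False

    def suspend(self):
        self.suspended = True


class CommandDrivenJob:
    """Job type that is stopped by emitting a command instead of editing a field."""

    def __init__(self, name, outbox):
        self.name = name
        self.outbox = outbox

    def stop(self, client):
        # Custom stop hook: the job decides how stopping is carried out.
        self.outbox.append((self.name, "abort"))


def stop_job(client, job):
    """Use the job's custom stop hook if it has one, else suspend and update."""
    custom_stop = getattr(job, "stop", None)
    if callable(custom_stop):
        custom_stop(client)
        return "custom"
    job.suspend()
    client.update(job)
    return "declarative"
```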

ace-cohere commented 7 months ago

For posterity, for future aspiring users/issue solvers:

I would imagine that at some point the "command" translates into a change into a Job CRD. But I could be mistaken.

yeah, it looks like the second set of links might work, and I think all the actions controller does is translate that into the status field like you said. I haven't tried manually setting aborted/aborting phase from volcano. frankly that design confused me too. It looks like it'd work as I described declaratively, but you're effectively driving desired state from status, which is weird. Maybe that's why it's a separate CRD? I'd personally do a spec field, but I see the semantics are weird -- something like "suspend" is a user-initiated spec/action change, while you want to track status of the job separately shrugs

I probably will not personally pursue this path further but sharing my thoughts for anyone who chooses to, and thanks @alculquicondor and @kerthcet for the quick replies/tips 🙂

(and I’ll check out kubeflow — sounds like maybe that’s along the lines of what I want)

kerthcet commented 7 months ago

I was trying to follow the issue with Suspend. Maybe there's no issue :D

I checked the Volcano code. The aborted action is somewhat similar to Suspend, as you said, so I think we can use that directly.

kerthcet commented 7 months ago

We use volcano plugins to do some automatic field injection today (pytorch, ssh for mpirun, similar stuff), we can probably do it the manual way with kueue + jobsets

If I understand correctly, going this way you would no longer need Volcano at all. What we want here is for Kueue to be able to manage vcjob as well.

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

kerthcet commented 4 months ago

Will talk with @GhangZh offline to see how to push this forward.

/remove-lifecycle stale

k8s-triage-robot commented 1 month ago

/lifecycle stale

kerthcet commented 1 month ago

/remove-lifecycle stale