kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.62k stars, 701 forks

Create Slurm runtime for model training using V2 APIs #2249

Open andreyvelich opened 2 months ago

andreyvelich commented 2 months ago

What you would like to be added?

As we discussed during the last Training WG call, we want to design and implement a Training Runtime for Slurm, so that users can leverage the Slurm workload manager for model training on Kubernetes.

Recording: https://youtu.be/IBDyYUbB0UA

We can continue discussions once we implement the Training Operator V2 APIs.

cc @kubeflow/wg-training-leads @catblade

/area runtime

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

andreyvelich commented 2 months ago

/remove-label lifecycle/needs-triage