skypilot-org / skypilot

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.31k stars 434 forks

[Clouds] Support slurm as a backend #3072

Open Michaelvll opened 5 months ago

Michaelvll commented 5 months ago

Users have been asking how SkyPilot should interact with Slurm clusters. We should decide how to handle Slurm, i.e., whether to treat it only as a job scheduler or also as a way to start new clusters.

michaelzhiluo commented 5 months ago

A major feature that users want in Slurm, as well as in Kubernetes, is the ability to allocate or reserve parts of each node and to create virtual clusters. This fits SkyPilot's vision well, since it is already implemented in SkyPilot's Kubernetes support.

Regarding the job scheduler, we already have an implementation for Slurm, which falls under another lab project. Let's discuss further if needed.