run-house / runhouse

Dispatch and distribute your ML training to "serverless" clusters in Python, like PyTorch for ML infra. Iterable, debuggable, multi-cloud/on-prem, identical across research and production.
https://run.house
Apache License 2.0

Running jobs in a Slurm cluster #16

Open ThomasA opened 1 year ago

ThomasA commented 1 year ago

The feature It would be interesting if Runhouse could also interface with an existing Slurm cluster.

Motivation I am part of a team managing a Slurm (GPU) cluster. On the other hand, I have users who are interested in running large language models via Runhouse (https://langchain.readthedocs.io/en/latest/modules/llms/integrations/self_hosted_examples.html). It would be excellent if I could bridge this gap between supply and demand with Runhouse. From what I have read in the documentation, Runhouse does not currently come with an interface to Slurm.

What the ideal solution looks like I am completely new to Runhouse, so this may not be the ideal solution model, but I imagine this could be supported as a bring-your-own cluster, with a little extra interaction between Runhouse and Slurm to request the necessary resources (maybe from the Cluster factory method) as a job or jobs in Slurm (probably through the Slurm REST API). Once the jobs are running, the nodes involved can be contacted by Runhouse as a BYO cluster.
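A very rough sketch of what I have in mind, assuming the Slurm REST API (slurmrestd) is reachable and that the allocated nodes are then attached via Runhouse's static/BYO cluster factory. The endpoint paths, API version, field names, hostnames, and credentials below are illustrative guesses, not a tested implementation:

```python
# Sketch: allocate nodes via the Slurm REST API, then attach to them as a
# Runhouse bring-your-own (static) cluster. Endpoint paths, the API version,
# and response fields vary by slurmrestd deployment; treat these as examples.
import socket
import time

import requests
import runhouse as rh

SLURMRESTD = "http://slurm-head:6820/slurm/v0.0.38"  # hypothetical endpoint
HEADERS = {
    "X-SLURM-USER-NAME": "thomas",        # hypothetical user
    "X-SLURM-USER-TOKEN": "<jwt-token>",  # e.g. issued by `scontrol token`
}

# 1. Submit a placeholder job that holds GPU nodes for Runhouse to use.
job_spec = {
    "job": {
        "name": "runhouse-alloc",
        "partition": "gpu",
        "nodes": 2,
        "tasks": 2,
        "time_limit": 120,  # minutes
        "environment": {"PATH": "/usr/bin:/bin"},
    },
    "script": "#!/bin/bash\nsleep infinity\n",  # keep the allocation alive
}
resp = requests.post(f"{SLURMRESTD}/job/submit", json=job_spec, headers=HEADERS)
job_id = resp.json()["job_id"]

# 2. Poll until the job is running, then read back the allocated nodes.
while True:
    job = requests.get(f"{SLURMRESTD}/job/{job_id}", headers=HEADERS).json()["jobs"][0]
    if job["job_state"] == "RUNNING":
        # Assumes a plain comma-separated node list; bracketed hostlist
        # ranges (e.g. node[01-02]) would need expansion first.
        hostnames = job["nodes"].split(",")
        node_ips = [socket.gethostbyname(h) for h in hostnames]
        break
    time.sleep(5)

# 3. Hand the nodes to Runhouse as a static (bring-your-own) cluster.
cluster = rh.cluster(
    name="slurm-byo",
    ips=node_ips,
    ssh_creds={"ssh_user": "thomas", "ssh_private_key": "~/.ssh/id_rsa"},
)
cluster.run(["nvidia-smi"])  # sanity check that Runhouse can reach the nodes
```

The key assumption is that the Slurm job only reserves the nodes (here with a `sleep infinity` script) and Runhouse then talks to them over SSH like any other BYO cluster; tearing down would just mean cancelling the Slurm job.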

dongreenberg commented 1 year ago

This is very interesting and we've actually been digging into it for a few weeks. It seems doable pending a few questions about how the cluster is set up. Would you be willing to jump on a quick call just to answer a few questions and talk through our approach?

ThomasA commented 1 year ago

I would like to help out and can probably also help test it in our cluster. I can be available for a call at US-friendly times tomorrow and maybe Friday. Can you email me to coordinate?

andre15silva commented 1 year ago

I am also interested in getting runhouse interfacing with a Slurm cluster.

Has there been any progress recently on this issue?

jlewitt1 commented 1 year ago

Hey Andre, we're still in the POC stage - we'd be happy to speak with you to hear more about your requirements and how the integration could work for your setup. Just sent you an email to coordinate.

eugene-prout commented 1 year ago

Hello,

I have a similar setup to those above and would like to try out Runhouse. Are there any updates on this issue? I have experience interfacing with Slurm clusters, so I'd be happy to contribute if that would help get this past the POC stage.

jlewitt1 commented 1 year ago

Hi Eugene, thanks for reaching out! We'd love to support Slurm; it's on our roadmap along with other compute providers (e.g. k8s), and we hope to get Slurm support into the next release or the one after. In the meantime, we'd be happy to hear your thoughts and a possible contribution on this! Sent you an email to discuss further.