pjlab-sys4nlp / llama-moe

⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training
https://arxiv.org/abs/2406.16554
Apache License 2.0

If I can't configure Slurm on a cluster, does that mean I can't use multi-node multi-GPU setups? #64

Closed rzr002 closed 6 months ago

rzr002 commented 6 months ago

sbatch: command not found

Since I have to launch a new container environment for each experiment, I have never worked with Slurm before, and this GPU cluster doesn't come with Slurm installed, installing and configuring Slurm seems quite troublesome. Is there any other way to implement multi-node, multi-GPU training? Thanks.

Spico197 commented 6 months ago

Hi there, thanks for your interest in this project~

Things are actually easier if you don't use Slurm as the task scheduler: you can ignore the leading #SBATCH directives in the sbatch files and treat each sbatch file as a vanilla bash script.
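Since every `#SBATCH` line begins with `#`, plain bash treats it as a comment. A minimal illustration (the directive values below are made up for the example, not copied from the repo's scripts):

```bash
#!/usr/bin/env bash
#SBATCH --job-name=llama_moe    # Slurm directive; plain bash sees only a comment
#SBATCH --nodes=2               # ignored when the file is run with `bash script.sh`
#SBATCH --gres=gpu:8            # ignored as well

# Everything below runs the same whether launched via `sbatch` or plain `bash`.
echo "starting training..."
```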

To make it run without Slurm, you should launch the script on each node yourself and supply the distributed-training settings (master address and port, number of nodes, node rank) that Slurm would otherwise provide.

You may check the official PyTorch tutorial on multi-node training: https://pytorch.org/tutorials/intermediate/ddp_series_multinode.html
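For the launch itself, here is a minimal sketch with `torchrun` on two nodes, assuming 8 GPUs per node; the IP address, port, script name, and arguments are placeholders, not the repo's actual values:

```bash
# On node 0 (the rendezvous host; replace 10.0.0.1 with that node's real IP):
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \
  --master_addr=10.0.0.1 --master_port=29500 \
  train.py --your-training-args

# On node 1 (identical command, only --node_rank differs):
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=1 \
  --master_addr=10.0.0.1 --master_port=29500 \
  train.py --your-training-args
```

Each node has to run its own command, so without Slurm you would SSH into every node (or bake the command into your container entrypoint) and start it there manually.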