Hi there, thanks for your interest in this project!
Things are actually easier if you aren't using Slurm as the task scheduler: you can ignore the leading #SBATCH prefixes in the sbatch files and treat each sbatch file as a vanilla bash script.
To make it run, you should set the SLURM-related environment variables yourself. You may check the official PyTorch tutorial for multi-node training: https://pytorch.org/tutorials/intermediate/ddp_series_multinode.html
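For example, here is a minimal sketch of a Slurm-free launch using torchrun, assuming two nodes with four GPUs each; the hostname `node0.example.com` and the entry script `train.py` are placeholders for your actual setup:

```bash
#!/bin/bash
# Run this same script on every node, changing only NODE_RANK.
# Placeholders: node0.example.com (rank-0 node's hostname) and train.py.

export MASTER_ADDR=node0.example.com  # reachable address of the rank-0 node
export MASTER_PORT=29500              # any free port, identical on all nodes
NODE_RANK=0                           # set to 1 on the second node

# torchrun performs the rendezvous and sets RANK / LOCAL_RANK / WORLD_SIZE
# for each worker process, so the training script never touches Slurm.
torchrun \
  --nnodes=2 \
  --nproc_per_node=4 \
  --node_rank=${NODE_RANK} \
  --rdzv_backend=c10d \
  --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
  train.py
```

Once the script is started on both nodes, torchrun rendezvous at the shared endpoint and spawns the worker processes on each node.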
```
sbatch: command not found
```
Since I have to launch a new container environment for each experiment, this GPU cluster doesn't come with Slurm installed, and I have never worked with Slurm before, installing and configuring Slurm seems quite troublesome. Is there any other way to implement a multi-node, multi-GPU setup? Thanks!