Open sadamov opened 1 month ago
An update on my testing of this: the SLURM constants are also read correctly on our cluster, but I have not yet been able to get multi-node training working. I think this is unrelated to this code, however, and more likely due to me not having the correct setup for running multi-node jobs on our cluster. I will ask around to see if I can get it working.
In the meantime, @leifdenby (or anyone at DMI :smile:), do you have a SLURM setup that you could test this on? I think it's a good idea to test on multiple different clusters to make sure this is general enough.
I have implemented the latest feedback, updated the CHANGELOG, and added an example SLURM submission script to /docs/examples (is that a good location?), as discussed with @leifdenby. A small new section was also added to the README.md.
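For anyone who wants to try this out while reviewing, a submission script along these lines should work as a starting point. This is a sketch, not the exact script added under /docs/examples: the job name, partition, and `train_model.py` entry point are placeholders that will differ per cluster.

```bash
#!/bin/bash
#SBATCH --job-name=multi-node-training
#SBATCH --nodes=2                 # exposed to the code as SLURM_JOB_NUM_NODES
#SBATCH --gpus-per-node=4         # exposed to the code as SLURM_GPUS_PER_NODE
#SBATCH --ntasks-per-node=4       # one task per GPU, as PyTorch Lightning expects
#SBATCH --time=24:00:00
#SBATCH --partition=gpu           # placeholder: your cluster's GPU partition

# srun starts one process per task; PyTorch Lightning reads the SLURM
# environment variables and sets up distributed training across the nodes
srun python train_model.py
```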
@joeloskarsson yes, every cluster is different and I also have to adapt my submission scripts after major changes. Do you have ticket support with your HPC provider? They usually know what to do...
Enable multi-node GPU training with SLURM
This PR adds support for multi-node GPU training using the SLURM job scheduler. The changes allow the code to detect if it is running within a SLURM job and automatically configure the number of devices and nodes based on the SLURM environment variables.
Key changes
- Set `use_distributed_sampler` to `True` when not in evaluation mode to enable distributed training
- Detect whether the code is running within a SLURM job by checking for the `SLURM_JOB_ID` environment variable
- Set the number of devices (`devices`) based on the `SLURM_GPUS_PER_NODE` environment variable, falling back to `torch.cuda.device_count()` if not set
- Set the number of nodes (`num_nodes`) based on the `SLURM_JOB_NUM_NODES` environment variable, defaulting to 1 if not set (a sketch of this logic is shown below the rationale)

Rationale for using SLURM
SLURM (Simple Linux Utility for Resource Management) is a widely used job scheduler and resource manager for high-performance computing (HPC) clusters. It provides a convenient way to allocate and manage resources, including GPUs, across multiple nodes in a cluster.
By leveraging SLURM, we can scale training across multiple GPUs and nodes without manually configuring the device and node counts.
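For concreteness, the detection logic described under "Key changes" could look roughly like the sketch below. The function name and structure are illustrative, not the exact code in this PR:

```python
import os

import torch


def slurm_devices_and_nodes():
    """Derive `devices` and `num_nodes` for the Trainer from the SLURM
    environment, falling back to single-node defaults outside SLURM."""
    if "SLURM_JOB_ID" in os.environ:
        # Running inside a SLURM job: trust the scheduler's allocation.
        # Note: SLURM_GPUS_PER_NODE is usually a plain count (e.g. "4"),
        # but can include a GPU type (e.g. "a100:4") depending on how the
        # job was submitted; a robust implementation should handle both.
        devices = int(
            os.environ.get("SLURM_GPUS_PER_NODE", torch.cuda.device_count())
        )
        num_nodes = int(os.environ.get("SLURM_JOB_NUM_NODES", "1"))
    else:
        # Not under SLURM: use the locally visible GPUs on a single node
        devices = torch.cuda.device_count() or 1
        num_nodes = 1
    return devices, num_nodes
```

The returned values would then be passed to the `pl.Trainer(devices=..., num_nodes=...)` call, together with `use_distributed_sampler=True` when not in evaluation mode.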