Open sadamov opened 1 month ago
An update on my testing of this: the SLURM constants are also read correctly on our cluster, but I have not yet been able to get multi-node training working. I think this is unrelated to this code, however, and more likely due to me not having the correct setup for running multi-node jobs on our cluster. I will ask around to see if I can get it working.
In the meantime, @leifdenby (or anyone at DMI :smile:), do you have a SLURM setup that you could test this on? I think it's a good idea to test on multiple different clusters to make sure this is general enough.
I have implemented the latest feedback, updated the CHANGELOG, and added an example SLURM submission script to /docs/examples (is that a good location?), as discussed with @leifdenby. A small new section was also added to the README.md.
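For anyone who wants to try this out while reviewing, a submission script along these lines should work as a starting point. This is a sketch, not the exact script added under /docs/examples: the job name, partition, and `train_model.py` entry point are placeholders that will differ per cluster.

```bash
#!/bin/bash
#SBATCH --job-name=multi-node-training
#SBATCH --nodes=2                 # exposed to the code as SLURM_JOB_NUM_NODES
#SBATCH --gpus-per-node=4         # exposed to the code as SLURM_GPUS_PER_NODE
#SBATCH --ntasks-per-node=4       # one task per GPU, as PyTorch Lightning expects
#SBATCH --time=24:00:00
#SBATCH --partition=gpu           # placeholder: your cluster's GPU partition

# srun starts one process per task; PyTorch Lightning reads the SLURM
# environment variables and sets up distributed training across the nodes
srun python train_model.py
```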
@joeloskarsson yes, every cluster is different and I also have to adapt my submission scripts after major changes. Do you have ticket support with your HPC provider? They usually know what to do...
Enable multi-node GPU training with SLURM
This PR adds support for multi-node GPU training using the SLURM job scheduler. The changes allow the code to detect if it is running within a SLURM job and automatically configure the number of devices and nodes based on the SLURM environment variables.
Key changes
- Set `use_distributed_sampler` to `True` when not in evaluation mode to enable distributed training
- Detect whether the code is running within a SLURM job by checking for the `SLURM_JOB_ID` environment variable
- Set the number of devices (`devices`) based on the `SLURM_GPUS_PER_NODE` environment variable, falling back to `torch.cuda.device_count()` if not set
- Set the number of nodes (`num_nodes`) based on the `SLURM_JOB_NUM_NODES` environment variable, defaulting to 1 if not set (a sketch of this logic is shown below the rationale)

Rationale for using SLURM
SLURM (Simple Linux Utility for Resource Management) is a widely used job scheduler and resource manager for high-performance computing (HPC) clusters. It provides a convenient way to allocate and manage resources, including GPUs, across multiple nodes in a cluster.
By leveraging SLURM, we can scale training across multiple GPUs and nodes without manually configuring the device and node counts.
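For concreteness, the detection logic described under "Key changes" could look roughly like the sketch below. The function name and structure are illustrative, not the exact code in this PR:

```python
import os

import torch


def slurm_devices_and_nodes():
    """Derive `devices` and `num_nodes` for the Trainer from the SLURM
    environment, falling back to single-node defaults outside SLURM."""
    if "SLURM_JOB_ID" in os.environ:
        # Running inside a SLURM job: trust the scheduler's allocation.
        # Note: SLURM_GPUS_PER_NODE is usually a plain count (e.g. "4"),
        # but can include a GPU type (e.g. "a100:4") depending on how the
        # job was submitted; a robust implementation should handle both.
        devices = int(
            os.environ.get("SLURM_GPUS_PER_NODE", torch.cuda.device_count())
        )
        num_nodes = int(os.environ.get("SLURM_JOB_NUM_NODES", "1"))
    else:
        # Not under SLURM: use the locally visible GPUs on a single node
        devices = torch.cuda.device_count() or 1
        num_nodes = 1
    return devices, num_nodes
```

The returned values would then be passed to the `pl.Trainer(devices=..., num_nodes=...)` call, together with `use_distributed_sampler=True` when not in evaluation mode.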