sacs-epfl / decentralizepy

A decentralized learning research framework
MIT License
24 stars 18 forks

Running the framework on a Slurm instance #16

Open MohamedLEGH opened 1 month ago

MohamedLEGH commented 1 month ago

Slurm (https://slurm.schedmd.com/) is an open-source cluster management and job scheduling system. I want to run a simulation on a Slurm cluster, but I don't know how to parallelize the tasks. The simulator requires me to provide the IP addresses of the nodes, but I don't know in advance which nodes my code will run on (maybe I can work around this, but it's a bit complex).
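For context, here is a minimal sketch of the kind of workaround meant above: discovering the allocated nodes at job start instead of hard-coding them. It relies on the `SLURM_JOB_NODELIST` and `SLURM_NODEID` environment variables that Slurm sets for job steps, and on `scontrol show hostnames` to expand the compact node list. The function names, the `"0", "1", ...` machine-id-to-IP mapping, and the idea of feeding it to the framework as a JSON file are assumptions for illustration, not something confirmed by the framework. It also assumes that the order of the expanded node list matches `SLURM_NODEID`, which should be checked on the actual cluster.

```python
import json
import os
import socket
import subprocess


def allocated_hostnames():
    """Expand SLURM_JOB_NODELIST (e.g. "node[01-03]") into a list of hostnames."""
    nodelist = os.environ["SLURM_JOB_NODELIST"]
    result = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.split()


def build_ip_map(hostnames, resolve=socket.gethostbyname):
    """Map hypothetical machine ids "0", "1", ... to IPs, in node-list order."""
    return {str(i): resolve(host) for i, host in enumerate(hostnames)}


# Only runs inside a Slurm job step (e.g. launched with srun).
if __name__ == "__main__" and "SLURM_JOB_NODELIST" in os.environ:
    ip_map = build_ip_map(allocated_hostnames())
    # SLURM_NODEID is this node's relative index within the allocation.
    machine_id = int(os.environ.get("SLURM_NODEID", "0"))
    print(f"machine_id={machine_id}")
    print(json.dumps(ip_map, indent=2))
```

Each node could run this at startup (for example via `srun python discover_nodes.py`, a hypothetical file name) and pass its own `machine_id` and the resulting mapping to the simulator.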

Has anyone already tried to run the framework on a Slurm cluster with multiple nodes?

Thanks, Mohamed Amine