Open rohan-mehta-1024 opened 4 months ago
@subramen
I'm not a Slurm expert, but both errors look like they stem from incorrect resource allocation by Slurm. There is a similar issue on the forum; can you check whether it resolves yours? https://discuss.pytorch.org/t/error-with-ddp-on-multiple-nodes/195251/4
I am using the code from multinode.py (from this DDP tutorial series: https://www.youtube.com/watch?v=KaAJtI1T2x4) with the following Slurm script
I am unsure whether the failure to connect to the port is the root error, which then causes the downstream error of different processes trying to use the same GPU, or whether these are two separate errors. I have tried many different ports, but they all give the same failed-to-connect error. Again, my code is identical to the multinode.py example. I would appreciate any help getting to the bottom of this. Thank you.
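One way to tell the two symptoms apart is to check them independently on each node. Below is a minimal diagnostic sketch, not part of multinode.py: the helper names (`port_is_reachable`, `rank_summary`, `find_gpu_collisions`) are hypothetical, and it assumes the workers are launched by torchrun, which exports `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` to each process. The port check only succeeds if something is already listening on the master node, so a failure before the job starts is expected.

```python
import os
import socket
from collections import Counter


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


def rank_summary(env=os.environ):
    """Collect the variables torchrun is assumed to export to each worker."""
    keys = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
    return {key: env.get(key) for key in keys}


def find_gpu_collisions(records):
    """Given (hostname, local_rank) pairs, return the pairs seen more than once.

    Two workers on the same host with the same LOCAL_RANK would select the
    same GPU if the script calls torch.cuda.set_device(LOCAL_RANK).
    """
    counts = Counter(records)
    return {pair: n for pair, n in counts.items() if n > 1}


if __name__ == "__main__":
    summary = rank_summary()
    print(socket.gethostname(), summary)
    addr, port = summary["MASTER_ADDR"], summary["MASTER_PORT"]
    if addr and port:
        print("master reachable:", port_is_reachable(addr, int(port)))
```

If every worker prints a distinct (hostname, `LOCAL_RANK`) pair but the master is unreachable, the connection failure is probably the primary problem. If two workers on one host report the same `LOCAL_RANK`, the GPU clash likely comes from how the tasks are launched rather than from the port.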