I am also curious about introducing mpi4py to fulfill distributed training instead of applying the usual distributed launching method provided by PyTorch.
Could you try running mpirun --version to double-check the version? Besides, you may try re-installing mpi4py to see whether that helps.
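For reference, here is a minimal sanity check (hypothetical script name check_mpi.py, not part of the repo) that verifies mpi4py is linked against the OpenMPI that mpirun actually launches, and that each process gets a distinct rank:

```python
# check_mpi.py -- run with: mpirun -np 2 python check_mpi.py
# A minimal sketch: each launched process should print a different rank,
# and the reported library version should match `mpirun --version`.
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} / {comm.Get_size()}", flush=True)
if comm.Get_rank() == 0:
    # Should report the same OpenMPI version as the mpirun on your PATH.
    print(MPI.Get_library_version(), flush=True)
```

If the ranks are all 0 or the version does not match, mpi4py was likely built against a different MPI installation than the mpirun being used.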
Thanks for your reply. My openmpi version is 4.1.1, and I did reinstall mpi4py after building openmpi. So the code did not throw an error; it just got stuck there.
Meanwhile, I really like your code, and I am currently building my codebase based on your impressive work. But I am curious why you chose openmpi + torch.distributed for distributed training instead of the standard approach.
It's because openmpi is easier to scale up to multi-node training. Could you try running torchpack dist-run -np 2 hostname to see if that works?
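To illustrate the idea, below is a rough sketch (my own, not the project's actual implementation) of one common way to bridge an mpirun launch into torch.distributed: OpenMPI exposes OMPI_COMM_WORLD_* environment variables to each launched process, so the script can read its rank and world size from them and then initialize the PyTorch process group. The master_addr/master_port values here are placeholders.

```python
# Sketch only: bridges an `mpirun -np N python train.py` launch into
# torch.distributed; torchpack's dist.init() may differ in details.
import os

import torch
import torch.distributed as dist


def init_from_mpi(master_addr="127.0.0.1", master_port="29500"):
    # OpenMPI sets these for every process it launches.
    rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
    world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
    local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])

    # torch.distributed's default env:// rendezvous needs these two.
    os.environ.setdefault("MASTER_ADDR", master_addr)
    os.environ.setdefault("MASTER_PORT", master_port)

    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank
```

Because mpirun can spawn processes across machines from a single command, the same script scales from single-node to multi-node without changing the launch logic.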
I rebuilt all the dependencies on another server and successfully started the multi-GPU training process. The problem was probably caused by multiple versions of openmpi existing on the previous server (I had previously tried installing openmpi via apt install). Thanks for the reply~
After installing OpenMPI v4.1.1, I tried to train in multi-GPU mode, but the process got stuck. It seems to be stuck at dist.init().
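When the hang is at initialization, a quick check (hypothetical script, assuming the job is launched with OpenMPI's mpirun) is to confirm that each launched process actually receives OpenMPI's rank/size environment variables; if they are missing or identical across processes, the job was launched by a different or conflicting MPI installation, which matches the multiple-openmpi issue resolved above.

```python
# check_env.py -- run with: mpirun -np 2 python check_env.py
# Each process should print a distinct OMPI_COMM_WORLD_RANK.
import os

rank = os.environ.get("OMPI_COMM_WORLD_RANK")
size = os.environ.get("OMPI_COMM_WORLD_SIZE")
print(f"pid={os.getpid()} rank={rank} world_size={size}", flush=True)
```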