I am also curious about introducing mpi4py to fulfill distributed training instead of applying the usual distributed launching method provided by PyTorch.
Could you try running mpirun --version to double-check the version? Besides, you may try re-installing mpi4py to see whether that helps.
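For reference, here is a minimal sanity check (hypothetical script name check_mpi.py, not part of the repo) that verifies mpi4py is linked against the OpenMPI that mpirun actually launches, and that each process gets a distinct rank:

```python
# check_mpi.py -- run with: mpirun -np 2 python check_mpi.py
# A minimal sketch: each launched process should print a different rank,
# and the reported library version should match `mpirun --version`.
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} / {comm.Get_size()}", flush=True)
if comm.Get_rank() == 0:
    # Should report the same OpenMPI version as the mpirun on your PATH.
    print(MPI.Get_library_version(), flush=True)
```

If the ranks are all 0 or the version does not match, mpi4py was likely built against a different MPI installation than the mpirun being used.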
Thanks for your reply. My openmpi version is 4.1.1, and I did reinstall mpi4py after building openmpi. So the code did not throw an error; it just got stuck there.
Meanwhile, I really like your code, and I am currently building my codebase based on your impressive work. But I am curious why you chose openmpi + torch.distributed for distributed training instead of the standard approach.
It's because openmpi is easier to scale up to multi-node training. Could you try running torchpack dist-run -np 2 hostname to see if that works?
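To illustrate the idea, below is a rough sketch (my own, not the project's actual implementation) of one common way to bridge an mpirun launch into torch.distributed: OpenMPI exposes OMPI_COMM_WORLD_* environment variables to each launched process, so the script can read its rank and world size from them and then initialize the PyTorch process group. The master_addr/master_port values here are placeholders.

```python
# Sketch only: bridges an `mpirun -np N python train.py` launch into
# torch.distributed; torchpack's dist.init() may differ in details.
import os

import torch
import torch.distributed as dist


def init_from_mpi(master_addr="127.0.0.1", master_port="29500"):
    # OpenMPI sets these for every process it launches.
    rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
    world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
    local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])

    # torch.distributed's default env:// rendezvous needs these two.
    os.environ.setdefault("MASTER_ADDR", master_addr)
    os.environ.setdefault("MASTER_PORT", master_port)

    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank
```

Because mpirun can spawn processes across machines from a single command, the same script scales from single-node to multi-node without changing the launch logic.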
I rebuilt all the dependencies on another server and successfully started the multi-GPU training process. The problem was probably caused by multiple versions of openmpi existing on the previous server (I had previously tried installing openmpi via apt install). Thanks for the reply~
After installing OpenMPI v4.1.1, I tried to train in multi-GPU mode, but the process got stuck. It seems to be stuck at dist.init().
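When the hang is at initialization, a quick check (hypothetical script, assuming the job is launched with OpenMPI's mpirun) is to confirm that each launched process actually receives OpenMPI's rank/size environment variables; if they are missing or identical across processes, the job was launched by a different or conflicting MPI installation, which matches the multiple-openmpi issue resolved above.

```python
# check_env.py -- run with: mpirun -np 2 python check_env.py
# Each process should print a distinct OMPI_COMM_WORLD_RANK.
import os

rank = os.environ.get("OMPI_COMM_WORLD_RANK")
size = os.environ.get("OMPI_COMM_WORLD_SIZE")
print(f"pid={os.getpid()} rank={rank} world_size={size}", flush=True)
```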