Open · sleeplessai opened this issue 3 years ago
Hello, have you solved the problem, @sleeplessai?
Hi @geovsion. Yes, I solved multi-GPU training by specifying the num_gpus property for the PL Trainer and adding SyncBatchNorm support. For this, I updated the main packages: PL to 0.9.0 and PyTorch to 1.6.0. Since the author didn't reply quickly, I forked the original repo to sleeplessai/mvsnet2_pl to maintain it going forward. The code has been tested on a 3-GPU cluster node and works well.
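For reference, the relevant Trainer setup in the fork looks roughly like the sketch below. It is only a sketch: `hparams` and `system` are placeholders for the parsed options and the LightningModule built in train.py, and the argument names (`gpus`, `distributed_backend`, `sync_batchnorm`) assume the PL 0.9 Trainer API.

```python
from pytorch_lightning import Trainer

# Sketch only; `hparams` / `system` are placeholders for the objects built in train.py.
trainer = Trainer(
    gpus=hparams.num_gpus,       # e.g. 3 on the 3-GPU node
    distributed_backend='ddp',   # multi-process DistributedDataParallel, no apex required
    sync_batchnorm=True,         # PL converts BatchNorm layers to SyncBatchNorm under DDP
)
trainer.fit(system)
```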
Hi, @kwea123
I am running some experiments with this MVSNet implementation because of its clean and simple PyTorch Lightning wrapping. To speed up training, I want to train the model with 3 GPUs on my server, but an error occurs when the hyperparameter --gpu_num is simply set to 3. PyTorch Lightning raised the following verbose information:
To solve this problem, I modified train.py to set the corresponding parameter when constructing the PL Trainer.
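Roughly, the modified part looks like the sketch below. This is not the exact code, just an illustration with placeholder values; the argument names assume the Trainer API of the PL series I'm using (`gpus` and `distributed_backend`).

```python
from pytorch_lightning import Trainer

# Sketch with placeholder values; the other Trainer options from train.py are omitted.
trainer = Trainer(
    gpus=3,                      # use all 3 GPUs on the node
    distributed_backend='ddp',   # explicitly select DistributedDataParallel
)
```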
The model can be trained after this hyperparameter is configured.
Is this the correct way to enable multi-GPU training? For some reason, I cannot install NVIDIA apex on the current server. Should I use SyncBatchNorm for this model implementation, and if so, how? Does skipping SyncBN affect performance? If I should use it, please tell me whether to call nn.SyncBatchNorm.convert_sync_batchnorm() or to use PyTorch Lightning's sync_bn option in the Trainer configuration.
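For concreteness, here is a sketch of the two options I have in mind; `model` is just a placeholder for the MVSNet module, and the Trainer flag only exists in PL versions that ship it.

```python
import torch.nn as nn

# Option 1: convert BatchNorm layers to SyncBatchNorm manually before training
# (pure PyTorch, works with DDP, no apex needed). `model` is a placeholder here.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# Option 2: let PyTorch Lightning do the conversion via the Trainer flag,
# e.g. Trainer(..., sync_batchnorm=True), in PL versions that support it.
```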
Thanks a lot. 😊