kwea123 / MVSNet_pl

MVSNet: Depth Inference for Unstructured Multi-view Stereo using pytorch-lightning

The correct way to enable multi-GPU training #8

Open sleeplessai opened 3 years ago

sleeplessai commented 3 years ago

Hi, @kwea123

I am conducting some experiments with this MVSNet implementation, since its PyTorch Lightning wrapping is clear and simple. To speed up training, I train the model with 3 GPUs on my server, but an error comes up when the hyperparameter --num_gpus is simply set to 3. PyTorch Lightning raises this warning:

"You seem to have configured a sampler in your DataLoader. This will be replaced "
" by `DistributedSampler` since `replace_sampler_ddp` is True and you are using"
" distributed training. Either remove the sampler from your DataLoader or set"
" `replace_sampler_ddp=False` if you want to use your custom sampler."

To work around this, I modified train.py by setting these parameters in the PL Trainer:

trainer = Trainer(# ......
                  gpus=hparams.num_gpus,
                  # keep the custom sampler instead of letting PL inject a DistributedSampler
                  replace_sampler_ddp=False,
                  distributed_backend='ddp' if hparams.num_gpus > 1 else None,
                  # ......
                  )
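
With replace_sampler_ddp=False, Lightning no longer injects a DistributedSampler, so each process would iterate over the full dataset unless the data is sharded manually. A rough sketch of doing that inside the module's train_dataloader() (MyMVSDataset and the hyperparameter names are placeholders, not code from this repo):

import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Method of the LightningModule; `MyMVSDataset` is a placeholder name.
def train_dataloader(self):
    dataset = MyMVSDataset(self.hparams.root_dir, split='train')
    # With replace_sampler_ddp=False, Lightning does not shard the data,
    # so build the DistributedSampler explicitly when running under DDP.
    sampler = DistributedSampler(dataset) if torch.distributed.is_initialized() else None
    return DataLoader(dataset,
                      batch_size=self.hparams.batch_size,
                      sampler=sampler,
                      shuffle=(sampler is None),  # sampler and shuffle are mutually exclusive
                      num_workers=4,
                      pin_memory=True)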

The model trains normally once these parameters are configured.

Is this the correct way to enable multi-GPU training? For some reason I cannot install NVIDIA apex on the current server. Should I use SyncBatchNorm for this implementation, and if so, how? Does skipping SyncBN hurt performance? If I should use it, which way is preferred: nn.SyncBatchNorm.convert_sync_batchnorm() or PyTorch Lightning's sync_bn option in the Trainer configuration?
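
For concreteness, the torch-level option would look roughly like this (just a sketch; `system` stands in for the LightningModule built in train.py):

import torch.nn as nn

# Recursively replace every nn.BatchNorm*d layer with nn.SyncBatchNorm so that
# batch statistics are synchronized across DDP processes; plain PyTorch, no apex.
# `system` is a placeholder name for the LightningModule instance.
system = nn.SyncBatchNorm.convert_sync_batchnorm(system)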

Thanks a lot. 😊

Geo-Tell commented 2 years ago

Hello, have you solved the problem? @sleeplessai

sleeplessai commented 2 years ago

Hi, @geovsion. Yes, I solved multi-GPU training by specifying the num_gpus property for the PL Trainer and adding SyncBatchNorm support. For this, I updated the main packages: PL to 0.9.0 and PyTorch to 1.6.0. Since the author didn't reply quickly, I forked the original repo to sleeplessai/mvsnet2_pl to maintain it going forward. The code has been tested on a 3-GPU cluster node and works well.
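
For reference, a minimal sketch of the resulting Trainer configuration (argument names follow PL 0.9.0; other options omitted):

from pytorch_lightning import Trainer

trainer = Trainer(gpus=hparams.num_gpus,      # e.g. 3
                  distributed_backend='ddp',  # one process per GPU
                  replace_sampler_ddp=False,  # keep the custom sampler, as above
                  sync_batchnorm=True)        # PL-side SyncBatchNorm conversion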