multi gpu mode is much slower than single gpu mode

kwea123 / CasMVSNet_pl

Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching using pytorch-lightning

GNU General Public License v3.0

multi gpu mode is much slower than single gpu mode #31

Closed tau-yihouxiang closed 3 years ago

tau-yihouxiang commented 3 years ago

First of all, thank you for the reimplementation! However, I found that training with multiple GPUs is much slower than with a single GPU: single GPU: 1.01 s/iter; 2 GPUs: 3.84 s/iter.

kwea123 commented 3 years ago

Hi, I suspect it's due to slow disk reads (concurrent reads), since in the dataset's __getitem__ the GPUs load the same images: https://github.com/kwea123/CasMVSNet_pl/blob/c94e7b00a6fd73df37117ddee1945fe99a43138d/datasets/dtu.py#L147-L193 Can you check the full profile with pytorch-lightning? You might need to update the pytorch-lightning version and fix some of the code in train.py.
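One way to test the disk-read suspicion without touching the training loop is to time the dataset's __getitem__ directly. The sketch below is not code from this repo or from pytorch-lightning: `time_getitem` and `SlowDataset` are hypothetical names, and the stand-in dataset only sleeps to imitate a slow image read. With the real dataset, running it once alone and once in two processes at the same time would show whether concurrent reads slow each other down.

```python
import time


def time_getitem(dataset, indices):
    """Return the average seconds spent in dataset[i] over the given indices.

    `dataset` is anything indexable (for example a torch Dataset). This is a
    hypothetical helper, not part of CasMVSNet_pl or pytorch-lightning.
    """
    indices = list(indices)
    if not indices:
        raise ValueError("indices must not be empty")
    start = time.perf_counter()
    for i in indices:
        _ = dataset[i]  # calls __getitem__, i.e. the disk read plus decoding
    return (time.perf_counter() - start) / len(indices)


class SlowDataset:
    """Stand-in dataset that sleeps to imitate a slow image read."""

    def __init__(self, n, delay):
        self.n = n
        self.delay = delay

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        time.sleep(self.delay)
        return idx


if __name__ == "__main__":
    avg = time_getitem(SlowDataset(n=8, delay=0.01), range(8))
    print(f"average __getitem__ time: {avg:.4f} s")
```

If the average per-sample time multiplied by the batch size is already close to the observed time per iteration, the loader rather than the GPUs is the likely limit.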

uestchang commented 2 years ago

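Thanks to the author for the open-source code, and sorry to bother you. I found this old issue because I ran into the same problem: with num_gpus set to 2, training takes about 3 times as long. Is there a solution yet? @tau-yihouxiang @kwea123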

kwea123 commented 2 years ago

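Hi, my answer above was that reading the images is too slow. To speed it up, you would probably need to rewrite the dataset: either use a fast loader such as NVIDIA DALI, or first convert each PNG image to a tensor and save it to a file, then read that file directly during training.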
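The second option (decode each image once, save the result, and read the saved file during training) could look roughly like the sketch below. It is an illustration under assumptions, not the author's implementation: `CachedDataset` is a hypothetical wrapper, samples are assumed to be picklable, and the standard-library `pickle` is used only to keep the example self-contained. In a real pipeline `torch.save` and `torch.load` would be the natural replacement for storing tensors.

```python
import os
import pickle
import tempfile


class CachedDataset:
    """Wrap a slow dataset and cache each decoded sample on disk.

    Hypothetical helper: the first access to an index calls the wrapped
    dataset and writes the result to cache_dir; later accesses read the
    cached file instead of decoding the image again.
    """

    def __init__(self, dataset, cache_dir):
        self.dataset = dataset
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def __len__(self):
        return len(self.dataset)

    def _path(self, idx):
        return os.path.join(self.cache_dir, f"{idx:08d}.pkl")

    def __getitem__(self, idx):
        path = self._path(idx)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)
        sample = self.dataset[idx]  # slow path: original read and decode
        # Write to a temporary file and rename it, so a concurrent worker
        # never sees a half-written cache entry.
        fd, tmp = tempfile.mkstemp(dir=self.cache_dir)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(sample, f, protocol=pickle.HIGHEST_PROTOCOL)
        os.replace(tmp, path)
        return sample
```

The cache is filled lazily during the first epoch, so only later epochs benefit. A one-off loop over all indices before training would fill it up front.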