valeoai / RADIal


Problem of distributed training #36

Open eagles1812 opened 1 year ago

eagles1812 commented 1 year ago

Thanks for the great paper, dataset and code!

I tried to train the model on the prepared data using a single GPU, and it took roughly half a day. So I added a distributed training component; the training time decreased, but so did the AP/AR/IoU values. Have you tested distributed training? How should the parameters be set to get both a shorter training time and proper AP/AR/IoU values?

Thank you!

jrebut commented 1 year ago

Hi, can you please explain what you mean by a distributed training component?

Julien

eagles1812 commented 1 year ago

Thanks for your reply. In your code suite, training runs on a single GPU, and for a large dataset such as yours that takes a very long time. I have multiple GPUs and wanted to reduce the training time, so I modified your training code to add a distributed training component, following articles such as https://towardsdatascience.com/how-to-scale-training-on-multiple-gpus-dae1041f49d2 (roughly along the lines of the sketch below). That is when I ran into the problem described in my previous post. Thanks!
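For reference, here is a minimal sketch of the kind of change meant above, using PyTorch's DistributedDataParallel. The dummy model, dataset, batch size, and learning rate are placeholders standing in for the RADIal training script, not the repository's actual code; it is launched with something like `torchrun --nproc_per_node=<num_gpus> train_ddp.py`.

```python
# Minimal DistributedDataParallel sketch. The dummy model/dataset below are
# placeholders for the RADIal ones; hyperparameters are illustrative only.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # torchrun sets RANK / WORLD_SIZE / LOCAL_RANK for each spawned process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(16, 1).cuda(local_rank)   # substitute the RADIal model here
    model = DDP(model, device_ids=[local_rank])

    # Substitute the RADIal dataset here; DistributedSampler gives each process
    # a distinct shard of the data.
    dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=4, sampler=sampler, num_workers=2)

    # Note: the effective batch size becomes batch_size * world_size, so a
    # learning rate / schedule tuned for one GPU may need re-tuning.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                      # different shuffling each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                           # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```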