jhkohpav / MGTANet


About DDP training on one machine with multiple GPUs (single-machine multi-GPU distributed training) #13

Closed ZecCheng closed 7 months ago

ZecCheng commented 7 months ago

Hi jhkohpav, thanks for your great work! I have installed the Docker environment as described in the README.md, and I can indeed run EVAL on the whole NuScenes dataset with the pretrained .pth files you provide, getting the same results that you published online. Now, here is the point: I have a question about how to train the whole MGTANet model in DDP mode. Strangely, my training does not even get through the first epoch. It gets stuck, and the logger prints the same warning four times (once per GPU, since I used 4 GPUs):

“[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())”
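For context, the construction this warning refers to looks roughly like the sketch below. This is not the actual MGTANet training code; `MyModel` and the rank handling are illustrative placeholders:

```python
# Minimal DDP construction sketch (placeholders, not the MGTANet code).
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")      # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun / the launch script
torch.cuda.set_device(local_rank)

model = MyModel().cuda(local_rank)           # `MyModel` is a placeholder
model = DDP(
    model,
    device_ids=[local_rank],
    # The warning above only says this flag adds an extra autograd-graph
    # traversal when no parameters are unused; it should not hang by itself.
    find_unused_parameters=True,
)
```

As the warning text itself says, this flag is a performance note rather than an error, so the hang presumably happens somewhere else.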

It seems to stay stuck there forever. Did you ever run into bugs like this while training your model with multiple GPUs on one machine? Do you have any idea how to solve this DDP training hang? Looking forward to your kind reply whenever you have free time. Thanks again! 😄
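In case it helps to narrow things down, here is what I can try on my side; `TORCH_DISTRIBUTED_DEBUG` and `NCCL_DEBUG` are the standard PyTorch/NCCL debug switches (a minimal sketch, set before the process group is created):

```python
# Enable distributed debug logging before init_process_group is called.
import os

os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # report per-rank collective mismatches
os.environ["NCCL_DEBUG"] = "INFO"                 # log NCCL topology and rendezvous setup

import torch.distributed as dist

dist.init_process_group(backend="nccl")
```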

My environment:

- GPUs: 8 × GeForce RTX 3090 (24 GB), with batch_size=2 and num_workers_per_gpu=8
- Operating System: Ubuntu 22.04.2 LTS
- OSType: linux
- Architecture: x86_64
- CPUs: 104
- Total Memory: 188.5 GiB
- Docker Root Dir: /var/lib/docker
- Debug Mode: false
- Experimental: false
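For what it's worth, one common cause of DDP runs hanging near the start is different ranks iterating over different numbers of batches. Below is a minimal sketch of a per-GPU DataLoader matching my batch size and worker settings; `dataset` and `num_epochs` are placeholders, not the actual MGTANet pipeline:

```python
# Per-GPU DataLoader sketch (placeholders, not the MGTANet data pipeline).
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(dataset, shuffle=True)  # shards data so every rank sees the same batch count
loader = DataLoader(
    dataset,
    batch_size=2,        # per-GPU batch size from my config
    num_workers=8,       # worker processes per GPU process
    sampler=sampler,     # replaces shuffle=True in the DataLoader itself
    pin_memory=True,
)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # ensures a different shuffle each epoch
    for batch in loader:
        ...                   # forward/backward as usual
```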

P.S. Training on a single GPU works without any problem. (two image attachments)