Hi jhkohpav, thanks for your great work! I installed the Docker environment as described in the README.md, and I was indeed able to EVAL the whole NuScenes dataset with the pretrained .pth files you provide, reproducing the results you published online. Now here is my question: how do you train the whole MGTANet model in DDP mode? Strangely, my training run never makes progress in the first epoch. It gets stuck right away, with the logger printing the same message 4 times (once per GPU, as I used 4 GPUs):
“[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())”
and it seems to stay stuck there forever. Did you ever run into bugs like this while training your model with multiple GPUs on one machine? Do you have any idea how to solve this DDP training hang? Looking forward to your kind reply when you have free time. Thanks again! 😄
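For reference, this is roughly how I launch multi-GPU training (the script name `train.py` and the launcher are placeholders for the repo's actual entry point, so please adjust; the two environment variables are standard PyTorch/NCCL debug switches I turned on to get more detail on the hang):

```shell
# Standard PyTorch / NCCL debug switches (not specific to this repo):
export NCCL_DEBUG=INFO                 # log NCCL rendezvous and collective calls
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # make DDP report which rank/collective is out of sync

# Hypothetical launch command -- substitute the repo's real entry point and config:
python -m torch.distributed.launch --nproc_per_node=4 train.py
```

With TORCH_DISTRIBUTED_DEBUG=DETAIL, a real desynchronization (e.g. one rank skipping a collective) usually produces an explicit error instead of a silent hang, which might narrow down where it gets stuck.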
My environment is 8 GeForce RTX 3090 GPUs (24 GB each), and I set batch_size=2, num_workers_per_gpu=8.
Operating System: Ubuntu 22.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 104
Total Memory: 188.5GiB
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
PS: with single-GPU training, there is no problem.