thomas-ames opened this issue 3 years ago
Hi @thomas-ames, thanks for reporting this problem. I think the issue you ran into is the same as #10, which seems to have been solved by @oym050922021.
Hi @oym050922021, could you please share your solution to this problem to help @thomas-ames fix it? Thank you.
Hi, sorry, I haven't solved the problem yet.
You may need to upgrade the Nvidia driver according to this answer.
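In case it helps, here is a minimal sketch (my addition, not from the linked answer) for checking the CUDA version PyTorch was built against alongside the installed NVIDIA driver version, which is the compatibility pair to verify before upgrading the driver:

```python
# Hedged sketch: print the CUDA version PyTorch was built with and the installed
# NVIDIA driver version, so they can be compared against NVIDIA's compatibility table.
import subprocess
import torch

print("PyTorch built with CUDA:", torch.version.cuda)
print("CUDA available to PyTorch:", torch.cuda.is_available())

# nvidia-smi reports the installed driver; --query-gpu keeps the output terse.
driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip()
print("Installed NVIDIA driver:", driver)
```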
Describe the bug
While running distributed training, the script works fine for 3-5 epochs and then stops. The GPUs remain active and no error or stack trace is printed, but there is no further output. I cannot tell why it happens: I have rerun the same configuration in the same environment repeatedly, and the script stops at irregular intervals. It always seems to be early on; the latest it has hung is epoch 5.
Reproduction
./tools/dist_train.sh /home/ec2-user/vfnetx_config.py 8
(The config file is the same as the one in the repo, I just renamed it.)
Environment
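To gather more information about the silent hang, one option (my suggestion, not something shipped with this repo) is to have each rank periodically dump its Python stacks with the standard-library faulthandler module; adding it near the top of the training entry point and the log path shown are assumptions for illustration. If every rank is blocked in the same collective call, the hang is more likely a communication problem than a bug in the model code.

```python
# Hedged debugging sketch: dump all thread stacks from each rank every 10 minutes
# so a silent hang leaves evidence of where each process is blocked.
import faulthandler
import os

rank = os.environ.get("RANK", "0")  # torch.distributed.launch sets RANK per process
trace_file = open(f"/tmp/rank{rank}_stacks.log", "w")  # example path, one file per rank

# Re-dump every 600 seconds until the process exits; the file must stay open.
faulthandler.dump_traceback_later(600, repeat=True, file=trace_file)
```

Setting NCCL_DEBUG=INFO in the environment before launching dist_train.sh may also surface communication errors that otherwise fail silently.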