Open justinmanley opened 4 years ago
Our model was trained on multiple GPUs without issues; it might be a display issue related to the progress bar, or a different version of apex...
So can you try with a very small num_optim_steps (say 100) and see whether it actually trains the model or not?
When I run distributed training with more than one GPU, training gets stuck at the very beginning and hangs indefinitely. It is stuck in FP16_Optimizer#set (specifically at this line, where data is implicitly moved from the GPU to the CPU).
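(A minimal sketch of my own, not the actual apex code, of why the hang surfaces at that line: a device-to-host copy makes the CPU wait for the GPU work it depends on, so if some operation never completes on one rank, every rank appears to freeze at whatever line performs the copy.)

```python
import torch

# Sketch only (assumes a CUDA device; this is not the FP16_Optimizer code
# itself): copying a CUDA tensor into a CPU tensor blocks the host until the
# GPU work it depends on has finished. If that work never completes, for
# example a collective that one rank never entered, the process appears to
# hang on a line like this one.
gpu_param = torch.randn(1024, device="cuda")
cpu_copy = torch.empty(1024)       # destination tensor on the CPU
cpu_copy.copy_(gpu_param)          # implicit GPU -> CPU transfer; waits on the GPU
```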
The command line hangs here indefinitely and makes no progress, no matter how long I wait:
```
training:   0%|          | 0/1000000 [00:00<?, ?it/s]
```
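(For anyone trying to reproduce this: one generic way to see where each rank is blocked, not something the repo provides, is to arm Python's faulthandler near the top of the training script so it periodically dumps every thread's stack while the process hangs.)

```python
import faulthandler
import sys

# Dump all Python thread stacks to stderr every 60 seconds for as long as the
# process is alive; while training hangs, the repeated dumps show the exact
# frame where each rank is blocked.
faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)
```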
I see this issue regardless of which CUDA version I use (I've tried 10.0 and 10.1), and regardless of whether I install apex myself or use the docker image `icaruszyz/large-scale-training:dialogpt`. I do not experience this issue when I run `demo.py` rather than using `python -m torch.distributed.launch` to run training (i.e. I see this issue only when I try to train on multiple GPUs, not on a single GPU). I have not tried training with full 32-bit precision because I want to limit the number of GPUs I have to use.

The fact that this issue only occurs when training with multiple GPUs, and that it occurs on a line which transfers data from the GPU to the CPU, suggests to me that there may be a race condition related to collecting data from multiple GPUs.
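To help narrow this down, here is the kind of minimal check I have in mind (my own sketch, independent of apex and the DialoGPT training code): a two-GPU script that does nothing but one NCCL all_reduce followed by a GPU-to-CPU copy, launched the same way as training. If this also hangs, the problem is in the environment (NCCL, driver, docker) rather than in the training code.

```python
# Launch with: python -m torch.distributed.launch --nproc_per_node=2 repro.py
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")  # reads rank/world size from the env vars set by the launcher

x = torch.ones(1024, device="cuda")
dist.all_reduce(x)   # collective across all ranks
y = x.cpu()          # GPU -> CPU transfer, the same kind of sync point where training gets stuck
print(f"rank {dist.get_rank()}: finished, sum = {y.sum().item()}")
```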
Training configuration:
This bug is preventing me from fine-tuning the large model, which requires multiple GPUs.
Has anyone else experienced this or found a workaround?