microsoft / DialoGPT

Large-scale pretraining for dialogue
MIT License

Distributed training hangs indefinitely at FP16_Optimizer#step #35

Open justinmanley opened 4 years ago

justinmanley commented 4 years ago

When I run distributed training with more than one GPU, training gets stuck at the very beginning and hangs indefinitely. It is stuck in FP16_Optimizer#step (specifically at this line, where data is implicitly moved from the GPU to the CPU).

The command line hangs here indefinitely and makes no progress no matter how long I wait:

training: 0%| | 0/1000000 [00:00<?, ?it/s]
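
For reference, the line in FP16_Optimizer#step mentioned above performs an implicit GPU-to-CPU copy, and in PyTorch such a copy blocks until everything queued on that GPU, including any pending NCCL collective, has finished. So if another rank never reaches the matching collective, the copy waits forever. A minimal sketch of that mechanism (not the actual apex code; the tensors here are made up):

```python
import torch

# An implicit GPU->CPU transfer (.item(), float(x_cuda), .cpu()) synchronizes the
# calling process with the CUDA stream: it cannot return until every kernel queued
# before it, including any NCCL all-reduce, has completed on this GPU.
grads = torch.randn(1024, device="cuda")         # stand-in for the FP16 gradients
total = grads.float().sum()                      # queued asynchronously on the GPU
has_overflow = not torch.isfinite(total).item()  # blocks here until the GPU drains its queue
print("overflow:", has_overflow)
```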

I see this issue regardless of which CUDA version I use (I've tried it with 10.0 and 10.1), and regardless of whether I install apex myself or use the docker image icaruszyz/large-scale-training:dialogpt.

I do not experience this issue when I run demo.py instead of launching training with python -m torch.distributed.launch (i.e. I see this issue only when I try to train on multiple GPUs, not on a single GPU). I have not tried training with full 32-bit precision because I want to limit the number of GPUs I have to use.

The fact that this issue occurs only when training with multiple GPUs, and on a line that transfers data from the GPU to the CPU, suggests to me that there may be a race condition in how data is collected from the multiple GPUs.
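
If it helps narrow this down, a bare all_reduce smoke test run under the same launcher should show whether the NCCL collectives themselves get stuck on this machine, independent of apex and the training code. This is only a diagnostic sketch (the file name nccl_smoke_test.py is made up and not part of this repo):

```python
# nccl_smoke_test.py -- run with:
#   python -m torch.distributed.launch --nproc_per_node=2 nccl_smoke_test.py
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")

x = torch.ones(1, device=f"cuda:{args.local_rank}")
dist.all_reduce(x)  # sums the tensor across all ranks
print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")
```

If this also hangs, the problem is in the NCCL/driver setup rather than in FP16_Optimizer; running it with NCCL_DEBUG=INFO set usually shows which rank gets stuck.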

Training configuration:

INFO - __main__ -   Input Argument Information
INFO - __main__ -   model_name_or_path            ./configs/762M
INFO - __main__ -   seed                          42
INFO - __main__ -   max_seq_length                128
INFO - __main__ -   skip_eval                     False
INFO - __main__ -   init_checkpoint               ./models/large/large_fs.pkl
INFO - __main__ -   train_input_file              ./data/train.128len.db
INFO - __main__ -   eval_input_file               ./data/dummy_data.tsv
INFO - __main__ -   continue_from                 0
INFO - __main__ -   train_batch_size              8
INFO - __main__ -   gradient_accumulation_steps   2
INFO - __main__ -   eval_batch_size               16
INFO - __main__ -   learning_rate                 0.0001
INFO - __main__ -   num_optim_steps               1000000
INFO - __main__ -   valid_step                    10000
INFO - __main__ -   warmup_proportion             0.1
INFO - __main__ -   warmup_steps                  16000
INFO - __main__ -   normalize_data                True
INFO - __main__ -   fp16                          True
INFO - __main__ -   lr_schedule                   noam
INFO - __main__ -   loss_scale                    0
INFO - __main__ -   no_token_id                   True
INFO - __main__ -   output_dir                    models/output_model
INFO - __main__ -   log_dir                       None
INFO - __main__ -   pbar                          True
INFO - __main__ -   local_rank                    0
INFO - __main__ -   config                        None
INFO - __main__ -   device                        cuda:0
INFO - __main__ -   n_gpu                         1

This bug is preventing me from fine-tuning the large model, which requires multiple GPUs.

Has anyone else experienced this or found a workaround?

intersun commented 4 years ago

Our model was trained on multiple GPUs without issues; it might be a display issue related to the progress bar, or a different version of apex...

So can you try with a very small num_optim_steps (say 100) and see whether it actually trains the model or not?
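
One way to check whether it is only a display problem is to watch output_dir while the job runs: if files keep appearing there (note that checkpoints may only be written every valid_step steps, or into a run-specific subdirectory, or at the end of the run), training is actually progressing even though the bar shows 0%. A rough sketch of such a watcher, with the path taken from the output_dir in your config above:

```python
import os
import time

# Hypothetical check: list output_dir once a minute; if new files keep appearing
# while the progress bar still shows 0%, the hang is only in the tqdm display.
output_dir = "models/output_model"  # output_dir from the config above
for _ in range(10):
    entries = sorted(os.listdir(output_dir)) if os.path.isdir(output_dir) else []
    print(time.strftime("%H:%M:%S"), entries[-3:] if entries else "nothing written yet")
    time.sleep(60)
```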