neulab / awesome-align

A neural word aligner based on multilingual BERT
https://arxiv.org/abs/2101.08231
BSD 3-Clause "New" or "Revised" License

Allow for multi-GPUs in DDP #12

Closed. BramVanroy closed this 3 years ago.

BramVanroy commented 3 years ago

There are still issues with DataParallel as far as I can tell, but at least this allows training on multiple GPUs with DDP. Usage:

CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch --nproc_per_node=3 awesome_align/run_train.py <args>

In my case this reduced training time from 2h15 to 1h10.

Fixes #10, at least partially. I have not tested with DataParallel.
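
(For reference, the general wiring that torch.distributed.launch expects looks roughly like the sketch below. This is just the common pattern, not the exact diff in this PR; setup_ddp is an illustrative helper name, and args.local_rank / args.per_gpu_train_batch_size follow argument names already used by run_train.py.)

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup_ddp(args, model, train_dataset):
    # torch.distributed.launch starts one process per GPU and passes --local_rank
    # to each of them; every process drives exactly one visible device.
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl")

    model.to(args.local_rank)
    model = DistributedDataParallel(model, device_ids=[args.local_rank])

    # Each process trains on a disjoint shard of the dataset.
    sampler = DistributedSampler(train_dataset)
    dataloader = DataLoader(train_dataset, sampler=sampler,
                            batch_size=args.per_gpu_train_batch_size)
    return model, dataloader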

zdou0830 commented 3 years ago

Thanks a lot! However, the problem shown in the traceback below occurs on my end; do you have any idea how to fix it?

Also, DataParallel works after making this change (https://github.com/neulab/awesome-align/commit/5c33895f4f434f9d5a17042a36b935e469cc7684), though training is actually slower.
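
(For reference, the DataParallel path is just the usual single-process wrapping sketched here; this shows the general pattern rather than the exact content of that commit, and the model is assumed to come from the surrounding training script.)

import torch

def wrap_data_parallel(model):
    # DataParallel keeps a single process and splits each batch across all
    # visible GPUs; simpler than DDP but, as noted above, slower in this case.
    if torch.cuda.device_count() > 1:
        model = torch.nn.DataParallel(model)
    return model.to("cuda")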

THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "run_train.py", line 846, in <module>
    main()
  File "run_train.py", line 720, in main
    torch.cuda.set_device(args.local_rank)
  File "/nas/home/ziyidou/anaconda3/lib/python3.8/site-packages/torch/cuda/__init__.py", line 245, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:59

BramVanroy commented 3 years ago

@zdou0830 I cannot reproduce that error. Which version of torch and which command are you using? How many GPUs do you have available?

zdou0830 commented 3 years ago

Hi @BramVanroy, I used torch 1.5.0 with 4 GPUs available, and the command was:

CUDA_VISIBLE_DEVICES=1,2 python -m torch.distributed.launch --nproc_per_node=2 run_train.py --output_dir=$OUTPUT_DIR \
    --model_name_or_path=bert-base-multilingual-cased \
    --extraction 'softmax' \
    --do_train \
    --train_so \
    --train_tlm \
    --train_mlm \
    --train_psi \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --save_steps 10000 \
    --max_steps 40000 \
    --do_eval \
    --eval_data_file=$EVAL_FILE

The above problem doesn't always occur, but when it doesn't, another problem arises:

Traceback (most recent call last):
  File "awesome_align/run_train.py", line 846, in <module>
    main()
  File "awesome_align/run_train.py", line 804, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "awesome_align/run_train.py", line 338, in train
    loss = model(inputs_src=inputs_src, labels_src=labels_src)
  File "/nas/home/ziyidou/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/nas/home/ziyidou/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 460, in forward
    self.reducer.prepare_for_backward(list(_find_tensors(output)))
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
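
(For reference, option (1) in that message amounts to passing the flag when wrapping the model in DistributedDataParallel. The sketch below shows the general pattern, not the actual code in this repository; the model and local_rank are assumed to come from run_train.py.)

from torch.nn.parallel import DistributedDataParallel

def wrap_with_unused_param_detection(model, local_rank):
    # Workaround (1) from the error message: let DDP tolerate parameters that
    # receive no gradient in a given forward/backward pass instead of raising.
    return DistributedDataParallel(model,
                                   device_ids=[local_rank],
                                   find_unused_parameters=True)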

BramVanroy commented 3 years ago

It seems that the issue lies in your implementation. I am not sure why it leads to problems for you and not for me, but that might be CuDNN/version related. The problem, as per point (2) in that error trace, could be the forward of your encoder.

https://github.com/neulab/awesome-align/blob/6a64986d855a11853b767d32220ef3e34791c562/awesome_align/modeling.py#L335-L349

Let's say align_layer == 3: then you first do forward passes through layers 0 and 1, and then you take the outputs of layer 2 and return those. The forward passes through layers 0 and 1 seem not to do anything and do not contribute to calculating the final loss.

Is there any reason why you don't just do the following?

return self.layer[align_layer-1](hidden_states, attention_mask)

This should also be faster, since you would not have to do a bunch of forward passes whose outputs you do not use.

zdou0830 commented 3 years ago

I actually don't think these lines are related to the problem.

Doing return self.layer[align_layer-1](hidden_states, attention_mask) will just pass the inputs to the align_layer-th layer (without going through the embedding layer and all the previous layers). To get the output of the align_layer-th layer, you need to pass the inputs through all the previous layers.
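
Roughly, the encoder has to do something like the sketch below (a simplified illustration with a hypothetical helper name, not the exact code behind the link above; it assumes each layer returns a plain hidden-states tensor):

import torch.nn as nn

def encode_up_to(layers: nn.ModuleList, hidden_states, attention_mask, align_layer):
    # The output of layer i is the input of layer i + 1, so every layer below
    # align_layer contributes to the returned hidden states (and hence to the loss).
    for layer_module in layers[:align_layer]:
        hidden_states = layer_module(hidden_states, attention_mask)
    return hidden_states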

BramVanroy commented 3 years ago

You are completely right; I missed the re-assignment of hidden_states. I am not sure why you are getting that error then, as I cannot reproduce it. At the moment I do not have the time to look into this further. Out of curiosity, does the problem also occur with devices 0,1?

I do remember that there were some (Distributed)DataParallel issues with earlier torch versions. We had that problem over at transformers. Can you try this PR with a recent torch version?

zdou0830 commented 3 years ago

I did some testing and it seems that this is indeed a bug in PyTorch (https://github.com/pytorch/pytorch/issues/41324) that has been fixed recently. I'll merge this PR and open an issue. Thanks!