neulab / awesome-align

A neural word aligner based on multilingual BERT
https://arxiv.org/abs/2101.08231
BSD 3-Clause "New" or "Revised" License

DistributedDataParallel does not work on some PyTorch versions #13

Closed · zdou0830 closed this issue 2 years ago

zdou0830 commented 3 years ago

https://github.com/neulab/awesome-align/blob/c4e59934cfb4abbfb4915a6f21aa5c9ca67fd55c/awesome_align/modeling.py#L379

and

https://github.com/neulab/awesome-align/blob/c4e59934cfb4abbfb4915a6f21aa5c9ca67fd55c/awesome_align/modeling.py#L408

would cause errors with DistributedDataParallel on some older PyTorch versions (<= 1.7.1).

As discussed in https://github.com/pytorch/pytorch/issues/41324, one workaround is to change the above lines to:

self.decoder.bias = nn.Parameter(self.bias.clone())
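
For context, a minimal sketch of the pattern behind those lines (modeled after the HuggingFace-style BertLMPredictionHead; the class name and sizes here are illustrative, not the repo's exact code):

import torch
import torch.nn as nn

class LMPredictionHead(nn.Module):
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.decoder = nn.Linear(hidden_size, vocab_size, bias=False)
        self.bias = nn.Parameter(torch.zeros(vocab_size))
        # Original: self.decoder.bias = self.bias
        # That shares one Parameter between two attributes, which breaks
        # DistributedDataParallel's parameter bookkeeping on PyTorch <= 1.7.1.
        # Workaround: register an independent copy of the bias instead.
        self.decoder.bias = nn.Parameter(self.bias.clone())

    def forward(self, hidden_states):
        # The decoder now carries its own bias, so a plain linear suffices.
        return self.decoder(hidden_states)

Note the trade-off: after the clone, self.bias and self.decoder.bias are separate parameters and can drift apart during training.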

ruoyuxie commented 2 years ago

Hi,

I am still getting an AttributeError: 'DataParallel' object has no attribute 'get_aligned_word' error while using multiple GPUs to fine-tune the model. My PyTorch version is 1.10 and I have also applied the change to modeling.py. Is there any way I can fix it?

Thanks!

zdou0830 commented 2 years ago

Hi @ruoyuxie, thanks! I just updated the code; could you test whether the issue is fixed? Also, it is recommended to use DistributedDataParallel instead of DataParallel.
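
For reference, the DataParallel/DistributedDataParallel wrappers only forward the model's forward call; custom methods such as get_aligned_word live on the underlying model and have to be reached through .module. A minimal sketch, assuming model is the unwrapped awesome-align model:

import torch

wrapped = torch.nn.DataParallel(model)  # or DistributedDataParallel(model, ...)

# The wrapper does not expose model-specific methods; unwrap first.
core = wrapped.module if hasattr(wrapped, "module") else wrapped
# core.get_aligned_word(...)  # call with the usual arguments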

An example command:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 run_train.py \
--output_dir=$OUTPUT_DIR \
--model_name_or_path=bert-base-multilingual-cased \
--extraction 'softmax' \
--do_train \
--train_so \
--train_tlm \
--train_data_file=$TRAIN_FILE \
--per_gpu_train_batch_size 2 \
--gradient_accumulation_steps 2 \
--num_train_epochs 1 \
--learning_rate 2e-5 \
--save_steps 4000 \
--max_steps 20000
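
On recent PyTorch releases (1.10+), torch.distributed.launch is deprecated in favor of torchrun, so an equivalent invocation would start with

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 run_train.py \

followed by the same flags. Note that torchrun passes the local rank via the LOCAL_RANK environment variable rather than a --local_rank argument, so the training script may need a small adjustment to read it from the environment.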