Closed mcloarec001 closed 3 years ago
Hi! What are you using to launch your distributed training? Which script are you using? Could you show the command you ran, and could you paste your environment information as required by the template? Thank you.
transformers version: 2.11.0. I am using the model xlm-roberta-base. The task I am working on is further training on my own dataset.
The problem arises when using the script examples/language-modeling/run_language_modeling.py with the following command:
python transformers/examples/language-modeling/run_language_modeling.py \
    --model_type xlm-roberta \
    --model_name_or_path xlm-roberta-base \
    --train_data_file data/processed/piaf.txt \
    --output_dir ./output \
    --learning_rate 0.1 \
    --per_gpu_train_batch_size 2 \
    --local_rank 0 \
    --num_train_epochs 1 \
    --do_train \
    --mlm
Okay, could you try using the torch.distributed.launch utility to launch the distributed training? Given that you have 8 GPUs, your command would be:
python -m torch.distributed.launch \
    --nproc_per_node 8 \
    transformers/examples/language-modeling/run_language_modeling.py \
    --model_type xlm-roberta \
    --model_name_or_path xlm-roberta-base \
    --train_data_file data/processed/piaf.txt \
    --output_dir ./output \
    --learning_rate 0.1 \
    --per_gpu_train_batch_size 2 \
    --num_train_epochs 1 \
    --do_train \
    --mlm
Note that you should not pass --local_rank yourself here: the launcher supplies a different --local_rank to each process it spawns.
Feel free to modify --nproc_per_node 8 according to the number of GPUs you have.
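For reference, what torch.distributed.launch does on a single node can be sketched roughly as follows. This is a simplified illustration, not the real launcher code: it only shows how one worker command per GPU is built, each with its own --local_rank.

```python
import shlex

# Simplified sketch of single-node torch.distributed.launch behavior:
# spawn nproc_per_node copies of the training script, each pinned to one
# GPU via a distinct --local_rank. (The real launcher also exports
# RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT before starting them.)
def build_commands(script, args, nproc_per_node):
    commands = []
    for local_rank in range(nproc_per_node):
        cmd = ["python", script, f"--local_rank={local_rank}", *args]
        commands.append(shlex.join(cmd))
    return commands

for cmd in build_commands(
    "run_language_modeling.py", ["--do_train", "--mlm"], nproc_per_node=2
):
    print(cmd)
```

Each printed line corresponds to one worker process; this is why hardcoding --local_rank 0 in the command yields a single process on GPU 0.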
Even though CUDA detects all the GPUs of the machine, there is no distributed training:
RuntimeError: CUDA out of memory. Tried to allocate 3.82 GiB (GPU 0; 11.17 GiB total capacity; 7.59 GiB already allocated; 594.31 MiB free; 10.28 GiB reserved in total by PyTorch) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:289)
The training is launched only on the first GPU. I'm using the language modeling code. I tried setting local_rank to 0 and using the Torch environment variables MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK, but without success. Is there another way to do distributed training with transformers?
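One way to check whether a given worker process actually received the distributed configuration is to print the relevant environment variables from inside the script. This is a hypothetical stdlib-only diagnostic, not part of the transformers examples; if these variables come back empty under the launcher, the processes were not started in distributed mode.

```python
import os

# Print the distributed-training environment this process sees.
# torch.distributed.launch is expected to set these for each worker;
# if they are all None, the process is running standalone.
def distributed_env():
    keys = ["MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "LOCAL_RANK"]
    return {key: os.environ.get(key) for key in keys}

for key, value in distributed_env().items():
    print(f"{key}={value}")
```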