Closed mcloarec001 closed 3 years ago
Hi! What are you using to launch your distributed training? Which script are you using? Could you show the command you ran, and could you paste your environment information as required by the template? Thank you.
transformers version: 2.11.0. I am using the model xlm-roberta-base. The task I am working on is further training on my own dataset.
The problem arises when using the script examples/language-modeling/run_language_modeling.py with the following command:
python transformers/examples/language-modeling/run_language_modeling.py \
    --model_type xlm-roberta \
    --model_name_or_path xlm-roberta-base \
    --train_data_file data/processed/piaf.txt \
    --output_dir ./output \
    --learning_rate 0.1 \
    --per_gpu_train_batch_size 2 \
    --local_rank 0 \
    --num_train_epochs 1 \
    --do_train \
    --mlm
Okay, could you try using the torch.distributed.launch utility to launch the distributed training? Given that you have 8 GPUs, your command would be:
python -m torch.distributed.launch \
    --nproc_per_node 8 \
    transformers/examples/language-modeling/run_language_modeling.py \
    --model_type xlm-roberta \
    --model_name_or_path xlm-roberta-base \
    --train_data_file data/processed/piaf.txt \
    --output_dir ./output \
    --learning_rate 0.1 \
    --per_gpu_train_batch_size 2 \
    --num_train_epochs 1 \
    --do_train \
    --mlm
Note that you should not pass --local_rank yourself here: the launcher supplies a different --local_rank to each process it spawns.
Feel free to modify --nproc_per_node 8 according to the number of GPUs you have.
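For reference, what torch.distributed.launch does on a single node can be sketched roughly as follows. This is a simplified illustration, not the real launcher code: it only shows how one worker command per GPU is built, each with its own --local_rank.

```python
import shlex

# Simplified sketch of single-node torch.distributed.launch behavior:
# spawn nproc_per_node copies of the training script, each pinned to one
# GPU via a distinct --local_rank. (The real launcher also exports
# RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT before starting them.)
def build_commands(script, args, nproc_per_node):
    commands = []
    for local_rank in range(nproc_per_node):
        cmd = ["python", script, f"--local_rank={local_rank}", *args]
        commands.append(shlex.join(cmd))
    return commands

for cmd in build_commands(
    "run_language_modeling.py", ["--do_train", "--mlm"], nproc_per_node=2
):
    print(cmd)
```

Each printed line corresponds to one worker process; this is why hardcoding --local_rank 0 in the command yields a single process on GPU 0.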
Even though CUDA detects all the GPUs of the machine, there is no distributed training:
RuntimeError: CUDA out of memory. Tried to allocate 3.82 GiB (GPU 0; 11.17 GiB total capacity; 7.59 GiB already allocated; 594.31 MiB free; 10.28 GiB reserved in total by PyTorch) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:289)
The training is launched only on the first GPU. I'm using the language modeling code. I tried setting local_rank to 0 and using the Torch environment variables MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK, but without success. Is there another way to do distributed training with transformers?
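One way to check whether a given worker process actually received the distributed configuration is to print the relevant environment variables from inside the script. This is a hypothetical stdlib-only diagnostic, not part of the transformers examples; if these variables come back empty under the launcher, the processes were not started in distributed mode.

```python
import os

# Print the distributed-training environment this process sees.
# torch.distributed.launch is expected to set these for each worker;
# if they are all None, the process is running standalone.
def distributed_env():
    keys = ["MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "LOCAL_RANK"]
    return {key: os.environ.get(key) for key in keys}

for key, value in distributed_env().items():
    print(f"{key}={value}")
```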