huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

CUDA out of memory for 8x V100 GPU #2084

Closed mittalpatel closed 4 years ago

mittalpatel commented 4 years ago
python -m torch.distributed.launch --nproc_per_node=8 run_squad.py \
    --model_type bert \
    --model_name_or_path bert-base-cased \
    --do_train \
    --do_eval \
    --do_lower_case \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ../models/wwm_uncased_finetuned_squad/ \
    --per_gpu_train_batch_size 24 \
    --gradient_accumulation_steps 12

We are running the same command (except that instead of bert-base-cased we are using bert-large-uncased-whole-word-masking) on 8x V100 GPUs, but we get a CUDA out of memory error (CUDA out of memory. Tried to allocate 216.00 MiB....)

As per https://github.com/huggingface/transformers/tree/master/examples it should work, but it throws this error and stops partway through training. Any tips would be appreciated.

LysandreJik commented 4 years ago

bert large is bigger than bert base. You're using a batch size of 24 (which is big, especially with 12 gradient accumulation steps).

Reduce your batch size so that the model plus your tensors fit on the GPU and you won't hit this error!
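
For intuition, here is a rough back-of-the-envelope sketch (illustrative numbers only: fp32, ~340M parameters for bert-large, no framework overhead) of why the per-GPU batch size is the knob that matters here, not the accumulated batch:

# Rough, illustrative arithmetic: fixed memory cost of the model and optimizer,
# plus the per-GPU workload that activation memory scales with.
bert_large_params = 340e6            # roughly 340M parameters for bert-large
bytes_per_param = 4                  # fp32

weights = bert_large_params * bytes_per_param
grads = weights                      # one gradient per parameter
adam_states = 2 * weights            # Adam keeps two moment buffers per parameter
print(f"model + optimizer, before activations: ~{(weights + grads + adam_states) / 1024**3:.1f} GiB")  # ~5.1 GiB

# Activations scale with what goes through the model each forward/backward pass:
per_gpu_train_batch_size = 24
max_seq_length = 384
print(f"tokens per pass per GPU: {per_gpu_train_batch_size * max_seq_length}")  # 9216

# Gradient accumulation (12 steps here) does not shrink this; it only delays the
# optimizer step, so lowering --per_gpu_train_batch_size is what frees memory.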

mittalpatel commented 4 years ago

Right @LysandreJik, reducing the batch size did fix the error, but it looks like the model we end up with is not the same as the one provided by huggingface.

In our closed-domain QnA demo, https://demos.pragnakalp.com/bert-chatbot-demo, the answers are quite good when we use the model provided by huggingface (bert-large-uncased-whole-word-masking-finetuned-squad). But when we fine-tune it ourselves, even though we get a 93.XX F1 score, the accuracy of the model is not the same as in the demo.

What other parameters did huggingface set to generate the "bert-large-uncased-whole-word-masking-finetuned-squad" model?
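
For what it's worth, one quick way to compare the two checkpoints is to run the hub model and your own fine-tuned output through the same question/context pair and eyeball the answers. The sketch below is only a spot check, not the SQuAD evaluation; the question/context are made-up toy values and the local path is just the --output_dir from the command above:

import torch
from transformers import BertForQuestionAnswering, BertTokenizer

def answer(model_name_or_path, question, context):
    # Load a checkpoint and pick the highest-scoring start/end span.
    tokenizer = BertTokenizer.from_pretrained(model_name_or_path)
    model = BertForQuestionAnswering.from_pretrained(model_name_or_path)
    model.eval()
    inputs = tokenizer.encode_plus(question, context, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    start_logits, end_logits = outputs[0], outputs[1]  # works for tuple or output-object returns
    start = torch.argmax(start_logits)
    end = torch.argmax(end_logits) + 1
    return tokenizer.decode(inputs["input_ids"][0][start:end].tolist())

question = "Who wrote the play?"                                  # toy example
context = "The play Hamlet was written by William Shakespeare."
print(answer("bert-large-uncased-whole-word-masking-finetuned-squad", question, context))
print(answer("../models/wwm_uncased_finetuned_squad/", question, context))  # your --output_dir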

LysandreJik commented 4 years ago

If the only difference between the command you used and the command available here is the batch size, you could try adjusting the gradient accumulation so that the effective batch size is unchanged. For example, if you set the batch size to 6 (a quarter of the specified 24), you can multiply the gradient accumulation steps by 4 (12 -> 48) so that you keep the same effective batch size.
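
Concretely, a quick sanity check with those numbers (8 GPUs as in the original launch command):

n_gpus = 8

def effective_batch_size(per_gpu_batch, accumulation_steps):
    # Examples seen per optimizer step across all GPUs.
    return per_gpu_batch * n_gpus * accumulation_steps

original = effective_batch_size(24, 12)  # flags from the original command
adjusted = effective_batch_size(6, 48)   # quarter the batch, 4x the accumulation
assert original == adjusted == 2304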

What exact_match result did you obtain alongside the 93.xx F1 score?

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.