huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Running the torch.distributed.launch example of GLUE hangs at evaluation #998

Closed · taavi-primer closed this issue 5 years ago

taavi-primer commented 5 years ago

🐛 Bug

Model I am using (Bert, XLNet....): BERT base uncased

Language I am using the model on (English, Chinese....): English

The problem arises when using the official example script (run_glue.py).

To Reproduce

Steps to reproduce the behavior:

  1. Run the GLUE example from the documentation on a machine with 4 GPUs (the only changes I made were switching the base model to BERT base uncased and setting the number of GPUs to 4)
  2. Training completes fine
  3. Script tries to evaluate - hangs at:

08/09/2019 18:02:56 - INFO - main - Loading features from cached file /home/taavi/hackathon/glue_data/MRPC/cached_dev_bert-base-uncased_128_mrpc

Expected behavior

Expected to get eval results and for the script to exit with 0.

Environment

taavi-primer commented 5 years ago

More precisely, it hangs on line 280:

    if args.local_rank == 0:
        torch.distributed.barrier()  # <--- HERE. Make sure only the first process in distributed training processes the dataset; the others will use the cache

    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
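
For context, that barrier is the second half of the usual "rank 0 builds the cache" handshake. A minimal sketch of the idiom (simplified, with a hypothetical build_or_load_features helper; not the exact run_glue.py code):

import torch
import torch.distributed as dist
from torch.utils.data import TensorDataset

def load_and_cache_examples(args, task, tokenizer, evaluate=False):
    # Every rank except 0 waits here, so that only rank 0 tokenizes and writes the cache file.
    if args.local_rank not in [-1, 0]:
        dist.barrier()

    # Hypothetical helper: load the cached features if present, otherwise build and save them.
    features = build_or_load_features(args, task, tokenizer, evaluate)

    # Rank 0 releases the waiting ranks once the cache exists; they then read it from disk.
    # This is the barrier that hangs above: during evaluation, only rank 0 ever calls
    # load_and_cache_examples, so no other rank reaches a matching barrier.
    if args.local_rank == 0:
        dist.barrier()

    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    return TensorDataset(all_input_ids)
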
thomwolf commented 5 years ago

What exact command are you using to run the script?

ZhouWlnd commented 5 years ago

I also encountered a similar problem when running the SQuAD example, and my PyTorch and Python environment is consistent with yours. My command is:

python -m torch.distributed.launch --nproc_per_node=4 ./examples/run_squad.py \
    --model_type bert \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --do_eval \
    --do_lower_case \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ../models/wwm_uncased_finetuned_squad/ \
    --per_gpu_eval_batch_size=1  \
    --per_gpu_train_batch_size=1   \
    --save_steps 10000

Please help!

What's more, training works fine, but evaluation hangs as described above.

ZhouWlnd commented 5 years ago


> What exact command are you using to run the script?

I think I have encountered a similar problem; I have already posted my command above.

taavi-primer commented 5 years ago

This is what I was running.

python -m torch.distributed.launch --nproc_per_node 4 ./examples/run_glue.py \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --task_name MRPC \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir $GLUE_DIR/MRPC/ \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=8 \
    --per_gpu_train_batch_size=8 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/mrpc_output/ \
    --overwrite_output_dir \
    --overwrite_cache

The issue seems to be that the processes other than the main one never enter the evaluation section, so the main process waits at a barrier for them to join.

I managed to fix the issue with the change below; I can push a PR if you'd like. The SQuAD example seems to have the same problem.

     # Evaluation
     results = {}
-    if args.do_eval and args.local_rank in [-1, 0]:
+    if args.do_eval:
+        if args.local_rank != -1:
+            torch.distributed.barrier()
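
In other words, every rank has to reach the same number of barrier() calls. A standalone toy script (my own sketch, not part of the examples) reproduces the same kind of hang under torch.distributed.launch:

# toy_barrier_hang.py
# Run with: python -m torch.distributed.launch --nproc_per_node 2 toy_barrier_hang.py
import argparse
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)  # injected by torch.distributed.launch
args = parser.parse_args()

# The launcher sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE, so env:// init just works.
dist.init_process_group(backend="gloo")

# Only rank 0 enters this branch, mirroring run_glue.py where only rank 0 runs evaluation
# and hits the barrier in load_and_cache_examples: it blocks waiting for peers that never arrive.
if args.local_rank == 0:
    dist.barrier()

print(f"rank {args.local_rank} done")  # rank 0 never gets here
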
thomwolf commented 5 years ago

We should not allow running the example script in distributed mode when only evaluation is done, since evaluation can only run on a single GPU anyway (the metrics cannot be computed in a distributed setting because some of the GLUE metrics are not additive with respect to the size of the evaluation dataset).

In your case, the answer is simply not to run the script in distributed mode when you only do evaluation.
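
To illustrate the non-additivity point, here is a toy example (my own, not from the repo) using scikit-learn's f1_score, which MRPC reports alongside accuracy: computing F1 per shard and averaging does not recover F1 on the full dev set.

from sklearn.metrics import f1_score

labels = [1, 1, 0, 0, 1, 0, 1, 1]
preds  = [1, 0, 0, 1, 1, 0, 0, 1]

# Metric on the full evaluation set.
full_f1 = f1_score(labels, preds)                 # ~0.667

# The same examples split across two "GPUs".
shard_f1 = [f1_score(labels[:4], preds[:4]),      # 0.5
            f1_score(labels[4:], preds[4:])]      # 0.8
avg_f1 = sum(shard_f1) / len(shard_f1)            # 0.65, not 0.667

print(full_f1, avg_f1)

Accuracy, by contrast, can be aggregated from per-shard correct/total counts, which is why only some metrics would survive a sharded evaluation.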

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.