More precisely, it hangs on line 280:
if args.local_rank == 0:
HERE --->    torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
# Convert to Tensors and build dataset
all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
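For context, the caching logic around that line follows a barrier rendezvous pattern roughly like the sketch below (a paraphrase, not the exact source; build_or_load_cached_features is a made-up placeholder). Each rank is expected to reach this function and call torch.distributed.barrier() exactly once, so if only one process ever gets here, its barrier is never matched and it waits forever:

import torch
import torch.distributed as dist

def load_and_cache_examples_sketch(args, tokenizer, evaluate=False):
    if args.local_rank not in [-1, 0]:
        # Non-first processes wait here while rank 0 builds the cache.
        dist.barrier()

    features = build_or_load_cached_features(args, tokenizer, evaluate)  # placeholder helper

    if args.local_rank == 0:
        # Rank 0 releases the other processes once the cache exists.
        # This is the barrier() the hang points at: it blocks until every
        # other rank has also called barrier() above.
        dist.barrier()

    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    return all_input_ids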
What exact command are you using to run the script?
I also encountered a similar problem when running the SQuAD example, and my PyTorch and Python environments match yours. My running script is:
python -m torch.distributed.launch --nproc_per_node=4 ./examples/run_squad.py \
--model_type bert \
--model_name_or_path bert-large-uncased-whole-word-masking \
--do_eval \
--do_lower_case \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ../models/wwm_uncased_finetuned_squad/ \
--per_gpu_eval_batch_size=1 \
--per_gpu_train_batch_size=1 \
--save_steps 10000
Please Help!
What's more, training is OK! But the evaluation hits the problem above.
What exact command are you using to run the script?
I think I have encountered a similar problem; I have already posted my running script above.
This is what I was running.
python -m torch.distributed.launch --nproc_per_node 4 ./examples/run_glue.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--task_name MRPC \
--do_train \
--do_eval \
--do_lower_case \
--data_dir $GLUE_DIR/MRPC/ \
--max_seq_length 128 \
--per_gpu_eval_batch_size=8 \
--per_gpu_train_batch_size=8 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/ \
--overwrite_output_dir \
--overwrite_cache
The issue seems to be that the processes other than the main one never enter the evaluation section, so the main process waits at a barrier for them to join.
I managed to fix the issue with the change below; I can push a PR if you'd like. SQuAD seems to have the same problem.
  # Evaluation
  results = {}
- if args.do_eval and args.local_rank in [-1, 0]:
+ if args.do_eval:
+     if args.local_rank != -1:
+         torch.distributed.barrier()
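For illustration, here is a minimal standalone script (hypothetical, written just for this thread) that reproduces the same mismatch: the guarded barrier on rank 0 is never matched by the other ranks, so the run never finishes. It assumes the gloo backend so it also runs without GPUs:

# minimal_barrier_hang.py
# Launch with: python -m torch.distributed.launch --nproc_per_node=2 minimal_barrier_hang.py
import argparse
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

# torch.distributed.launch sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE for env:// init.
dist.init_process_group(backend="gloo")

# "Training": every rank participates, so the collective calls line up.
dist.barrier()

# "Evaluation": only rank 0 enters the block and then waits for a barrier
# that the other ranks never call, since they fall straight through to the end.
if args.local_rank in [-1, 0]:
    dist.barrier()  # hangs with --nproc_per_node > 1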
We should not allow running the example scripts in distributed mode when only evaluation is done, since evaluation can only run on a single GPU anyway: the metrics cannot be computed in a distributed setting, because some of the GLUE metrics are not additive with respect to the size of the evaluation dataset.
In your case, the answer is simply not to run the script in distributed mode when you only do evaluation.
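As a toy illustration of the "not additive" point, here is a sketch (assuming scikit-learn is installed, with made-up labels) showing that averaging F1 scores computed on shards of the dev set generally does not equal the F1 score on the full set:

# F1 is not additive over shards, so per-process scores cannot simply be averaged.
from sklearn.metrics import f1_score

y_true = [1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]

full_f1 = f1_score(y_true, y_pred)          # F1 on the whole dev set
shard_a = f1_score(y_true[:4], y_pred[:4])  # F1 on the shard seen by process 0
shard_b = f1_score(y_true[4:], y_pred[4:])  # F1 on the shard seen by process 1

print(full_f1, (shard_a + shard_b) / 2)     # the two values differ in general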
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
🐛 Bug
Model I am using (Bert, XLNet....): BERT base uncased
Language I am using the model on (English, Chinese....): English
The problem arises when using:
To Reproduce
Steps to reproduce the behavior:
08/09/2019 18:02:56 - INFO - __main__ - Loading features from cached file /home/taavi/hackathon/glue_data/MRPC/cached_dev_bert-base-uncased_128_mrpc
Expected behavior
Expected to get eval results and for the script to exit with 0.
Environment