Closed: Phirefly9 closed this issue 4 years ago
Same issue.
I tracked it down further this morning and found the problem: you cannot run do_eval in PyTorch distributed mode. do_eval works completely fine when PyTorch distributed is not in the equation. This should probably result in a change to the README.
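A plausible mechanism for this, sketched with hypothetical names (this is not the actual run_squad.py code): under torch.distributed, each process's sampler sees only a shard of the dev set, so any single process's unique_id_to_result mapping is missing the features evaluated by the other ranks.

```python
# Sketch of the suspected failure mode; names are illustrative,
# not the real run_squad.py internals.
all_unique_ids = [1000000000 + i for i in range(4)]  # every dev feature

world_size, rank = 2, 0
shard = all_unique_ids[rank::world_size]  # DistributedSampler-style split

# Each rank only collects results for its own shard...
unique_id_to_result = {uid: "result" for uid in shard}

# ...so features owned by other ranks are absent from the dict, which is
# exactly the kind of lookup that raises KeyError in write_predictions.
missing = [uid for uid in all_unique_ids if uid not in unique_id_to_result]
print(missing)  # ids this rank never evaluated
```

Running the whole dev set in a single process makes `missing` empty, which is consistent with do_eval working fine outside distributed mode.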
Just ran it; I confirm that do_eval runs well without distributed mode.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
🐛 Bug
When running run_squad.py as provided in the README, the prediction/evaluation component of the script crashes once training is complete.
No prediction files are written.
Model I am using (Bert, XLNet....): Bert
Language I am using the model on (English, Chinese....): English
The problem arises when using:
The task I am working on is:
To Reproduce
Steps to reproduce the behavior:
I also ran with just do_eval using the same model and it produced the same error
11/26/2019 18:29:01 - INFO - __main__ - Saving features into cached file /data/data/SQUAD/cached_dev_bert-large-uncased-whole-word-masking_384
11/26/2019 18:29:20 - INFO - __main__ - Running evaluation
11/26/2019 18:29:20 - INFO - __main__ - Num examples = 10833
11/26/2019 18:29:20 - INFO - __main__ - Batch size = 6
Evaluating: 100%|██████████| 301/301 [00:48<00:00, 6.20it/s]
11/26/2019 18:30:09 - INFO - __main__ - Evaluation done in total 48.868770 secs (0.004511 sec per example)
11/26/2019 18:30:09 - INFO - utils_squad - Writing predictions to: models/wwm_uncased_finetuned_squad_supp/predictions.json
11/26/2019 18:30:09 - INFO - utils_squad - Writing nbest to: models/wwm_uncased_finetuned_squad_supp/nbestpredictions.json
Traceback (most recent call last):
  File "./examples/run_squad.py", line 573, in <module>
    main()
  File "./examples/run_squad.py", line 562, in main
    result = evaluate(args, model, tokenizer, prefix=global_step)
  File "./examples/run_squad.py", line 284, in evaluate
    args.version_2_with_negative, args.null_score_diff_threshold)
  File "/home/clong/git/transformers/examples/utils_squad.py", line 532, in write_predictions
    result = unique_id_to_result[feature.unique_id]
KeyError: 1000000000
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 253, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 249, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', './examples/run_squad.py', '--local_rank=5', '--model_type', 'bert', '--model_name_or_path', 'bert-large-uncased-whole-word-masking', '--do_eval', '--do_lower_case', '--train_file', '/data/data/SQUAD/train-v1.1.json', '--predict_file', '/data/data/SQUAD/dev-v1.1.json', '--learning_rate', '3e-5', '--num_train_epochs', '2', '--max_seq_length', '384', '--doc_stride', '128', '--output_dir', 'models/wwm_uncased_finetuned_squad_supp/', '--per_gpu_eval_batch_size=6', '--per_gpu_train_batch_size=6', '--save_steps', '500']' returned non-zero exit status 1.
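Given the finding later in this thread that do_eval works outside distributed mode, one workaround sketch is to rerun the evaluation pass without the torch.distributed.launch wrapper, so a single process sees the whole dev set (flags and paths below are taken from the failing command above; adjust for your setup):

```shell
# Workaround sketch: invoke run_squad.py directly, not via
# torch.distributed.launch, so --do_eval runs in one process.
python ./examples/run_squad.py \
  --model_type bert \
  --model_name_or_path bert-large-uncased-whole-word-masking \
  --do_eval \
  --do_lower_case \
  --predict_file /data/data/SQUAD/dev-v1.1.json \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir models/wwm_uncased_finetuned_squad_supp/ \
  --per_gpu_eval_batch_size=6
```

This sidesteps the sharded evaluation rather than fixing it; multi-GPU training followed by single-process evaluation was the pattern that worked for the commenters here.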
Expected behavior
No crash; prediction files are written.
Environment