huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

run_squad.py crashes during do_eval #1955

Closed Phirefly9 closed 4 years ago

Phirefly9 commented 4 years ago

πŸ› Bug

When running run_squad.py as provided in the README, the prediction/evaluation component of the script crashes once training is complete.

No prediction files are written.

Model I am using (Bert, XLNet....): Bert

Language I am using the model on (English, Chinese....): English

The problem arises when using: the official example script (examples/run_squad.py).

The task I am working on is: an official task (SQuAD v1.1).

To Reproduce

Steps to reproduce the behavior:

  1. Finish training as specified in the README.
  2. I ran with this command: CUDA_VISIBLE_DEVICES=10,11,12,13,14,15 python -m torch.distributed.launch --nproc_per_node=6 ./examples/run_squad.py --model_type bert --model_name_or_path bert-large-uncased-whole-word-masking --do_train --do_eval --do_lower_case --train_file /data/data/SQUAD/train-v1.1.json --predict_file /data/data/SQUAD/dev-v1.1.json --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir models/wwm_uncased_finetuned_squad_supp/ --per_gpu_eval_batch_size=6 --per_gpu_train_batch_size=6 --save_steps 500

I also ran with just --do_eval using the same model, and it produced the same error:

11/26/2019 18:29:01 - INFO - __main__ - Saving features into cached file /data/data/SQUAD/cached_dev_bert-large-uncased-whole-word-masking_384
11/26/2019 18:29:20 - INFO - __main__ - Running evaluation
11/26/2019 18:29:20 - INFO - __main__ - Num examples = 10833
11/26/2019 18:29:20 - INFO - __main__ - Batch size = 6
Evaluating: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 301/301 [00:48<00:00, 6.20it/s]
11/26/2019 18:30:09 - INFO - __main__ - Evaluation done in total 48.868770 secs (0.004511 sec per example)
11/26/2019 18:30:09 - INFO - utils_squad - Writing predictions to: models/wwm_uncased_finetuned_squad_supp/predictions.json
11/26/2019 18:30:09 - INFO - utils_squad - Writing nbest to: models/wwm_uncased_finetuned_squad_supp/nbest_predictions.json
Traceback (most recent call last):
  File "./examples/run_squad.py", line 573, in <module>
    main()
  File "./examples/run_squad.py", line 562, in main
    result = evaluate(args, model, tokenizer, prefix=global_step)
  File "./examples/run_squad.py", line 284, in evaluate
    args.version_2_with_negative, args.null_score_diff_threshold)
  File "/home/clong/git/transformers/examples/utils_squad.py", line 532, in write_predictions
    result = unique_id_to_result[feature.unique_id]
KeyError: 1000000000
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 253, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 249, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', './examples/run_squad.py', '--local_rank=5', '--model_type', 'bert', '--model_name_or_path', 'bert-large-uncased-whole-word-masking', '--do_eval', '--do_lower_case', '--train_file', '/data/data/SQUAD/train-v1.1.json', '--predict_file', '/data/data/SQUAD/dev-v1.1.json', '--learning_rate', '3e-5', '--num_train_epochs', '2', '--max_seq_length', '384', '--doc_stride', '128', '--output_dir', 'models/wwm_uncased_finetuned_squad_supp/', '--per_gpu_eval_batch_size=6', '--per_gpu_train_batch_size=6', '--save_steps', '500']' returned non-zero exit status 1.
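The KeyError is raised by the per-feature lookup in write_predictions (examples/utils_squad.py, line 532 in the traceback above): the function walks the full cached set of dev features and looks each one up in a dict built from the model outputs collected during evaluation. A minimal, self-contained sketch of that pattern (RawResult matches the namedtuple utils_squad.py uses; the ids and shard split here are made up for illustration):

```python
from collections import namedtuple

# Simplified stand-ins for the objects run_squad.py / utils_squad.py pass around.
RawResult = namedtuple("RawResult", ["unique_id", "start_logits", "end_logits"])
Feature = namedtuple("Feature", ["unique_id"])

# Hypothetical numbers: this process produced results for only half of the
# features, while the feature cache covers the whole dev set.
all_results = [RawResult(1000000000 + i, [], []) for i in range(5)]
all_features = [Feature(1000000000 + i) for i in range(10)]

unique_id_to_result = {r.unique_id: r for r in all_results}
for feature in all_features:
    # Raises KeyError (here KeyError: 1000000005) for any cached feature
    # that this process never evaluated.
    result = unique_id_to_result[feature.unique_id]
```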

Expected behavior

No crash; the prediction files are written.

Environment

ae86zhizhi commented 4 years ago

same issue

Phirefly9 commented 4 years ago

I tracked it down further this morning and found the problem: you cannot run do_eval in PyTorch distributed mode. do_eval works completely fine when PyTorch distributed is not in the equation. This should probably result in a change to the README.
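This diagnosis also explains the KeyError above: under torch.distributed.launch the DistributedSampler gives each rank only a shard of the dev set, so each process builds unique_id_to_result from a subset of results while write_predictions iterates over every cached feature. The simplest workaround, per the above, is to re-run the eval step without the torch.distributed.launch wrapper (plain python ./examples/run_squad.py ... --do_eval ...). Inside the script, the usual fix is to evaluate only on the main process; a sketch of that guard, using run_squad.py's own names (args, model, tokenizer, global_step, evaluate), not a merged patch:

```python
# Inside run_squad.py's main(): evaluate only on the main process so that
# write_predictions sees a result for every cached feature. local_rank is
# -1 when torch distributed is not used, 0 for the first process.
if args.do_eval and args.local_rank in [-1, 0]:
    result = evaluate(args, model, tokenizer, prefix=global_step)
```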

mandubian commented 4 years ago

πŸ‘Just ran it, I confirm that do_eval runs well without distributed mode

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.