Closed: Phirefly9 closed this issue 4 years ago
Same issue.
I tracked it down further this morning and found the problem: you cannot run do_eval in PyTorch distributed mode. do_eval works completely fine when PyTorch distributed is not in the equation. This should probably result in a change to the README.
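A plausible mechanism for this, sketched with hypothetical names (this is not the actual run_squad.py code): under torch.distributed, each process's sampler sees only a shard of the dev set, so any single process's unique_id_to_result mapping is missing the features evaluated by the other ranks.

```python
# Sketch of the suspected failure mode; names are illustrative,
# not the real run_squad.py internals.
all_unique_ids = [1000000000 + i for i in range(4)]  # every dev feature

world_size, rank = 2, 0
shard = all_unique_ids[rank::world_size]  # DistributedSampler-style split

# Each rank only collects results for its own shard...
unique_id_to_result = {uid: "result" for uid in shard}

# ...so features owned by other ranks are absent from the dict, which is
# exactly the kind of lookup that raises KeyError in write_predictions.
missing = [uid for uid in all_unique_ids if uid not in unique_id_to_result]
print(missing)  # ids this rank never evaluated
```

Running the whole dev set in a single process makes `missing` empty, which is consistent with do_eval working fine outside distributed mode.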
Just ran it; I confirm that do_eval runs well without distributed mode.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
🐛 Bug
When running run_squad.py as provided in the README, the prediction/evaluation component of the script crashes once training is complete.
No prediction files are written.
Model I am using (Bert, XLNet....): Bert
Language I am using the model on (English, Chinese....): English
The problem arises when using:
The task I am working on is:
To Reproduce
Steps to reproduce the behavior:
I also ran with just do_eval using the same model and it produced the same error
11/26/2019 18:29:01 - INFO - __main__ - Saving features into cached file /data/data/SQUAD/cached_dev_bert-large-uncased-whole-word-masking_384
11/26/2019 18:29:20 - INFO - __main__ - Running evaluation
11/26/2019 18:29:20 - INFO - __main__ - Num examples = 10833
11/26/2019 18:29:20 - INFO - __main__ - Batch size = 6
Evaluating: 100%|██████████| 301/301 [00:48<00:00, 6.20it/s]
11/26/2019 18:30:09 - INFO - __main__ - Evaluation done in total 48.868770 secs (0.004511 sec per example)
11/26/2019 18:30:09 - INFO - utils_squad - Writing predictions to: models/wwm_uncased_finetuned_squad_supp/predictions.json
11/26/2019 18:30:09 - INFO - utils_squad - Writing nbest to: models/wwm_uncased_finetuned_squad_supp/nbestpredictions.json
Traceback (most recent call last):
  File "./examples/run_squad.py", line 573, in <module>
    main()
  File "./examples/run_squad.py", line 562, in main
    result = evaluate(args, model, tokenizer, prefix=global_step)
  File "./examples/run_squad.py", line 284, in evaluate
    args.version_2_with_negative, args.null_score_diff_threshold)
  File "/home/clong/git/transformers/examples/utils_squad.py", line 532, in write_predictions
    result = unique_id_to_result[feature.unique_id]
KeyError: 1000000000
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 253, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 249, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', './examples/run_squad.py', '--local_rank=5', '--model_type', 'bert', '--model_name_or_path', 'bert-large-uncased-whole-word-masking', '--do_eval', '--do_lower_case', '--train_file', '/data/data/SQUAD/train-v1.1.json', '--predict_file', '/data/data/SQUAD/dev-v1.1.json', '--learning_rate', '3e-5', '--num_train_epochs', '2', '--max_seq_length', '384', '--doc_stride', '128', '--output_dir', 'models/wwm_uncased_finetuned_squad_supp/', '--per_gpu_eval_batch_size=6', '--per_gpu_train_batch_size=6', '--save_steps', '500']' returned non-zero exit status 1.
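Given the finding later in this thread that do_eval works outside distributed mode, one workaround sketch is to rerun the evaluation pass without the torch.distributed.launch wrapper, so a single process sees the whole dev set (flags and paths below are taken from the failing command above; adjust for your setup):

```shell
# Workaround sketch: invoke run_squad.py directly, not via
# torch.distributed.launch, so --do_eval runs in one process.
python ./examples/run_squad.py \
  --model_type bert \
  --model_name_or_path bert-large-uncased-whole-word-masking \
  --do_eval \
  --do_lower_case \
  --predict_file /data/data/SQUAD/dev-v1.1.json \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir models/wwm_uncased_finetuned_squad_supp/ \
  --per_gpu_eval_batch_size=6
```

This sidesteps the sharded evaluation rather than fixing it; multi-GPU training followed by single-process evaluation was the pattern that worked for the commenters here.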
Expected behavior
No crash; prediction files are written.
Environment