huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Bug when finetuning model on Squad #1472

Closed a-maci closed 4 years ago

a-maci commented 4 years ago

🐛 Bug

Model: Bert (bert-large-uncased-whole-word-masking)

The problem arises when using the official example script for fine-tuning on SQuAD data:

python -m torch.distributed.launch --nproc_per_node=8 run_squad.py \
      --model_type bert \
      --model_name_or_path bert-large-uncased-whole-word-masking \
      --do_train \
      --do_eval \
      --do_lower_case \
      --train_file $SQUAD_DIR/train-v1.1.json \
      --predict_file $SQUAD_DIR/dev-v1.1.json \
      --learning_rate 3e-5 \
      --num_train_epochs 2 \
      --max_seq_length 384 \
      --doc_stride 128 \
      --output_dir ./models/wwm_uncased_finetuned_squad/ \
      --per_gpu_eval_batch_size 3 \
      --per_gpu_train_batch_size 3 \
      --save_steps 1500 \
      --logging_steps 250 \
      --fp16 

The task I am working on is: an official SQuAD task (SQuAD v1.1, per the train/dev files above).

Here is the error log:

...
10/09/2019 17:03:29 - INFO - utils_squad -   Writing predictions to: ./models/wwm_uncased_finetuned_squad/predictions_.json
10/09/2019 17:03:29 - INFO - utils_squad -   Writing nbest to: ./models/wwm_uncased_finetuned_squad/nbest_predictions_.json
Traceback (most recent call last):
  File "run_squad.py", line 537, in <module>
    main()
  File "run_squad.py", line 526, in main
    result = evaluate(args, model, tokenizer, prefix=global_step)
  File "run_squad.py", line 268, in evaluate
    args.version_2_with_negative, args.null_score_diff_threshold)
  File "/dl/huggingface-bert/transformers/examples/SQuAD_runs/rundir/utils_squad.py", line 511, in write_predictions
    result = unique_id_to_result[feature.unique_id]
KeyError: 1000000000

Additional context

The above problem shows up when running on multiple GPUs.
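
For context (this is the cause identified further down the thread), here is a toy sketch of the failure mode, with made-up names rather than the actual utils_squad.py code: when the results dict is built from only one process's shard of the eval features, a write_predictions-style lookup over all features hits ids that were scored on another rank.

# Toy illustration only: the predictions dict covers a single rank's shard,
# but the lookup loop walks every eval feature.
all_feature_ids = [1000000000 + i for i in range(8)]        # every eval feature's unique_id
scored_on_this_rank = all_feature_ids[1::2]                 # pretend this rank saw only half
unique_id_to_result = {uid: "raw result" for uid in scored_on_this_rank}

for uid in all_feature_ids:
    result = unique_id_to_result[uid]   # KeyError: 1000000000 -- scored on another rank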

ahotrod commented 4 years ago

https://github.com/huggingface/transformers/issues/940

a-maci commented 4 years ago

@ahotrod do you have any fix for this bug?

ahotrod commented 4 years ago

> @ahotrod do you have any fix for this bug?

@a-maci No, unfortunately not; still searching. I'm considering rolling back to Transformers 2.0.0 or even pytorch-transformers 1.2.0, one or both of which didn't produce this error in my earlier SQuAD replications.

ahotrod commented 4 years ago

> @ahotrod do you have any fix for this bug?

@a-maci I needed XLNet fine-tuned on SQuAD 2.0 with 512 max_seq_length. I found "a" solution: I went back to the original XLNet paper's GitHub repo for the "native" code. I could fit a batch of 1 on each of two 1080 Ti GPUs; 85,000 steps and ~14.5 hours of fine-tuning gave EM / F1: 84.5 / 87.1.

INFO:tensorflow:Result | best_exact 84.52792049187232 | best_exact_thresh -2.716632127761841 | best_f1 87.12844471348052 | best_f1_thresh -2.447098970413208 | has_ans_exact 0.8733130904183536 | has_ans_f1 0.9327569452896122 |

Possibly try the BERT paper's "native" code?

immawatson commented 4 years ago

I've described the bug here: https://github.com/huggingface/transformers/issues/940#issuecomment-547686206

The workaround is either to use DataParallel (remove -m torch.distributed.launch --nproc_per_node=8) or to skip eval in the same run (remove --do_eval). You can then evaluate the model after training with:

python examples/run_squad.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--do_eval \
--do_lower_case \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--train_file $SQUAD_DIR/train-v1.1.json \
--output_dir ./models/wwm_uncased_finetuned_squad/
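
(For context, a rough paraphrase of the eval path in the run_squad.py of that era, not verbatim code: with --do_eval the script loads the model weights saved in --output_dir, which is why the command above evaluates the fine-tuned checkpoint.)

# Paraphrased sketch of run_squad.py's eval path at the time (not verbatim;
# args, model_class, tokenizer and evaluate() come from the script itself).
if args.do_eval and args.local_rank in [-1, 0]:
    checkpoints = [args.output_dir]              # the directory the training run wrote to
    for checkpoint in checkpoints:
        model = model_class.from_pretrained(checkpoint)
        model.to(args.device)
        result = evaluate(args, model, tokenizer, prefix="")
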
thomwolf commented 4 years ago

As mentioned in #940, happy to welcome a PR to fix this case if someone from the community wants to contribute (I don't have the bandwidth for this issue at the moment).

cherry979988 commented 4 years ago

Maybe try changing args.local_rank == -1 to args.local_rank in [-1, 0] at this line? https://github.com/huggingface/transformers/blob/079bfb32fba4f2b39d344ca7af88d79a3ff27c7c/examples/run_squad.py#L216

I think evaluate is only called in the main process (local_rank == 0) when you're using multiple GPUs (reference: https://github.com/huggingface/transformers/blob/079bfb32fba4f2b39d344ca7af88d79a3ff27c7c/examples/run_squad.py#L543)
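
A sketch of what that change looks like (paraphrasing the sampler selection in evaluate(); dataset and args come from the surrounding function, and the exact line may differ from the linked commit):

from torch.utils.data import SequentialSampler
from torch.utils.data.distributed import DistributedSampler

# Current behaviour: under torch.distributed.launch the main process also gets a
# DistributedSampler, so it only scores its own shard of the eval features.
eval_sampler = SequentialSampler(dataset) if args.local_rank == -1 else DistributedSampler(dataset)

# Suggested change: local_rank 0 is the only process that reaches write_predictions(),
# so let it iterate over the full eval dataset sequentially.
eval_sampler = SequentialSampler(dataset) if args.local_rank in [-1, 0] else DistributedSampler(dataset)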

immawatson commented 4 years ago

It makes more sense to just remove the DistributedSampler case from evaluation entirely. The problem is that all_results doesn't get gathered from all GPUs; unless you also implement a gather, you shouldn't use DistributedSampler at all.
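
If you did want to keep DistributedSampler for eval, the gather would look roughly like this (a sketch only, assuming a newer PyTorch that provides torch.distributed.all_gather_object; this is not part of run_squad.py):

import torch.distributed as dist

def gather_all_results(all_results):
    """Collect each rank's list of RawResult objects onto every process."""
    if not (dist.is_available() and dist.is_initialized()):
        return all_results
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, all_results)
    # Flatten and de-duplicate: DistributedSampler pads the last batch, so a few
    # unique_ids can show up on more than one rank.
    merged, seen = [], set()
    for rank_results in gathered:
        for result in rank_results:
            if result.unique_id not in seen:
                seen.add(result.unique_id)
                merged.append(result)
    return merged

# e.g. in evaluate(), before write_predictions():
# all_results = gather_all_results(all_results)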

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

shubhangi-tandon commented 4 years ago

Is there a fix for this? I'm seeing the same issue when running only evaluation on CPU too.

immawatson commented 4 years ago

Are you trying to do multiprocess evaluation? A single CPU process should work; my workaround above is to run eval separately as a single process.