Closed: a-maci closed this issue 4 years ago
@ahotrod do you have any fix for this bug?
@a-maci no, unfortunately not, still searching. I'm considering rolling back to Transformers 2.0.0 or even pytorch-transformers 1.2.0, one or both of which didn't produce this error in my earlier SQuAD replications.
@ahotrod do you have any fix for this bug?
@a-maci I needed XLNet fine-tuned on SQuAD 2.0 with a 512 max_seq_length. I found "A" solution: I went back to the original XLNet paper's GitHub for the "native" code. I could fit a batch size of 1 on each of two 1080 Ti GPUs; 85,000 steps took ~14.5 hr of fine-tuning, with results EM / F1: 84.5 / 87.1.
INFO:tensorflow:Result | best_exact 84.52792049187232 | best_exact_thresh -2.716632127761841 | best_f1 87.12844471348052 | best_f1_thresh -2.447098970413208 | has_ans_exact 0.8733130904183536 | has_ans_f1 0.9327569452896122 |
Possibly try the BERT paper's "native" code?
I've described the bug here: https://github.com/huggingface/transformers/issues/940#issuecomment-547686206
The workaround is either to use DataParallel (remove `-m torch.distributed.launch --nproc_per_node=8`) or to skip evaluation in the same run (remove `--do_eval`). You can evaluate the model after training with:
python examples/run_squad.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--do_eval \
--do_lower_case \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--train_file $SQUAD_DIR/train-v1.1.json \
--output_dir ./models/wwm_uncased_finetuned_squad/
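The saved checkpoint can also be reloaded directly in Python for a separate, single-process evaluation. A minimal sketch, assuming the fine-tuned weights were written to the output directory shown above (`run_squad.py` saves them with `save_pretrained` at the end of training):

```python
# Minimal sketch: reload a fine-tuned SQuAD checkpoint for a separate,
# single-process evaluation run. Assumes ./models/wwm_uncased_finetuned_squad/
# contains the weights saved by run_squad.py via save_pretrained.
from transformers import BertForQuestionAnswering, BertTokenizer

model_dir = "./models/wwm_uncased_finetuned_squad/"
tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertForQuestionAnswering.from_pretrained(model_dir)
model.eval()  # inference mode; build SQuAD features and post-process as run_squad.py does
```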
As mentioned in #940, happy to welcome a PR to fix this case if someone from the community wants to contribute (I don't have the bandwidth for this issue at the moment).
Maybe try changing `args.local_rank == -1` to `args.local_rank in [-1, 0]` at this line? https://github.com/huggingface/transformers/blob/079bfb32fba4f2b39d344ca7af88d79a3ff27c7c/examples/run_squad.py#L216
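For illustration, a minimal sketch of what that change could look like around the evaluation call (the exact surrounding code in `run_squad.py` may differ; this is not a verbatim patch):

```python
# Illustrative sketch of the suggested guard, not the exact repository code:
# evaluate when not running under torch.distributed (local_rank == -1),
# or when this is the main distributed process (local_rank == 0).
if args.do_eval and args.local_rank in [-1, 0]:
    results = evaluate(args, model, tokenizer)
```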
I think `evaluate` is only called from the main process (`local_rank == 0`) if you're using multiple GPUs (reference: https://github.com/huggingface/transformers/blob/079bfb32fba4f2b39d344ca7af88d79a3ff27c7c/examples/run_squad.py#L543).
It makes more sense to just remove the `DistributedSampler` case entirely. The problem is that `all_results` doesn't get gathered from all GPUs. Unless you also implement a gather, you shouldn't use `DistributedSampler` at all.
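For reference, a minimal sketch of what such a gather could look like. It assumes each process holds its own `all_results` list and that `torch.distributed.all_gather_object` is available (PyTorch 1.8+); it is illustrative only, not the repository's code:

```python
import torch.distributed as dist

def gather_all_results(local_results):
    # Collect the per-rank `all_results` lists so the main process can run
    # the SQuAD metrics over predictions for the whole dev set.
    if not dist.is_available() or not dist.is_initialized():
        return local_results  # single-process run: nothing to gather
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_results)  # requires PyTorch >= 1.8
    # Flatten the list of per-rank lists into a single list of results.
    return [r for rank_results in gathered for r in rank_results]
```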
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Is there a fix for this? I'm seeing the same issue when running only evaluation on CPU too.
Are you trying to do multiprocess evaluation? A single CPU process should work; my workaround above is to run eval separately as a single process.
🐛 Bug
Model: BERT (`bert-large-uncased-whole-word-masking`)
The problem arises when using: the official example script (`run_squad.py`) for fine-tuning on SQuAD data.
The task I am working on is: SQuAD question answering.
Here is the error log:
Additional context
The problem described above shows up when running on multiple GPUs.