huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

bug: eval_accumulation_steps can lead to incorrect metrics #24734

Closed sjrl closed 1 year ago

sjrl commented 1 year ago

Who can help?

Hey @sgugger, I'm tagging you since this has to do with the trainer.

Reproduction

Using the run_qa.py script in the examples/pytorch/question-answering/ folder

python run_qa.py \
  --model_name_or_path "sjrhuschlee/flan-t5-base-squad2" \
  --dataset_name squad_v2 \
  --output_dir "tmp/eval_squad_v2/" \
  --version_2_with_negative True \
  --max_seq_length 512 \
  --doc_stride 128 \
  --do_eval \
  --per_device_eval_batch_size 24 \
  --tf32 True \
  --dataloader_num_workers 6 \
  --preprocessing_num_workers 6 \
  --bf16_full_eval \
  --eval_accumulation_steps 2 \
  --overwrite_output_dir False

I found that the metrics calculated when using eval_accumulation_steps are not always correct. When not using eval_accumulation_steps with the above script, I get the expected metrics. However, I needed eval_accumulation_steps to evaluate the flan-t5 models with the above parameters on my system; otherwise GPU memory usage would fluctuate between 4 and 8 GB, which could cause an OOM.

I believe I found the cause of the inconsistency in the metrics. Specifically, this line https://github.com/huggingface/transformers/blob/a074a5d34d6411fb00e83a2ed30acf23d8c976b5/src/transformers/trainer.py#L3150 does not cover the edge case where the total number of evaluation batches is not exactly divisible by eval_accumulation_steps. For example, if eval_accumulation_steps = 2 and the total number of batches is 613, then only the last batch is used when calculating all_preds.
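The arithmetic behind the edge case can be checked with a toy simulation of the modulo condition (this mirrors only the flush check, not the actual Trainer code):

```python
# Toy simulation of the flush condition from the Trainer's evaluation loop:
# with the modulo check alone, a flush fires only on step counts that are
# exact multiples of eval_accumulation_steps.
eval_accumulation_steps = 2
total_batches = 613

flushed = [step + 1 for step in range(total_batches)
           if (step + 1) % eval_accumulation_steps == 0]

print(flushed[-1])                   # 612: the last flush inside the loop
print(total_batches - flushed[-1])   # 1: the final batch is never flushed by the check
```

So with 613 batches the 613th never satisfies the condition, leaving its predictions to whatever handling happens after the loop.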

I was able to partially fix this problem by adding a new variable called total_steps and updating the if statement:

logger.info(f"***** Running {description} *****")
if has_length(dataloader):
    total_steps = len(dataloader)
    logger.info(f"  Num examples = {self.num_examples(dataloader)}")
else:
    total_steps = None
    logger.info("  Num examples: Unknown")

...

  if args.eval_accumulation_steps is not None and (
          (step + 1) % args.eval_accumulation_steps == 0 or (step + 1) == total_steps
  ):

However, this will still be a problem for dataloaders that don't have a defined length.
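One way around the unsized-dataloader case (a sketch with illustrative names, not the Trainer's actual fix) is to do a final flush after the loop, so the remainder survives regardless of whether len() is known or the batch count divides evenly:

```python
# Sketch: buffer per-batch outputs, flush every N steps, and flush once
# more after the loop. The final flush makes the logic correct for any
# batch count and for dataloaders without a defined length.
def accumulate_eval(batches, eval_accumulation_steps=2):
    all_preds, buffer = [], []
    for step, batch in enumerate(batches):
        buffer.append(batch)
        if (step + 1) % eval_accumulation_steps == 0:
            all_preds.extend(buffer)
            buffer.clear()
    # catch the remainder when len(batches) % eval_accumulation_steps != 0
    all_preds.extend(buffer)
    return all_preds

print(len(accumulate_eval(range(613))))  # 613: no batch is dropped
```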

Expected behavior

Using eval_accumulation_steps should work in every case even when the number of batches is not divisible by eval_accumulation_steps.

muellerzr commented 1 year ago

Thanks for the report! I'll look into a solution for this today

muellerzr commented 1 year ago

@sjrl could you quickly verify that installing transformers via pip install git+https://github.com/huggingface/transformers@fix-eval-accum-steps solves this for you? Thanks!

sjrl commented 1 year ago

Hey @muellerzr, thanks for the quick fix! And my apologies, I actually can't seem to reproduce the error on my end, but I did check that your change also works.

VeryLazyBoy commented 1 year ago

@muellerzr Sorry for disturbing you. I noticed this PR's change

- if args.eval_accumulation_steps is not None and (step + 1) % args.eval_accumulation_steps == 0:
+ if args.eval_accumulation_steps is not None and self.accelerator.sync_gradients:

breaks the behaviour of evaluation accumulation as described in https://github.com/huggingface/transformers/pull/25819. And in the latest v4.33.1, it has been partially changed back to

- if args.eval_accumulation_steps is not None and self.accelerator.sync_gradients:
+ if args.eval_accumulation_steps is not None and (step + 1) % args.eval_accumulation_steps == 0 and self.accelerator.sync_gradients:

May I ask what the purpose of introducing the self.accelerator.sync_gradients check in the evaluation loop is? In certain cases, self.accelerator.sync_gradients is set to False during training, which then prevents accumulation in the evaluation loop.
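A toy model of the combined condition illustrates the concern (assumption, per the description above: sync_gradients was left False by a training step and is never reset before evaluation):

```python
# Toy model of the combined check from v4.33.1. If sync_gradients stays
# False (e.g. carried over from a gradient-accumulation training step),
# the AND makes the condition unsatisfiable, so no flush ever happens.
eval_accumulation_steps = 2
sync_gradients = False  # stale state inherited from training

flushes = [step + 1 for step in range(10)
           if (step + 1) % eval_accumulation_steps == 0 and sync_gradients]
print(flushes)  # []: accumulation never moves tensors off the device
```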