huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

KeyError: 'eval_loss' when fine-tuning gpt-2 with run_clm.py #8789

Closed Potomac closed 3 years ago

Potomac commented 4 years ago

Environment info

Who can help

albert, bert, GPT2, XLM: @LysandreJik Trainer: @sgugger

Information

Model I am using (Bert, XLNet ...): GPT2

The problem arises when using:

To reproduce

Steps to reproduce the behavior:

  1. Use run_clm.py file from transformers/examples/language-modeling/
  2. Try to fine-tune gpt-2 model, with your own train file and your own validation file
  3. When you add the "--do_eval" option, an error occurs when the evaluation step is reached:
  File "run_clm.py", line 353, in <module>
    main()
  File "run_clm.py", line 333, in main
    perplexity = math.exp(eval_output["eval_loss"])
KeyError: 'eval_loss'

When I print the contents of eval_output, there is just one key: "epoch".
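For reference, a minimal defensive sketch (not the actual run_clm.py code) of how the failing line could be guarded so that a missing "eval_loss" key yields a clear message instead of a bare KeyError. `perplexity_from_metrics` is a hypothetical helper name:

```python
import math

def perplexity_from_metrics(eval_output):
    # Hypothetical helper: compute perplexity only if the Trainer
    # actually returned an eval_loss; otherwise fail with a hint.
    if "eval_loss" not in eval_output:
        raise ValueError(
            "no eval_loss in metrics %s; the evaluation set may have "
            "produced zero batches" % sorted(eval_output)
        )
    return math.exp(eval_output["eval_loss"])

print(perplexity_from_metrics({"eval_loss": 0.0, "epoch": 3.0}))  # 1.0
```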

The command I use to run run_clm.py:

python run_clm.py \
    --model_name_or_path gpt2 \
    --train_file train.txt \
    --validation_file dev.txt \
    --do_train \
    --do_eval \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --output_dir results/test-clm

Expected behavior

The evaluation step should run without problems.

sgugger commented 4 years ago

This is weird, as the script is tested for evaluation. What does your dev.txt file look like?

Potomac commented 4 years ago

dev.txt contains English text, one sentence per line. The PC I use has two graphics cards, so run_clm.py uses both for training. Perhaps the bug only occurs when two or more GPUs are used for training?

sgugger commented 4 years ago

The script is tested on two GPUs as well as one. Are you sure this file contains enough text to produce at least one batch during evaluation? That is the only reason I can think of for no eval_loss being returned.

Potomac commented 4 years ago

The dev.txt file contains 46 lines, the train file contains 268263 lines.

The specifications of the PC I use:

sgugger commented 4 years ago

As I said, the dev file may be too short to provide at least one batch and return a loss. You should try a longer dev file.
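This matches how run_clm.py prepares data: it concatenates all tokenized text and keeps only whole blocks of block_size tokens (1024 for GPT-2), dropping the shorter remainder. A rough sketch of that arithmetic, where the ~15 tokens-per-line figure is an illustrative assumption, not a measured value:

```python
def num_blocks(total_tokens, block_size=1024):
    # run_clm.py-style grouping: only whole blocks of block_size
    # tokens survive; any shorter tail is dropped.
    return total_tokens // block_size

# Assuming roughly 15 tokens per line (illustrative only):
print(num_blocks(46 * 15))      # 0 -> empty eval set, hence no eval_loss
print(num_blocks(268263 * 15))  # thousands of training blocks
```

With 46 short lines the total token count can easily fall below a single block, so the evaluation dataloader is empty and the Trainer has no loss to report.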

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale and closed because it has not had recent activity. Thank you for your contributions.

If you think this still needs to be addressed please comment on this thread.