huggingface / transfer-learning-conv-ai

🦄 State-of-the-Art Conversational AI with Transfer Learning

Questions about ppl when using gpt2 #63

Open · ssxy00 opened this issue 4 years ago

Hi! I ran into some problems when running the ConvAI2 evaluation scripts:

I first trained a model from OpenAI GPT. I increased the number of gradient accumulation steps because I only have one GPU.

python train.py --model_checkpoint /path/to/pretrained/gpt \
--gradient_accumulation_steps=32 --lm_coef=2.0 --max_history=2 \
--n_epochs=1 --num_candidates=4 --personality_permutations=2 \
--train_batch_size=2 --valid_batch_size=2
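
For context, raising gradient_accumulation_steps compensates for the small per-card batch: with train_batch_size=2 and gradient_accumulation_steps=32, each optimizer update effectively aggregates 2 × 32 = 64 samples. A minimal, self-contained PyTorch sketch of that accumulation pattern (not the repo's actual train.py loop; the model and data below are stand-ins) is:

import torch
from torch import nn

accumulation_steps = 32
model = nn.Linear(10, 1)                             # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

optimizer.zero_grad()
for step in range(accumulation_steps * 4):           # 4 optimizer updates in total
    x, y = torch.randn(2, 10), torch.randn(2, 1)     # micro-batch of size 2, like train_batch_size=2
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accumulation_steps).backward()           # scale so the accumulated gradient averages over 32 steps
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                             # one parameter update per 32 micro-batches
        optimizer.zero_grad()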

This gives the following ConvAI2 evaluation results:

Final Hits@1: 0.761
FINAL F1: 0.1659
FINAL PPL: 20.7

Then I tried to train from GPT2-small with the same config:

python train.py --model_checkpoint /path/to/pretrained/gpt2 \
--gradient_accumulation_steps=32 --lm_coef=2.0 --max_history=2 \
--n_epochs=1 --num_candidates=4 --personality_permutations=2 \
--train_batch_size=2 --valid_batch_size=2

and the evaluation results are:

Final Hits@1: 0.737
FINAL F1: 0.1643
FINAL PPL: 178.9
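
For scale: perplexity is exp of the mean per-token cross-entropy, so 20.7 versus 178.9 is a gap of roughly two nats per token, much larger than the Hits@1 and F1 differences would suggest:

import math

print(math.log(20.7))    # ~3.03 nats per token for the OpenAI GPT run
print(math.log(178.9))   # ~5.19 nats per token for the GPT-2 run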

The command I used to run convai_evaluation.py is:

python convai_evaluation.py --eval_type ppl --model_checkpoint /path/to/finetuned/model

The ppl of GPT2 is strangely high.
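
As a rough sanity check on the checkpoint itself (separate from convai_evaluation.py), one can compute a plain language-modeling perplexity with the transformers API. This is only a sketch: the path and sample sentence are placeholders, and it ignores the persona/history formatting and special tokens this repo adds, so it will not reproduce the ConvAI2 PPL, but it can flag a broken checkpoint:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_path = "/path/to/finetuned/model"        # placeholder path to the fine-tuned checkpoint
tokenizer = GPT2Tokenizer.from_pretrained(model_path)
model = GPT2LMHeadModel.from_pretrained(model_path).eval()

text = "i like to ski . my wife does not like me anymore ."   # sample PersonaChat-style sentence
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss   # mean cross-entropy over tokens
print(torch.exp(loss).item())                                  # perplexity = exp(loss)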

Is there anything that needs to be modified when testing fine-tuned GPT-2 with convai_evaluation.py?

I'm also curious about the best test results and hyperparameters you got when you fine-tuned from GPT-2. Thank you!

seyos11 commented 2 years ago

Did you find out how to achieve better results? I have the same problem with GPT-2, which leads to a final ppl of 133.