studio-ousia / luke

LUKE -- Language Understanding with Knowledge-based Embeddings
Apache License 2.0

Discrepancy in validation score while training vs testing #155

Closed luffycodes closed 2 years ago

luffycodes commented 2 years ago

Hello, when fine-tuning the LUKE NER models on the CoNLL-2003 dataset, the validation scores reported during training differ from those obtained when evaluating the trained model.

Command used to train:

allennlp train examples/ner/configs/transformers_luke_with_entity_aware_attention.jsonnet -s results/ner/luke-large --include-package examples -o '{"trainer.cuda_device": 0, "trainer.use_amp": true}'

Command used to evaluate:

allennlp evaluate results/ner/luke-large /data/ner_conll/en/valid.txt --include-package examples --output-file results/ner/luke-large/metrics_valid.json --cuda 0

ryokan0123 commented 2 years ago

What metrics are you looking at, and how much are they different?

I am guessing that the discrepancy comes from the difference between validation_f1 and best_validation_f1? During training, the metrics show the scores of the current checkpoint (e.g., validation_f1) and of the best checkpoint so far (e.g., best_validation_f1). After training, the best checkpoint is saved, so the scores from the evaluate command should match the best scores seen during training.
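
For reference, one way to check this is to compare the metrics files written by the two commands: allennlp train writes metrics.json into the serialization directory (results/ner/luke-large in the commands above), and the evaluate command writes metrics_valid.json. A minimal sketch of the comparison, where the exact metric key names are assumptions rather than values taken from an actual run:

# Minimal sketch: compare training-time metrics with the standalone evaluation.
# File paths follow the commands in this issue; the key name in
# metrics_valid.json ("f1" below) is an assumption and may differ.
import json

with open("results/ner/luke-large/metrics.json") as f:        # written by allennlp train
    train_metrics = json.load(f)
with open("results/ner/luke-large/metrics_valid.json") as f:  # written by allennlp evaluate
    eval_metrics = json.load(f)

print("best_validation_f1 (train):", train_metrics.get("best_validation_f1"))
print("f1 (evaluate):", eval_metrics.get("f1"))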

luffycodes commented 2 years ago

I am checking best_validation_f1 against the validation scores obtained after the model has finished training.

The F1, recall, and precision numbers all differ by 1-2 points.

Are the validation batch sizes during training and evaluation different, which could lead to erroneous averaging?
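
As an aside on the averaging question: span F1 micro-averaged from true-positive/false-positive/false-negative counts accumulated over the whole validation set does not depend on batch size, whereas averaging per-batch F1 scores does. A toy illustration with made-up counts:

# Toy illustration with made-up per-batch counts (not from the actual run).
batches = [
    {"tp": 90, "fp": 5, "fn": 10},
    {"tp": 40, "fp": 1, "fn": 2},
    {"tp": 3, "fp": 2, "fn": 4},  # a small final batch skews a per-batch average
]

def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Micro-average: accumulate counts over the whole set, then compute F1 once.
micro = f1(sum(b["tp"] for b in batches),
           sum(b["fp"] for b in batches),
           sum(b["fn"] for b in batches))

# Naive alternative: compute F1 per batch, then average the scores.
per_batch_mean = sum(f1(**b) for b in batches) / len(batches)

print(micro, per_batch_mean)  # ~0.917 vs ~0.796 -- the two values differ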

ryokan0123 commented 2 years ago

The problem has been reproduced on my side (94.7 F1 from train and 97.2 F1 from evaluate on the same validation data with luke-large).

Hmm... the only difference I am aware of between the train and evaluate commands is that training uses automatic mixed precision (trainer.use_amp) while evaluation does not. This might be the reason: small discrepancies in floating-point operations can accumulate through the deep layers of a large model, resulting in a noticeable score difference.
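
A toy illustration of how reduced precision can accumulate over many operations, using a plain running sum rather than an actual forward pass:

# Toy demonstration of accumulated float16 vs float32 rounding error;
# this is not the model, just an illustration of the general effect.
import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal(100_000).astype(np.float32) * 1e-3

acc32 = np.float32(0.0)
acc16 = np.float16(0.0)
for v in values:
    acc32 = np.float32(acc32 + v)
    acc16 = np.float16(acc16 + np.float16(v))

print("float32 accumulation:", float(acc32))
print("float16 accumulation:", float(acc16))
print("absolute difference: ", abs(float(acc32) - float(acc16)))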

Still, the difference in the scores feels too large 🤔 I will look into this further.

ryokan0123 commented 2 years ago

A few updates.

I'll keep investigating the problem.

luffycodes commented 2 years ago

Thanks a lot for looking into the issue. Any thoughts on which score is the correct one, though?

ryokan0123 commented 2 years ago

I think the score from evaluate should be right. We can reproduce the score reported in the original paper with examples/ner/evaluate_transformers_checkpoint.py, which uses the same metric code and works similarly to the allennlp evaluate command.

Also, examples/ner/evaluate_transformers_checkpoint.py gives a 97.2 F1 score on the validation split.

I am afraid it will take me some time to find the cause of the score discrepancy in the allennlp train command, but you can proceed trusting the results from allennlp evaluate.
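
If you want an independent sanity check of the evaluate-style score, entity-level F1 can also be recomputed directly from gold and predicted BIO tags. The snippet below uses seqeval as a stand-in for the repository's own metric code, with made-up tag sequences, so it only illustrates the computation:

# Hypothetical sanity check: recompute entity-level F1 from BIO tag sequences.
# seqeval stands in for the repository's metric code; the tags are made up.
from seqeval.metrics import classification_report, f1_score

gold = [["B-PER", "I-PER", "O", "B-ORG"], ["O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"], ["O", "B-LOC"]]

print(f1_score(gold, pred))  # entity-level F1 over predicted vs gold spans
print(classification_report(gold, pred))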

luffycodes commented 2 years ago

That's quite helpful :) Thanks @Ryou0634, appreciate the help!