What metrics are you looking at, and how much are they different?
I am guessing that the discrepancy comes from the difference between validation_f1 and best_validation_f1?
During training, the metrics show the scores of the current checkpoint (e.g., validation_f1) and of the best checkpoint so far (e.g., best_validation_f1). After training, the best checkpoint is saved, so the scores from the evaluate command should match the best scores seen during training.
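(A minimal sketch of how to compare the two sets of numbers directly, assuming AllenNLP's default metrics.json written to the serialization directory by allennlp train and the --output-file written by allennlp evaluate, with paths taken from the commands quoted at the bottom of this thread; the exact metric key names depend on the model.)

```python
import json
from pathlib import Path

# Paths from the train/evaluate commands in this thread; adjust to your setup.
train_metrics = json.loads(Path("results/ner/luke-large/metrics.json").read_text())
eval_metrics = json.loads(Path("results/ner/luke-large/metrics_valid.json").read_text())

# `allennlp train` records the best checkpoint's validation scores under best_validation_* keys.
for key in sorted(train_metrics):
    if key.startswith("best_validation_"):
        print(f"train  {key} = {train_metrics[key]}")

# `allennlp evaluate` writes the metrics of the saved (best) checkpoint to --output-file.
for key in sorted(eval_metrics):
    print(f"eval   {key} = {eval_metrics[key]}")
```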
I am checking against best_validation_f1 and the validation scores after the model has finished training. The F1, recall, and precision numbers all differ by 1-2 points. Are the validation batch sizes during training and evaluation different, which could lead to erroneous averaging?
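(As an aside on the batch-averaging question: a toy sketch showing that averaging F1 over batches is not the same as pooling the counts and computing F1 once. This only illustrates the concern above; it is not a claim about how AllenNLP's span-F1 metric is implemented.)

```python
# Toy example: mean of per-batch F1 vs. F1 over pooled counts.
def f1(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical (tp, fp, fn) counts for two validation batches.
batches = [(8, 1, 1), (2, 3, 3)]

mean_of_batch_f1 = sum(f1(*b) for b in batches) / len(batches)
pooled_f1 = f1(*(sum(counts) for counts in zip(*batches)))

print(mean_of_batch_f1, pooled_f1)  # ~0.644 vs. ~0.714: the two disagree
```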
The problem has been reproduced on my side (94.7 F1 from train and 97.2 F1 from evaluate for the same validation data with luke-large).
Hmm... the only discrepancy I am aware of between the train and evaluate commands is that train uses automatic mixed precision while evaluate does not. This might be the reason? Small discrepancies in floating-point operations would be accumulated through the deep layers of a large model, resulting in a significant score difference.
Still, I feel the difference in the scores is too large 🤔 I will look into this further.
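(A toy illustration of that point, using plain NumPy rather than the actual model: the same matrix product computed in float16 vs. float32 already differs slightly, and such differences can compound across many layers.)

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 1024)).astype(np.float32)
w = rng.standard_normal((1024, 1024)).astype(np.float32)

out_fp32 = x @ w
# Same product with the inputs rounded to half precision, as a stand-in for mixed precision.
out_fp16 = (x.astype(np.float16) @ w.astype(np.float16)).astype(np.float32)

print("max abs difference:", np.abs(out_fp32 - out_fp16).max())
```

One direct way to test the AMP hypothesis would be to re-run training with the override from the command below flipped to '{"trainer.use_amp": false}' and check whether the train/evaluate gap disappears.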
A few updates. I tried again with allennlp evaluate but observed the same results. I'll keep investigating the problem.
Thanks a lot for looking into the issue. Any thoughts as to which one is correct though?
I think the score from evaluate should be right. We can reproduce the score reported in the original paper with examples/ner/evaluate_transformers_checkpoint.py, which uses the same metric code and works similarly to the allennlp evaluate command. Also, examples/ner/evaluate_transformers_checkpoint.py gives a 97.2 F1 score on the validation split.
I am afraid it will take me some time to find the cause of the score discrepancy in the allennlp train command, but I'm sure you can proceed trusting the results from allennlp evaluate.
That's quite helpful :) Thanks @Ryou0634, appreciate the help!
Hello, when fine-tuning the LUKE NER models on the CoNLL-2003 dataset, the validation scores during training differ from those produced by evaluation. Command used to train:
allennlp train examples/ner/configs/transformers_luke_with_entity_aware_attention.jsonnet -s results/ner/luke-large --include-package examples -o '{"trainer.cuda_device": 0, "trainer.use_amp": true}'
Command used to evaluate:
allennlp evaluate results/ner/luke-large /data/ner_conll/en/valid.txt --include-package examples --output-file results/ner/luke-large/metrics_valid.json --cuda-device 0