paulthemagno opened 4 years ago
I have fine-tuned a BERT-NER model, and in eval_result.txt I got these values: P=0.608764 R=0.588080 F=0.594982
In my understanding, these results come from the dev (validation) dataset, while on the test set I got:
processed 40982 tokens with 4577 phrases; found: 4645 phrases; correct: 4158.
accuracy:  98.22%; precision:  89.52%; recall:  90.85%; FB1:  90.18
              LOC: precision:  92.54%; recall:  92.54%; FB1:  92.54  1394
             MISC: precision:  81.21%; recall:  82.31%; FB1:  81.76  676
              ORG: precision:  84.54%; recall:  88.56%; FB1:  86.51  1255
              PER: precision:  95.30%; recall:  95.45%; FB1:  95.38  1320
I'd like to understand the mismatch with respect to the standard CoNLL evaluation script.
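For context on what the CoNLL script measures: conlleval scores at the entity (phrase) level, not the token level, so a prediction only counts as correct if the span boundaries and the type both match exactly. Below is a minimal sketch of that counting logic, assuming BIO tags; the function names are hypothetical and this is not the actual conlleval.pl or BERT-NER evaluation code.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence."""
    entities = []
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # "O" sentinel flushes the last span
        # Close the open span on B-, O, or an I- whose type changes.
        if tag == "O" or tag.startswith("B-") or (
            tag.startswith("I-") and etype != tag[2:]
        ):
            if start is not None:
                entities.append((start, i - 1, etype))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # conlleval-style leniency: a stray I- opens a new span.
            start, etype = i, tag[2:]
    return set(entities)

def entity_prf(gold_tags, pred_tags):
    """Entity-level precision, recall, F1 over exact span+type matches."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "B-ORG"]
print(entity_prf(gold, pred))  # (0.5, 0.5, 0.5): one of two entities matched
```

Note that the token-level accuracy in this example would be 75%, which is why token accuracy (98.22% above) can sit far above the phrase-level F1; if eval_result.txt is produced by a different scorer (e.g. token-level, or with a different label scheme), the numbers will not be comparable.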
@paulthemagno Is the accuracy problem solved?