google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Apache License 2.0

Wrong evaluation result on SQuAD 2.0 #267

Closed (YuxuanJiang1 closed this issue 2 years ago)

YuxuanJiang1 commented 2 years ago

After fine-tuning on SQuAD 2.0 and evaluating on the whole dev set, I got an F1 score of 79%. However, when I split the complete dev set S into two parts, A and B, and evaluate the same model on each subset, the F1 score is 48% for A and 0.8% for B. Since S = A + B, there must be something wrong with these results. Could somebody help me figure it out?
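For concreteness, here is a minimal sketch of such a split (assuming the standard dev-v2.0.json layout; the output file names are just placeholders). Because the official SQuAD 2.0 metric averages F1 per question, the full-set score should equal the question-weighted average of the two subset scores, i.e. F1_S = (n_A * F1_A + n_B * F1_B) / (n_A + n_B), so 48% and 0.8% can never combine to 79%:

```python
# Sketch: split dev-v2.0.json into two halves A and B so each half can be
# scored separately with the official SQuAD 2.0 evaluation script.
# File names dev-A.json / dev-B.json are hypothetical placeholders.
import json

with open("dev-v2.0.json") as f:
    dev = json.load(f)

# Split at the article level; each article keeps its paragraphs and QAs intact.
articles = dev["data"]
half = len(articles) // 2
for name, part in [("dev-A.json", articles[:half]),
                   ("dev-B.json", articles[half:])]:
    with open(name, "w") as f:
        json.dump({"version": dev["version"], "data": part}, f)
```

Scoring dev-A.json and dev-B.json with the same predictions file and checking the weighted average against the full-set F1 would show whether the subset scores themselves are wrong (e.g. a prediction/example ID mismatch) or the full-set number is.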