ivysoftware opened 5 years ago
I ran into the same situation.
I forked another repository which uses a BiLSTM-CRF on top of the BERT model:
https://github.com/dsindex/BERT-BiLSTM-CRF-NER/blob/master/README.md
That module yields an F1 score of 0.95–0.96 on the dev set. However, after aligning the predicted output (on the test set) with the original test data and evaluating it with conlleval.pl (the official evaluation script), the final F-score is around 91.1–91.3. This is worse than the score reported in the paper (BERT-base, 92.4).
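For anyone trying to reproduce this, here is a minimal sketch (with a toy, made-up example, not real CoNLL data) of the "token gold pred" format that conlleval.pl expects: one token per line, a blank line between sentences, which is what the alignment step has to produce.

```python
# Toy illustration of preparing input for conlleval.pl.
# Data below is hypothetical; in practice the tokens/gold come from the
# original CoNLL-2003 test file and pred from the model's output.

sentences = [
    # (token, gold tag, predicted tag)
    [("EU", "B-ORG", "B-ORG"), ("rejects", "O", "O"), ("German", "B-MISC", "B-MISC")],
    [("Peter", "B-PER", "B-PER"), ("Blackburn", "I-PER", "O")],
]

lines = []
for sent in sentences:
    for token, gold, pred in sent:
        lines.append(f"{token} {gold} {pred}")
    lines.append("")  # blank line marks the sentence boundary

output = "\n".join(lines)
print(output)
```

The resulting file is then scored with `perl conlleval.pl < output.txt`. Misaligned tokens (e.g. from WordPiece subwords not being merged back) silently distort the span-level F-score, which may explain part of the gap between the dev metric and the conlleval number.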
I guess there were additional tricks in the parameter tuning.
The paper mentions that hyperparameters were tuned on the dev set. Note that the authors had quite a lot of incentive, and the means, to tune them well.
I got only 0.87 with batch-size=16.
Thanks for your timely work! When running on a GPU, CoNLL-2003 doesn't perform as well as your results or the paper's. I tried several times; the dev F-score wanders between 0.89 and 0.912. Does your work run on a TPU?