Closed: Ramlinbird closed this issue 4 years ago
I just realized that only the conll-2012-development.v4.tar.gz file is used to generate the conll files, while the original OntoNotes release has been updated to v5.0, which is really confusing. My experiments were run on conll files shared by others, so there may be some differences. I have just obtained the original OntoNotes data and will try again. Sorry for the disturbance.
@Ramlinbird have you reached any results using OntoNotes v5.0 that you can share?
I get these results on OntoNotes v5:
version: 8.01 /project/error_analysis/e2e/conll-2012/scorer/v8.01/lib/CorScorer.pm
====== TOTALS =======
Identification of Mentions: Recall: (16651 / 19764) 84.24% Precision: (16651 / 19351) 86.04% F1: 85.13%
--------------------------------------------------------------------------
Coreference: Recall: (12114 / 15232) 79.52% Precision: (12114 / 14882) 81.4% F1: 80.45%
--------------------------------------------------------------------------
Official result for bcub
version: 8.01 /project/error_analysis/e2e/conll-2012/scorer/v8.01/lib/CorScorer.pm
====== TOTALS =======
Identification of Mentions: Recall: (16651 / 19764) 84.24% Precision: (16651 / 19351) 86.04% F1: 85.13%
--------------------------------------------------------------------------
Coreference: Recall: (13714.9966220446 / 19764) 69.39% Precision: (13964.2247782205 / 19351) 72.16% F1: 70.75%
--------------------------------------------------------------------------
Official result for ceafe
version: 8.01 /project/error_analysis/e2e/conll-2012/scorer/v8.01/lib/CorScorer.pm
====== TOTALS =======
Identification of Mentions: Recall: (16651 / 19764) 84.24% Precision: (16651 / 19351) 86.04% F1: 85.13%
--------------------------------------------------------------------------
Coreference: Recall: (3045.98370534764 / 4532) 67.21% Precision: (3045.98370534764 / 4469) 68.15% F1: 67.68%
--------------------------------------------------------------------------
Average F1 (conll): 72.96%
Average F1 (py): 72.96%
Average precision (py): 73.91%
Average recall (py): 72.04%
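As a sanity check on the averages above, here is a minimal sketch that recomputes each F1 from the raw counts in the scorer output and averages the three values. The function name `f1_percent` is made up for illustration, and the first block has no metric label in the pasted output, so treating it as the third metric in the average is an assumption. The result comes out close to the reported 72.96%; small differences in the last digit are expected because the scorer rounds or truncates each percentage before printing.

```python
def f1_percent(recall_num, recall_den, precision_num, precision_den):
    """Harmonic mean of precision and recall, as a percentage."""
    recall = recall_num / recall_den
    precision = precision_num / precision_den
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)


# Counts copied from the "Coreference:" lines of the scorer output above.
# The first block is unlabelled in the paste (assumption: it is the third
# metric that enters the average, next to bcub and ceafe).
first_block_f1 = f1_percent(12114, 15232, 12114, 14882)
bcub_f1 = f1_percent(13714.9966220446, 19764, 13964.2247782205, 19351)
ceafe_f1 = f1_percent(3045.98370534764, 4532, 3045.98370534764, 4469)

# The reported average is the unweighted mean of the three F1 scores.
conll_avg = (first_block_f1 + bcub_f1 + ceafe_f1) / 3

print(f"first block F1: {first_block_f1:.2f}")
print(f"bcub F1:        {bcub_f1:.2f}")
print(f"ceafe F1:       {ceafe_f1:.2f}")
print(f"average F1:     {conll_avg:.2f}")
```

Comparing this number against your own run is a quick way to confirm that two sets of results were averaged the same way before comparing them.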
Have you ever run experiments on the OntoNotes v5.0 dataset? I tried it without changing any training configuration except switching the data from v4.0 to v5.0 and setting lm_path to None. However, the best average F1 score on the development set is only 61 after 200k steps, at which point it plateaus. I hope to hear back from you, thanks a lot.