sm354 closed this issue 2 years ago
You should be getting ~75 F1. Are you sure you are using the right tokenization?
The transfer-en model is trained with XLM-R tokenization, while the English datasets and models (e.g. QBCoref, OntoNotes, ARRAU) use English BERT tokenization by default. To evaluate the transfer-en model on English documents, those documents first need to be tokenized following XLM-R tokenization. This should be doable with the minimize.py script by switching the tokenization model from bert to xlmr.
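To make the switch concrete, here is a minimal sketch of what the change amounts to. The helper and the mapping below are illustrative assumptions, not minimize.py's actual interface; the underlying point is that the transfer-en model's embeddings are tied to XLM-R's SentencePiece vocabulary, so documents must be sub-tokenized with the matching tokenizer before evaluation.

```python
# Illustrative sketch only: minimize.py's actual flag/function names may differ.
TOKENIZER_NAMES = {
    "bert": "bert-base-cased",   # default for the English datasets
    "xlmr": "xlm-roberta-base",  # what the transfer-en model expects
}

def tokenizer_name(model: str) -> str:
    """Map the tokenization choice ('bert' or 'xlmr') to a pretrained
    tokenizer name (hypothetical helper, for illustration)."""
    try:
        return TOKENIZER_NAMES[model]
    except KeyError:
        raise ValueError(f"unknown tokenization model: {model!r}")

# The chosen name would then be loaded with, e.g., Hugging Face transformers:
# tokenizer = AutoTokenizer.from_pretrained(tokenizer_name("xlmr"))
```

Tokenizing the English documents with bert-base-cased but evaluating with the transfer-en model means the model sees subtoken ids from the wrong vocabulary, which is consistent with the large score drop reported below.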
Thanks for clarifying the tokenization. I was using the default English BERT tokenization. After making the changes in minimize.py as you described and switching to XLM-R tokenization, I get 75.35 F1.
We found that evaluating the given transfer-en model on OntoNotes (EN) gives the following scores (precision / recall / F1):

| metric | P | R | F1 |
|---|---|---|---|
| muc | 0.6939 | 0.2912 | 0.4103 |
| b_cubed | 0.5365 | 0.1779 | 0.2672 |
| ceafe | 0.4446 | 0.1548 | 0.2296 |
| em | 0.0393 | 0.0137 | 0.0203 |
| mentions | 0.7901 | 0.3187 | 0.4542 |
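Assuming the three numbers per metric are precision, recall, and F1 (recomputing F1 from P and R matches the reported values to rounding), the CoNLL-style average F1 (mean of the muc, b_cubed, and ceafe F1 scores) works out to roughly 0.30, which is the number to compare against spb_on_512:

```python
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Precision/recall reported above for the three CoNLL metrics
scores = {
    "muc":     (0.6939, 0.2912),
    "b_cubed": (0.5365, 0.1779),
    "ceafe":   (0.4446, 0.1548),
}

for name, (p, r) in scores.items():
    print(f"{name}: F1 = {f1(p, r):.4f}")

# CoNLL-style average F1 over muc, b_cubed, ceafe (~0.30 here)
avg = sum(f1(p, r) for p, r in scores.values()) / len(scores)
print(f"avg F1 = {avg:.4f}")
```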
This seems quite low compared to spb_on_512 (79.4 avg. F1), which we have been able to reproduce. Could you please share the numbers you obtained, or any insight into these scores?