gurunathparasaram opened this issue 5 years ago
I don't think oversampling would downgrade the scores (though I haven't run these models on the BEA datasets yet). Such low precision combined with high recall may suggest that something is wrong with the pre/post-processing. How did you pre/post-process the data? CoNLL uses NLTK and BEA uses spaCy for tokenization, so the two may differ too much. Did you take a look at the corrections made by the system?
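For example, you can compare the two tokenizers on the same sentence (a quick sketch; `en_core_web_sm` stands for whichever English spaCy pipeline you have installed):

```python
import nltk
import spacy

nltk.download("punkt", quiet=True)   # NLTK tokenizer models
nlp = spacy.load("en_core_web_sm")   # any English spaCy pipeline

sent = "He didn't like the co-op's set-up."

nltk_tokens = nltk.word_tokenize(sent)
spacy_tokens = [t.text for t in nlp(sent)]

if nltk_tokens != spacy_tokens:
    # Hyphenated words, contractions, and punctuation are common divergence points.
    print("NLTK :", nltk_tokens)
    print("spaCy:", spacy_tokens)
```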
Another possibility is that the weight given to the LM is too high; it was grid-searched on CoNLL-2013.
We don't use re-ranking in this system. We only ensemble with a language model.
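To be concrete about the difference: the LM scores are folded into beam search at each step rather than used to re-rank a finished n-best list. A minimal sketch of the scoring step (the names here are illustrative, not this repo's actual code):

```python
import numpy as np

def combined_step_scores(ensemble_logprobs, lm_logprobs, lm_weight):
    """Next-token scores used to extend the beam at each decoding step.

    ensemble_logprobs: [vocab]-shaped log-probs averaged over the GEC models
    lm_logprobs:       [vocab]-shaped log-probs from the language model
    lm_weight:         the interpolation weight grid-searched on CoNLL-2013;
                       too large a value favors fluent rewrites over faithful edits
    """
    return ensemble_logprobs + lm_weight * lm_logprobs
```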
I performed spell-correction on the BEA source sentences using Jamspell before giving them to the models. I will take a look at the system outputs soon and also try decreasing the LM weight. Sorry for the confusion: I should have said ensemble+LM instead of reranking in my previous comment.
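For reference, the Jamspell pass was along these lines (a sketch; the model and file paths are placeholders):

```python
import jamspell

corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel("en.bin")  # pretrained Jamspell English model (placeholder path)

with open("bea_test.src") as fin, open("bea_test.spell.src", "w") as fout:
    for line in fin:
        # FixFragment corrects misspellings using the surrounding context
        fout.write(corrector.FixFragment(line.rstrip("\n")) + "\n")
```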
In the low-resource paper, you mention that NUCLE was oversampled 10 times for domain adaptation to the CoNLL-2014 test set.
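(I'm reading the oversampling as simply repeating NUCLE in the training data, roughly like this, with placeholder file names:)

```python
# Repeat NUCLE 10x in the training data (source side shown; same for targets).
with open("nucle.src") as f:
    nucle_src = f.readlines()
with open("train.src", "a") as out:
    for _ in range(10):
        out.writelines(nucle_src)
```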
I tried benchmarking the pre-trained models provided in this repo on the WI+LOCNESS test set. The single model gave an F-score of 34.15, whereas the ensemble of 4 models + reranking gave an F-score of 53.27. The ensemble produces fewer false positives than the single model, which leads to higher precision.
Metrics of the single model on the WI+LOCNESS test set:
Metrics of the ensemble on the WI+LOCNESS test set:
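Since BEA is scored with ERRANT's span-based F0.5, which weights precision twice as heavily as recall, the false-positive gap matters a lot. A quick check of that relationship (a sketch assuming the standard P/R/F_beta formulas):

```python
def prf(tp, fp, fn, beta=0.5):
    """Precision, recall, and F_beta from span counts (F0.5 is the BEA metric)."""
    p = tp / (tp + fp) if (tp + fp) else 1.0
    r = tp / (tp + fn) if (tp + fn) else 1.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if (p + r) else 0.0
    return p, r, f

# Same recall, different precision: extra false positives crater F0.5.
print(prf(tp=100, fp=50, fn=150))   # P=0.67, R=0.40 -> F0.5 ≈ 0.59
print(prf(tp=100, fp=220, fn=150))  # P=0.31, R=0.40 -> F0.5 ≈ 0.33
```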
Does oversampling the NUCLE data lead to the single model's precision dropping from 69-70 on the CoNLL-2014 test set to 31.3 on the WI+LOCNESS test set?
Thanks!