kermitt2 opened this issue 6 years ago
Hello Kamal,
Could you please report the accuracy for each class? Since CoNLL2003 is a class-imbalanced dataset, could you also clarify whether the average over all classes is a macro average or a micro average?
The accuracy for the dominant class (O/others) is good, but the accuracy for the remaining classes is not as strong, so could you say whether you did anything to correct for the class imbalance?
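For context, the choice of average matters a lot on an imbalanced tagset. Here is a minimal, token-level sketch (toy labels, purely illustrative; the official conlleval scoring is entity-level, not token-level) showing how macro and micro averages diverge when one class dominates:

```python
def per_class_f1(y_true, y_pred, labels):
    """Precision/recall/F1 computed per class from token-level counts."""
    f1s = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return f1s

# Toy, hypothetical labels: the dominant "O" class dwarfs the
# entity classes, loosely mimicking CoNLL2003's imbalance.
y_true = ["O"] * 8 + ["PER", "LOC"]
y_pred = ["O"] * 8 + ["O", "LOC"]   # the single PER token is missed

labels = ["O", "PER", "LOC"]
f1s = per_class_f1(y_true, y_pred, labels)

# Macro average: unweighted mean of per-class F1 -- the missed PER
# class drags it down regardless of how rare PER is.
macro = sum(f1s.values()) / len(f1s)

# Micro average: pool all token decisions first; the result is
# dominated by the frequent "O" class (and equals accuracy here).
correct = sum(t == p for t, p in zip(y_true, y_pred))
micro = correct / len(y_true)

print(f"macro={macro:.3f} micro={micro:.3f}")  # macro=0.647 micro=0.900
```

Note that the standard conlleval script reports a micro-averaged F1 computed over predicted entity spans (O is not scored as a class), which is yet another convention.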
Regards
Nice work, thanks!
Using GloVe embeddings as indicated and increasing the number of epochs to 70 without changing anything else, I obtained an f-score of 89.16 averaged over 10 runs (to account for the impact of random seeds); the best f-score was 89.7, the worst 88.5. So my results are well below the 90.9 reported in the README (raising the number of epochs to 80 made no difference).
If we consider that the results of (Chiu & Nichols, 2016) are reproducible, they report a 90.88 f-score with a model without lexical features (Table 6, emb + char + caps). One difference from your implementation, I think, is that they report results for models trained with the validation set included, which normally increases the f-score by +0.3 to +0.4, while you are not using the validation set for training.
Another point is that they report the f-score averaged over 10 runs.
Trying to reproduce (Chiu & Nichols, 2016), so far I have never been able to get above a 90.0 f-score with their architecture and hyper-parameters, so I got f-scores similar to your implementation's.