Awesome! I may have gotten mixed up with the various versions. I've used this one myself: https://huggingface.co/datasets/tner/ontonotes5 I'll dig into yours to get a better understanding, as I couldn't quickly explain the difference in precision.
Thanks for the heads up!
Tom Aarsen
It seems that your dataset has 75,187 train sentences, 9,603 dev sentences and 9,479 test sentences, compared to 59.9k train, 8.53k validation and 8.26k test sentences in the tner/ontonotes5 version. I'm afraid this also differs from what I reported in my thesis, which must have been based on yet another version.
I do see matching sentences between the two versions, so they at least partially share a source. Perhaps one is de-duplicated.
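For what it's worth, I'm comparing the split sizes like this (a quick sketch using the datasets library):

from datasets import load_dataset

# Quick check of the split sizes in the version I used previously.
tner = load_dataset("tner/ontonotes5")
print({split: len(data) for split, data in tner.items()})
# roughly: train 59.9k, validation 8.53k, test 8.26k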
You can further delete the sentences whose doc_key == 'pt/xx'. These 17,555 sentences have no NER annotations. I remember that I got similar results without the pt data.
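For example, assuming the jsonlines format from the PL-Marker preprocessing (one document per line, with a doc_key field), the filtering could look roughly like this (the file names are just placeholders):

import json

# Drop the pt/* documents, which carry no NER annotations.
# Sketch only: "train.jsonl" is a placeholder for the preprocessed data file.
with open("train.jsonl") as f_in, open("train.filtered.jsonl", "w") as f_out:
    for line in f_in:
        doc = json.loads(line)
        if not doc["doc_key"].startswith("pt/"):
            f_out.write(line)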
By the way, I have uploaded conll.py to Google Drive.
When removing the pt/xx files, the dataset seems to exactly match https://huggingface.co/datasets/tner/ontonotes5 in number of samples. However, with your variant I can use document-level context, i.e. add previous and next sentences as context. I'll run some quick experiments with this.
@YeDeming I've trained a quick model using your data and ontonotesv5.py, but with the model initialization modified like so:
model_name = "roberta-large"
model = SpanMarkerModel.from_pretrained(
    model_name,
    labels=labels,
    # SpanMarker hyperparameters:
    model_max_length=256,
    marker_max_length=128,
    entity_max_length=10,
+   max_prev_context=2,
+   max_next_context=2,
)
This adds a maximum of two sentences as context on either side, both during training and inference. My experiments with CoNLL03 showed that including as much context as possible is best, but using just 2 is a bit quicker to train.
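Conceptually, the context selection is just a window over the sentences of each document, something like this (an illustration of the idea only, not the library's actual implementation):

def with_document_context(sentences, index, max_prev=2, max_next=2):
    # `sentences` is the list of sentences from one document; return the
    # target sentence plus up to `max_prev`/`max_next` neighbouring
    # sentences attached as extra context.
    prev_context = sentences[max(0, index - max_prev):index]
    next_context = sentences[index + 1:index + 1 + max_next]
    return prev_context, sentences[index], next_context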
My first model trained in this manner reached 91.796 F1 (91.629 Precision & 91.963 Recall). As expected, using this document-level context improves the performance over my previous results (91.35 F1, 90.96 Precision, 91.75 Recall). If I manage to publish this as a paper, I'll update the findings to include experiments with document-level context.
These findings indicate that SpanMarker is also on par with PL-Marker on OntoNotes (which reaches 91.9±0.1 F1), especially if I rerun these experiments with "as much context as possible" settings. OntoNotes was the only dataset with a notable difference in performance, so this is very exciting indeed. I think I can now confidently say that SpanMarker reaches PL-Marker-level performance, while providing a library that practitioners can use to solve their own NER problems. I appreciate your assistance on this!
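For those practitioners, using such a trained model is then just a few lines (the checkpoint name here is hypothetical; substitute your own):

from span_marker import SpanMarkerModel

# Hypothetical checkpoint name; replace with your own trained model.
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-large-ontonotes5")
entities = model.predict("Deming uploaded the preprocessing script to Google Drive.")
print(entities)  # a list of dicts with the span text, label and confidence score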
Glad to hear about your results. Let me know if there is anything I can help with!
Deming
I'll close this for now. Thanks for your assistance!
Hi Tom,
Sorry for missing your email. I noticed that you got slightly worse results on the OntoNotes dataset.
There are several versions of OntoNotes. In PL-Marker, I preprocessed CoNLL-2012 using https://drive.google.com/file/d/1cFVb5thXNCXZkz99E8w4Xg9JB9E_viAa/view?usp=sharing
Best, Deming