Awesome! I may have gotten mixed up with the various versions. I've used this one myself: https://huggingface.co/datasets/tner/ontonotes5 I'll dig into yours to get a better understanding, as I couldn't quickly explain the difference in precision.
Thanks for the heads up!
Tom Aarsen
It seems that your dataset has 75,187 train sentences, 9,603 dev sentences and 9,479 test sentences, compared to 59.9k train, 8.53k validation and 8.26k test sentences in the tner/ontonotes5 version. I'm afraid this also differs from what I reported in my thesis, which must have been based on yet another version.
I do see matching sentences between the two versions, so they at least partially share a source. Perhaps one is de-duplicated.
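For what it's worth, I'm comparing the split sizes like this (a quick sketch using the datasets library):

from datasets import load_dataset

# Quick check of the split sizes in the version I used previously.
tner = load_dataset("tner/ontonotes5")
print({split: len(data) for split, data in tner.items()})
# roughly: train 59.9k, validation 8.53k, test 8.26k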
You can further delete the sentences whose doc_key == 'pt/xx'. These 17,555 sentences have no NER annotations. I remember that I got similar results without the pt data.
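For example, assuming the jsonlines format from the PL-Marker preprocessing (one document per line, with a doc_key field), the filtering could look roughly like this (the file names are just placeholders):

import json

# Drop the pt/* documents, which carry no NER annotations.
# Sketch only: "train.jsonl" is a placeholder for the preprocessed data file.
with open("train.jsonl") as f_in, open("train.filtered.jsonl", "w") as f_out:
    for line in f_in:
        doc = json.loads(line)
        if not doc["doc_key"].startswith("pt/"):
            f_out.write(line)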
By the way, I have uploaded conll.py to Google Drive.
When removing the pt/xx files, the dataset seems to exactly match https://huggingface.co/datasets/tner/ontonotes5 in number of samples. However, with your variant I can use document-level context, i.e. add previous and next sentences as context. I'll run some quick experiments with this.
@YeDeming I've trained a quick model using your data and ontonotesv5.py, but with the model initialization modified like so:
model_name = "roberta-large"
model = SpanMarkerModel.from_pretrained(
    model_name,
    labels=labels,
    # SpanMarker hyperparameters:
    model_max_length=256,
    marker_max_length=128,
    entity_max_length=10,
+   max_prev_context=2,
+   max_next_context=2,
)
This adds a maximum of two sentences as context on either side, both during training and inference. My experiments with CoNLL03 showed that including as much context as possible is best, but using just 2 is a bit quicker to train.
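Conceptually, the context selection is just a window over the sentences of each document, something like this (an illustration of the idea only, not the library's actual implementation):

def with_document_context(sentences, index, max_prev=2, max_next=2):
    # `sentences` is the list of sentences from one document; return the
    # target sentence plus up to `max_prev`/`max_next` neighbouring
    # sentences attached as extra context.
    prev_context = sentences[max(0, index - max_prev):index]
    next_context = sentences[index + 1:index + 1 + max_next]
    return prev_context, sentences[index], next_context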
My first model trained in this manner reached 91.796 F1 (91.629 Precision & 91.963 Recall). As expected, using this document-level context improves the performance over my previous results (91.35 F1, 90.96 Precision, 91.75 Recall). If I manage to publish this as a paper, I'll update the findings to include experiments with document-level context.
These findings indicate that SpanMarker is also on par with PL-Marker on OntoNotes (which reaches 91.9±0.1 F1), especially if I rerun these experiments with "as much context as possible" settings. OntoNotes was the only dataset with a notable difference in performance, so this is very exciting indeed. I think I can now confidently say that SpanMarker reaches PL-Marker-level performance, while providing a library that practitioners can use to solve their own NER problems. I appreciate your assistance on this!
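For those practitioners, using such a trained model is then just a few lines (the checkpoint name here is hypothetical; substitute your own):

from span_marker import SpanMarkerModel

# Hypothetical checkpoint name; replace with your own trained model.
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-large-ontonotes5")
entities = model.predict("Deming uploaded the preprocessing script to Google Drive.")
print(entities)  # a list of dicts with the span text, label and confidence score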
Glad to hear about your results. Let me know if there is anything I can help with!
Deming
I'll close this for now. Thanks for your assistance!
Hi Tom,
Sorry for missing your email. I noticed that you got slightly worse results on the OntoNotes dataset.
There are several versions of OntoNotes. In PL-Marker, I preprocessed CoNLL-2012 using https://drive.google.com/file/d/1cFVb5thXNCXZkz99E8w4Xg9JB9E_viAa/view?usp=sharing
Best, Deming