Hello! Thanks for your report.
tl;dr: You can avoid these issues by separating punctuation from the preceding word with a space, i.e. `Amsterdam, in` -> `Amsterdam , in`. I'll look into ways to avoid having to do this.

In more detail now:
As for your first comment: I've previously encountered issues where the model simply could not include the last token in an entity (#1), and even since fixing that, most of the models still tend not to mark the last token as an entity, even if it's an obvious one like `Paris`. I think this issue originates from the training data nearly always ending sentences with punctuation. For example, a model trained on the CoNLL03 dataset, where sentences do not always end with a dot, does not seem to have this issue: https://huggingface.co/tomaarsen/span-marker-xlm-roberta-large-conll03?text=This+is+James
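You can reproduce that comparison locally as well; a minimal sketch using the `span_marker` package and the CoNLL03 model linked above (the printed entities are illustrative, not guaranteed output):

```python
from span_marker import SpanMarkerModel

# Load the CoNLL03-trained model from the link above
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-xlm-roberta-large-conll03")

# Compare a sentence-final entity with and without trailing punctuation
print(model.predict("This is James"))
print(model.predict("This is James."))
```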
As for the second issue: The multilingual model was trained on English data exclusively, meaning that the multilingual behaviour must originate from the underlying XLM-RoBERTa-large encoder. I've just done some testing, and there is indeed some odd behaviour here: if punctuation is attached to a word, the model often fails to detect the entity, but if the punctuation is separated by a space, it works well.

This may be related to how `Amsterdam,` and `Amsterdam ,` tokenize differently. If the underlying encoder is only familiar with one of the two forms, then it makes sense that it doesn't really know how to deal with the other one. I think I should be able to resolve this with pre-processing of prediction sentences.
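You can inspect this tokenization difference directly with the encoder's tokenizer; a quick sketch, assuming `transformers` is installed (the exact token strings depend on the SentencePiece vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

# Attached vs. separated punctuation produce different token sequences
print(tokenizer.tokenize("Ik woon in Amsterdam, in Nederland."))
print(tokenizer.tokenize("Ik woon in Amsterdam , in Nederland ."))
```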
When using a model in practice, you can also provide a list of words, e.g. `["Ik", "woon", "in", "Amsterdam", ",", "in", "Nederland", "."]`. This should always work as intended - this is also how the model was trained.
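In code, that looks roughly like this (a sketch using `span_marker`; the model name is the one from the spaCy example further down):

```python
from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-xlm-roberta-base-fewnerd-fine-super")

# Pre-tokenized input: punctuation is already separated into its own words
words = ["Ik", "woon", "in", "Amsterdam", ",", "in", "Nederland", "."]
print(model.predict(words))
```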
I'll see if there are things I can do to improve the performance under these two circumstances.
Interesting! So, the pretrained models are then all intended for English NER, right? I'm going to try out cleaning/tokenising myself and then passing the processed input to the model. Thanks for the feedback 👍
The `xlm` models should work with other languages too, but I didn't explicitly train them on anything other than English. As mentioned, these models seem to suffer from poor performance when the punctuation is attached. I think you'll find that they perform well when the input is pre-tokenized (or rather, converted to words).
I just remembered: the spaCy integration automatically does this conversion to words.
```python
import spacy
from spacy import displacy

# Load the spaCy model and add the span_marker pipeline component
nlp = spacy.load("en_core_web_sm", exclude=["ner"])
nlp.add_pipe(
    "span_marker",
    config={"model": "tomaarsen/span-marker-xlm-roberta-base-fewnerd-fine-super"},
)

# Feed some text through the pipeline to get a spaCy Doc
text = "Ik woon in Amsterdam, in Nederland."
doc = nlp(text)

# And look at the entities
print([(entity, entity.label_) for entity in doc.ents])
"""
[(Amsterdam, 'location-GPE'), (Nederland, 'location-GPE')]
"""

# Optionally, visualize the entities in the browser
displacy.serve(doc, style="ent")
```
I've narrowed this down to the `roberta` and `xlm-roberta` models - they tokenize my training data differently than how real sentences are tokenized. So, for the RoBERTa-based models, you'd have to make sure that there are spaces before all punctuation. I'm trying to figure out if there's a convenient solution.
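Until there's a built-in fix, a simple pre-processing step can insert those spaces. A rough sketch: `separate_punctuation` is a hypothetical helper, not part of the library, and it won't handle every edge case (abbreviations, decimals, etc.):

```python
import re

def separate_punctuation(text: str) -> str:
    """Hypothetical helper: insert a space before punctuation attached to a word."""
    return re.sub(r"(\w)([.,!?;:])", r"\1 \2", text)

print(separate_punctuation("Ik woon in Amsterdam, in Nederland."))
# Ik woon in Amsterdam , in Nederland .
```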
I'll close this in favor of #23, as I've narrowed down the issue. For the RoBERTa-based models, you either have to 1) separate the punctuation yourself or 2) use the spaCy integration and let spaCy do the punctuation separation for you.
The BERT-based models function as expected.
I'm just trying out the pretrained models accompanying this repo via HF Spaces and I'm seeing some weird results.
Am I using the pretrained models wrong? Are they expecting different kinds of input?
I can elaborate and test more, but I figured I'd post this first. 😊
Thanks!