Hello! Thanks for your report.
tl;dr: You can avoid these issues by separating punctuation from the preceding word with a space, i.e. `Amsterdam, in` -> `Amsterdam , in`. I'll look into ways to avoid having to do this.

In more detail now:
As for your first comment: I've previously encountered issues where the model simply could not include the last token in an entity (#1), and even since fixing that, most of the models still tend not to mark the last token as an entity, even if it's an obvious one like `Paris`. I think this issue originates from the training data nearly always ending sentences with punctuation. For example, a model trained on the CoNLL03 dataset, where sentences do not always end with a dot, does not seem to have this issue: https://huggingface.co/tomaarsen/span-marker-xlm-roberta-large-conll03?text=This+is+James
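You can reproduce that comparison locally as well; a minimal sketch using the `span_marker` package and the CoNLL03 model linked above (the printed entities are illustrative, not guaranteed output):

```python
from span_marker import SpanMarkerModel

# Load the CoNLL03-trained model from the link above
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-xlm-roberta-large-conll03")

# Compare a sentence-final entity with and without trailing punctuation
print(model.predict("This is James"))
print(model.predict("This is James."))
```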
As for the second issue: The multilingual model was trained on English data exclusively, meaning that the multilingual behaviour must originate from the underlying XLM-RoBERTa-large encoder. I've just done some testing, and there is indeed some odd behaviour here: if punctuation is attached to a word, the model often fails to detect the entity, but if the punctuation is separated by a space, it works well.

This may be related to how `Amsterdam,` and `Amsterdam ,` tokenize differently. If the underlying encoder is only familiar with one of the two forms, then it makes sense that it doesn't really know how to deal with the other one. I think I should be able to resolve this with pre-processing of prediction sentences.
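You can inspect this tokenization difference directly with the encoder's tokenizer; a quick sketch, assuming `transformers` is installed (the exact token strings depend on the SentencePiece vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

# Attached vs. separated punctuation produce different token sequences
print(tokenizer.tokenize("Ik woon in Amsterdam, in Nederland."))
print(tokenizer.tokenize("Ik woon in Amsterdam , in Nederland ."))
```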
When using a model in practice, you can also provide a list of words, e.g. `["Ik", "woon", "in", "Amsterdam", ",", "in", "Nederland", "."]`. This should always work as intended - this is also how the model was trained.
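In code, that looks roughly like this (a sketch using `span_marker`; the model name is the one from the spaCy example further down):

```python
from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-xlm-roberta-base-fewnerd-fine-super")

# Pre-tokenized input: punctuation is already separated into its own words
words = ["Ik", "woon", "in", "Amsterdam", ",", "in", "Nederland", "."]
print(model.predict(words))
```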
I'll see if there are things I can do to improve the performance under these two circumstances.
Interesting! So, the pretrained models are then all intended for English NER, right? I'm going to try out cleaning/tokenising myself and then passing the processed input to the model. Thanks for the feedback 👍
The `xlm` models should work with other languages too, but I didn't explicitly train them on anything other than English. As mentioned, these models seem to suffer from poor performance when the punctuation is attached. I think you'll find that they perform well when the input is pre-tokenized (or rather, converted to words).
I just remembered: the spaCy integration automatically does this conversion to words.
```python
import spacy
from spacy import displacy

# Load the spaCy model and add the span_marker pipeline component
nlp = spacy.load("en_core_web_sm", exclude=["ner"])
nlp.add_pipe(
    "span_marker",
    config={"model": "tomaarsen/span-marker-xlm-roberta-base-fewnerd-fine-super"},
)

# Feed some text through the pipeline to get a spaCy Doc
text = "Ik woon in Amsterdam, in Nederland."
doc = nlp(text)

# And look at the entities
print([(entity, entity.label_) for entity in doc.ents])
"""
[(Amsterdam, 'location-GPE'), (Nederland, 'location-GPE')]
"""

# Optionally, visualize the entities in the browser
displacy.serve(doc, style="ent")
```
I've narrowed this down to the `roberta` and `xlm-roberta` models - they tokenize my training data differently than how real sentences are tokenized. So, for the RoBERTa-based models, you'd have to make sure that there are spaces before all punctuation. I'm trying to figure out if there's a convenient solution.
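Until there's a built-in fix, a simple pre-processing step can insert those spaces. A rough sketch: `separate_punctuation` is a hypothetical helper, not part of the library, and it won't handle every edge case (abbreviations, decimals, etc.):

```python
import re

def separate_punctuation(text: str) -> str:
    """Hypothetical helper: insert a space before punctuation attached to a word."""
    return re.sub(r"(\w)([.,!?;:])", r"\1 \2", text)

print(separate_punctuation("Ik woon in Amsterdam, in Nederland."))
# Ik woon in Amsterdam , in Nederland .
```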
I'll close this in favor of #23, as I've narrowed down the issue. For the RoBERTa-based models, you either have to 1) separate the punctuation yourself or 2) use the spaCy integration and let spaCy do the punctuation separation for you.
The BERT-based models function as expected.
I'm just trying out the pretrained models accompanying this repo via HF Spaces and I'm seeing some weird results.
Am I using the pretrained models wrong? Are they expecting different kinds of input?
I can elaborate and test more, but I figured I'd post this first. 😊
Thanks!