tomaarsen / SpanMarkerNER

SpanMarker for Named Entity Recognition
https://tomaarsen.github.io/SpanMarkerNER/
Apache License 2.0

Note: (XLM-)RoBERTa-based SpanMarker models require text preprocessing #23

Closed tomaarsen closed 1 year ago

tomaarsen commented 1 year ago

Hello!

This is a heads up that (XLM-)RoBERTa-based SpanMarker models require text to be preprocessed to separate punctuation from words:

# ✅
model.predict("He plays J. Robert Oppenheimer , an American theoretical physicist .")
# ❌
model.predict("He plays J. Robert Oppenheimer, an American theoretical physicist.")

# You can also supply a list of words directly: ✅
model.predict(["He", "plays", "J.", "Robert", "Oppenheimer", ",", "an", "American", "theoretical", "physicist", "."])

This is a consequence of the RoBERTa tokenizer treating a comma attached to the preceding word and a comma preceded by a space as different tokens, and the SpanMarker model is only familiar with the space-separated variant.

Another alternative is to use the spaCy integration, which preprocesses the text into words for you!

The (m)BERT-based SpanMarker models do not require this preprocessing.
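
For contexts where neither the spaCy integration nor pre-split words are convenient, a minimal regex-based preprocessing sketch could look like the following (this helper is not part of SpanMarker, and the length-based heuristic for keeping abbreviations like "J." attached is an assumption):

```python
import re

def separate_punctuation(text: str) -> str:
    # Naive sketch: split off punctuation that follows at least two
    # word characters, so single-letter abbreviations like "J." stay attached.
    return re.sub(r"(?<=\w\w)([,.;:!?])(?=\s|$)", r" \1", text)

print(separate_punctuation(
    "He plays J. Robert Oppenheimer, an American theoretical physicist."
))
# → He plays J. Robert Oppenheimer , an American theoretical physicist .
```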

stefan-it commented 1 year ago

Hey @tomaarsen ,

I may have a possible alternative solution: in Flair we can also construct and predict sentences given by the user. For this tokenization problem we use the v1 version of segtok. You could split the input into sentences, but for tokenizing alone the word_tokenizer can be used:

https://github.com/fnl/segtok/blob/master/segtok/tokenizer.py#L210

I think this could easily be added in the Model Hub Inference logic:

https://github.com/huggingface/api-inference-community/blob/main/docker_images/span_marker/app/pipelines/token_classification.py#L35

So inputs could first be tokenized by word_tokenizer. I think segtok would be a great, more lightweight alternative to spaCy.

Another alternative: rather than implementing it only on the Model Hub side, maybe it could be implemented in model.predict directly 🤔

tomaarsen commented 1 year ago

I'll certainly consider this approach, whether with segtok, spaCy or NLTK. The spaCy version is already implemented.

By default, perhaps I can apply the tokenization only if Hello, there. tokenizes differently than Hello , there .?
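
That check could be sketched as follows (a hypothetical helper, assuming a Hugging Face tokenizer object):

```python
def needs_preprocessing(tokenizer) -> bool:
    # Hypothetical helper: compare token ids for the attached vs. the
    # space-separated punctuation variant; if they differ, the model
    # likely needs its input preprocessed into words first.
    raw = tokenizer("Hello, there.", add_special_tokens=False)["input_ids"]
    pre = tokenizer("Hello , there .", add_special_tokens=False)["input_ids"]
    return raw != pre
```

BERT-style WordPiece tokenizers split punctuation themselves, so both variants typically yield the same ids, while RoBERTa's byte-level BPE encodes them differently.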

tomaarsen commented 1 year ago

I've discovered that the issue only persisted for XLM-RoBERTa, and I've been able to tackle it in f2edd06072aac2110b63aa9a7f1c52e45d6c6710!