Closed tomaarsen closed 1 year ago
Hey @tomaarsen,
I may have a possible alternative solution: in Flair we can also construct and predict on sentences given by the user. For this tokenization problem we use the v1 version of segtok. You could split the input into sentences, but for just tokenizing, the `word_tokenizer` can be used:
https://github.com/fnl/segtok/blob/master/segtok/tokenizer.py#L210
I think this could easily be added in the Model Hub inference logic: inputs could first be tokenized by `word_tokenizer`. I think that segtok would be a great alternative and more lightweight compared to spaCy.
Another alternative: instead of implementing it just on the Model Hub side, maybe it can be implemented in `model.predict` directly 🤔
I'll certainly consider this approach, whether with segtok, spaCy or NLTK. The spaCy version is already implemented.
By default, perhaps I can apply the tokenization only if `Hello, there.` tokenizes differently than `Hello , there .`?
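That check could be sketched roughly like this (the regex pre-tokenizer below is a hypothetical stand-in for segtok/spaCy/NLTK, not the actual SpanMarker implementation):

```python
import re

def pre_tokenize(text: str) -> list[str]:
    # Naive word tokenizer that splits punctuation off words;
    # illustrative stand-in for a real tokenizer like segtok.
    return re.findall(r"\w+|[^\w\s]", text)

def needs_pretokenization(text: str) -> bool:
    # Only apply the tokenization if it differs from a plain
    # whitespace split, i.e. punctuation is still attached to words.
    return pre_tokenize(text) != text.split()

print(needs_pretokenization("Hello, there."))    # -> True
print(needs_pretokenization("Hello , there ."))  # -> False
```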
I've discovered that the issue only persisted for XLM-RoBERTa, and I've been able to tackle it in f2edd06072aac2110b63aa9a7f1c52e45d6c6710!
Hello!
This is a heads up that (XLM-)RoBERTa-based SpanMarker models require text to be preprocessed to separate punctuation from words:
This is a consequence of the RoBERTa tokenizer distinguishing `,` and ` ,` (with a preceding space) as different tokens, and the SpanMarker model is only familiar with the ` ,` variant.
Another alternative is to use the spaCy integration, which preprocesses the text into words for you!
The (m)BERT-based SpanMarker models do not require this preprocessing.
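For anyone who needs a quick way to do that separation without spaCy, here is a rough sketch (the regex is illustrative only and won't handle every case, e.g. contractions or decimal numbers):

```python
import re

def separate_punctuation(text: str) -> str:
    # Split words and punctuation apart, then rejoin with single spaces,
    # e.g. "Hello, there." -> "Hello , there ."
    return " ".join(re.findall(r"\w+|[^\w\s]", text))

print(separate_punctuation("Hello, there."))  # -> Hello , there .
```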