Open joprice opened 1 month ago
After reading through the `TokenClassificationPipeline` code a bit, it seems `aggregation_strategy="simple"` solves this, resulting in a single item in the result set for the verb, with an `entity_group` field instead of `entity`.
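To make the difference concrete, here is a sketch of the two result shapes described above. The field names (`entity` vs `entity_group`) come from the pipeline's documented output; the values are illustrative, not real model output.

```python
# Without aggregation, each sub-token gets its own result item with an
# `entity` field (illustrative values, not real model output):
without_aggregation = [
    {"entity": "VERB", "word": "fini"},
    {"entity": "VERB", "word": "ssions"},
]

# With aggregation_strategy="simple", adjacent sub-tokens sharing a label
# are merged into one item carrying an `entity_group` field instead:
with_simple = [
    {"entity_group": "VERB", "word": "finissions"},
]
```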
It looks like this strategy relies on tags carrying "I-" and "B-" prefixes: https://github.com/huggingface/transformers/blob/c85510f958e6955d88ea1bafb4f320074bfbd0c1/src/transformers/pipelines/token_classification.py#L550. I'm not familiar enough to know whether I should expect them to appear in this kind of model. However, without detecting token boundaries, adjacent nouns and adjectives will be grouped into single tokens. Also, other implementations might lag behind in supporting strategies like this, e.g. https://github.com/xenova/transformers.js/issues/633.
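As a rough illustration (a minimal sketch, not the actual transformers implementation), the grouping driven by those prefixes could look like this: an "I-" tag continues the previous group when the labels match, while a "B-" tag (or any other tag) starts a new one.

```python
# Minimal sketch of BIO-style grouping. `tagged` is a list of
# (sub_token, tag) pairs such as ("fini", "B-VERB"), ("ssions", "I-VERB").
def group_entities(tagged):
    groups = []
    for token, tag in tagged:
        label = tag.split("-", 1)[-1]  # strip the "B-"/"I-" prefix
        if tag.startswith("I-") and groups and groups[-1][1] == label:
            # "I-" continues the current group: concatenate the sub-token.
            groups[-1] = (groups[-1][0] + token, label)
        else:
            # "B-" (or an unprefixed tag) starts a new group.
            groups.append((token, label))
    return groups

print(group_entities([("fini", "B-VERB"), ("ssions", "I-VERB")]))
# → [('finissions', 'VERB')]
```

This also shows why the prefixes matter for the boundary problem above: two adjacent words both tagged plain "NOUN" (no prefixes) would be indistinguishable from one word split into sub-tokens, whereas "B-NOUN" followed by "B-NOUN" keeps them as separate groups.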
My first hunch is that the solution is to modify the model's tokenizer to add the extra token prefixes so that token groups are merged correctly, but I'm not sure whether there's actually an earlier issue, something like lemmatization, where the verb is being incorrectly split into the root 'fini' and its suffix.
I just found this article https://medium.com/thecyphy/training-custom-ner-model-using-flair-df1f9ea9c762, which clarifies the use of the prefixes in NER tagging; it now makes sense that the flair model, which uses a sequence tagger, can handle this case.
When using a model like `qanastek/pos-french-camembert`, a verb such as `finissions` results in multiple tokens with VERB entities, like `["fini" VERB, "ssions" VERB]`. This does not happen with the flair-based model, but unfortunately I can't figure out how to export that one to ONNX, so I'm unable to integrate it currently.