qanastek / ANTILLES

ANTILLES : An Open French Linguistically Enriched Part-of-Speech Corpus
https://hal.archives-ouvertes.fr/hal-03696042/document
Creative Commons Attribution Share Alike 4.0 International

verbs ending in `-issions` tokenized incorrectly #1

Open joprice opened 1 month ago

joprice commented 1 month ago

When using a model like qanastek/pos-french-camembert, a verb such as finissions results in multiple tokens with VERB entities, like ["fini" VERB, "ssions" VERB]. This does not happen with the flair-based model, but unfortunately I can't figure out how to export that one to ONNX, so I'm unable to integrate it currently.

joprice commented 1 month ago

After reading through the TokenClassificationPipeline code a bit, it seems aggregation_strategy="simple" solves this, resulting in a single item in the result set for the verb, with an entity_group field instead of entity.

joprice commented 1 month ago

It looks like this strategy relies on entity labels that carry "I-" and "B-" prefixes: https://github.com/huggingface/transformers/blob/c85510f958e6955d88ea1bafb4f320074bfbd0c1/src/transformers/pipelines/token_classification.py#L550. I'm not familiar enough to know whether I should expect them to appear in this kind of model. However, without them there is nothing to mark token boundaries, so adjacent nouns and adjectives will be grouped into single tokens. Also, other implementations might lag behind in supporting strategies like this, e.g. https://github.com/xenova/transformers.js/issues/633.
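
To make the grouping behaviour concrete, here is a minimal pure-Python sketch of how I understand the "simple" strategy to behave. It is not the transformers implementation. The function names, the dictionary layout, and the PRON/VERB/NOUN tags are illustrative assumptions, and the pieces are assumed to carry the SentencePiece word-start marker "▁" (U+2581).

```python
from typing import Dict, List, Tuple

WORD_START = "\u2581"  # SentencePiece word-start marker (assumed present on pieces)


def split_tag(label: str) -> Tuple[str, str]:
    """Split a label into (prefix, tag). Labels without B-/I- are treated as "I"."""
    if label.startswith("B-"):
        return "B", label[2:]
    if label.startswith("I-"):
        return "I", label[2:]
    return "I", label


def group_simple(pieces: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge adjacent pieces that share a tag, unless a piece is marked "B-"."""
    groups: List[Dict] = []
    for piece in pieces:
        prefix, tag = split_tag(piece["entity"])
        if groups and groups[-1]["entity_group"] == tag and prefix != "B":
            # Same tag and no explicit "begin" marker: extend the current group.
            groups[-1]["pieces"].append(piece["word"])
        else:
            groups.append({"entity_group": tag, "pieces": [piece["word"]]})
    return [
        {
            "entity_group": g["entity_group"],
            "word": "".join(g["pieces"]).replace(WORD_START, " ").strip(),
        }
        for g in groups
    ]


# Hypothetical model output for "nous finissions" (tags are illustrative).
verb_pieces = [
    {"word": "\u2581nous", "entity": "PRON"},
    {"word": "\u2581fini", "entity": "VERB"},
    {"word": "ssions", "entity": "VERB"},
]
print(group_simple(verb_pieces))

# Two separate words with the same tag are merged too, which is the downside.
noun_pieces = [
    {"word": "\u2581chat", "entity": "NOUN"},
    {"word": "\u2581chien", "entity": "NOUN"},
]
print(group_simple(noun_pieces))
```

With unprefixed labels the verb is merged as hoped, but the two nouns also collapse into one group. If the same nouns were labelled "B-NOUN", each would start its own group.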

My first hunch is that the solution is to modify the model's tokenizer to add the extra token prefixes so that token groups are merged correctly, but I'm not sure whether there's actually an earlier issue with something like lemmatization, where the verb is being incorrectly split into the root 'fini' and its suffix.
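
One way to picture what those prefixes would look like is to derive them after the fact from the subword pieces, rather than by changing the tokenizer. The sketch below is only an illustration of that idea and has not been tested against the model. It assumes each piece string still carries the SentencePiece word-start marker "▁" (U+2581); `add_bio_prefixes` and the tags are hypothetical.

```python
from typing import Dict, List

WORD_START = "\u2581"  # SentencePiece word-start marker (assumed present on pieces)


def add_bio_prefixes(pieces: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Relabel pieces: "B-" on the first piece of a word, "I-" on continuations."""
    relabelled = []
    for index, piece in enumerate(pieces):
        # The very first piece always begins a word, even without the marker.
        starts_word = index == 0 or piece["word"].startswith(WORD_START)
        prefix = "B-" if starts_word else "I-"
        relabelled.append({"word": piece["word"], "entity": prefix + piece["entity"]})
    return relabelled


# Hypothetical model output for "nous finissions" (tags are illustrative).
pieces = [
    {"word": "\u2581nous", "entity": "PRON"},
    {"word": "\u2581fini", "entity": "VERB"},
    {"word": "ssions", "entity": "VERB"},
]
for item in add_bio_prefixes(pieces):
    print(item["word"], item["entity"])
```

Labels in this shape mark where each word begins, so a grouping step that respects "B-" could merge 'fini' and 'ssions' while keeping adjacent words with the same tag apart.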

joprice commented 1 month ago

I just found this article, https://medium.com/thecyphy/training-custom-ner-model-using-flair-df1f9ea9c762, which clarifies the use of the prefixes in NER tagging. It makes sense now that the flair model, which uses a sequence tagger, can handle this case.