hipster-philology / pyrrha

A language-independent post-correction app for POS-tagging and lemmatization
https://pyrrha.huma-num.fr
MIT License

Improve agglutinated forms control #326

Open matgille opened 3 months ago

matgille commented 3 months ago

Is your feature request related to a problem? Please describe.

When a form is a contraction or an agglutination, the control lists are not applied properly: the lemma (and the POS, although the number of possible POS combinations is much lower) is always marked as unauthorized.

Example:

The form aunquel is the contraction of lemmas aunque and el. In our project it will be tagged as aunque+el. Even if both lemmas are in the list, an error will be raised, because aunque+el is not in the control list.

Describe the solution you'd like

As the delimiter for contractions is always the same, the engine should be able to split the analysis on that delimiter. The user would need to declare the delimiter somewhere (in the control list panel, I would say).

In the above example, aunque+el would be analyzed as two lemmas, aunque and el, each of them being compared to the control list. An error would be raised only if at least one of the lemmas is not in the list. A warning would tell the user that the analysis is invalid, and could indicate which lemma or POS is missing from the control list.
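
To make the proposal concrete, here is a minimal sketch of the check I have in mind. It is not taken from the Pyrrha code base: the function name `check_lemma`, its parameters, and the use of a plain Python set as the control list are all assumptions made for illustration.

```python
def check_lemma(lemma, allowed, delimiter=None):
    """Return the parts of `lemma` that are missing from `allowed`.

    Hypothetical helper, not part of Pyrrha. `allowed` stands in for the
    control list; `delimiter` is the user-declared contraction delimiter.
    An empty result means the annotation is authorized.
    """
    # Without a delimiter, the annotation is compared as a single unit,
    # which corresponds to the behavior described in this issue.
    parts = lemma.split(delimiter) if delimiter else [lemma]
    return [part for part in parts if part not in allowed]


allowed_lemmas = {"aunque", "el"}

# No delimiter declared: the whole string is looked up and rejected.
print(check_lemma("aunque+el", allowed_lemmas))

# Delimiter declared: each lemma is looked up separately.
print(check_lemma("aunque+el", allowed_lemmas, delimiter="+"))

# Only the missing part is reported, so the warning can name it.
print(check_lemma("aunque+lo", allowed_lemmas, delimiter="+"))
```

Returning the list of missing parts, rather than a boolean, is what would let the warning say which lemma is not in the control list. The same check could presumably be applied to POS tags with their own allowed set.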