As a developer, I want to lemmatize LRL (low-resource language) words by segmenting them into subwords (similar to the approach in this paper).
Motivation: better lemma groups -> fewer false positives -> higher precision in dictionary creation
Tasks
[ ] read the approach described in the paper
[ ] evaluate subword segmentation methods, e.g., Byte-Pair Encoding (BPE)
[ ] investigate whether the tokenizer can be used in a more fine-granular configuration
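To make the BPE task concrete, here is a minimal, self-contained sketch of the classic BPE procedure (learn merges from a word list, then apply them to segment unseen words). This is an illustration of the general technique, not the paper's exact method; corpus, merge count, and function names are placeholders.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the space-separated vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(words, num_merges):
    """Learn BPE merge operations from a plain word list."""
    vocab = {" ".join(w): f for w, f in Counter(words).items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

def segment(word, merges):
    """Apply learned merges in order to segment a new word."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Toy corpus; a real run would use the LRL word list.
merges = learn_bpe(["lower", "lowest", "newer", "newest"], 1)
print(segment("lowest", merges))
```

The resulting subword boundaries could then feed the lemma-grouping step: words sharing a segmented stem become candidates for the same lemma group.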