As a developer, I want to lemmatize LRL (low-resource language) words by segmenting them into subwords (similar to the approach in this paper).
Motivation: better lemma groups -> fewer false positives -> higher precision in dictionary creation
Tasks
[ ] read the approach described in the paper
[ ] evaluate subword segmentation methods, e.g., Byte-Pair Encoding (BPE)
[ ] investigate whether the tokenizer can be used in a more fine-granular configuration
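To make the BPE task concrete, here is a minimal, self-contained sketch of the classic BPE procedure (learn merges from a word list, then apply them to segment unseen words). This is an illustration of the general technique, not the paper's exact method; corpus, merge count, and function names are placeholders.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the space-separated vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(words, num_merges):
    """Learn BPE merge operations from a plain word list."""
    vocab = {" ".join(w): f for w, f in Counter(words).items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

def segment(word, merges):
    """Apply learned merges in order to segment a new word."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Toy corpus; a real run would use the LRL word list.
merges = learn_bpe(["lower", "lowest", "newer", "newest"], 1)
print(segment("lowest", merges))
```

The resulting subword boundaries could then feed the lemma-grouping step: words sharing a segmented stem become candidates for the same lemma group.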