sinaahmadi / klpt

The Kurdish Language Processing Toolkit
https://sinaahmadi.github.io/klpt/
Other
93 stars 12 forks

Lemmatize before MWE tokenization #21

Closed cikay closed 3 months ago

cikay commented 3 months ago

The sentence should be lemmatized before MWE tokenization, because when a multi-word expression is a verb, it is conjugated according to its subject.

For example: "Em şermezar dikin"

With the current implementation, MWE tokenization does not recognize "şermezar dikin" in the sentence above, because the only available forms are "şermazar kirin" and "şermazarkirin". There is no "şermezar dikin" form, which makes sense. So the sentence should be lemmatized first.

sinaahmadi commented 3 months ago

Thanks for raising this issue. Although MWE tokenization currently precedes lemmatization, lemmatizing first is left to the user rather than handled by the pipeline.
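
For readers following along, here is a minimal sketch of what lemmatizing first on the user side could look like. It does not call KLPT. The lemma table, the MWE lexicon and all three function names are hypothetical stand-ins, and the toy lexicon entry is spelled to match the example sentence. The idea is to match MWEs against the lemmas while keeping the original surface forms in the output.

```python
# Toy stand-ins: neither table comes from KLPT; both are hypothetical.
TOY_LEMMAS = {"dikin": "kirin"}      # conjugated form -> infinitive
TOY_MWES = {("şermezar", "kirin")}   # MWE lexicon keyed by lemma sequence


def lemmatize(tokens):
    """Map each token to its lemma, falling back to the token itself."""
    return [TOY_LEMMAS.get(tok, tok) for tok in tokens]


def mwe_spans(tokens, lexicon):
    """Return sorted (start, end) index pairs where a lexicon entry matches."""
    spans = []
    for entry in lexicon:
        n = len(entry)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == entry:
                spans.append((i, i + n))
    return sorted(spans)


def mwe_tokenize(sentence, lexicon=TOY_MWES):
    """Match MWEs on lemmas, but keep the surface forms in the output."""
    surface = sentence.split()
    spans = mwe_spans(lemmatize(surface), lexicon)
    out, i = [], 0
    for start, end in spans:
        if start < i:  # skip matches that overlap an earlier one
            continue
        out.extend(surface[i:start])
        out.append(" ".join(surface[start:end]))
        i = end
    out.extend(surface[i:])
    return out


print(mwe_tokenize("Em şermezar dikin"))
```

Matching on the raw tokens instead of the lemmas finds nothing for this sentence, which is the behaviour described in the report. A real setup would replace the two toy tables with the toolkit's own lemmatizer and MWE lexicon.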