ljvmiranda921 / calamanCy

NLP pipelines for Tagalog using spaCy
MIT License

Lemmatizer #31

Open wadid opened 11 months ago

wadid commented 11 months ago

Hi, is there something like a lemmatizer? I have a couple of Tagalog sentences with translations and I am trying to lemmatize them (then sort the lemmas by frequency and use the result myself for language learning ;))

ljvmiranda921 commented 11 months ago

Hi @wadid, this is still in the works. For context, I plan to use spaCy's neural edit-tree lemmatizer for this. I am not sure about my timeline yet, perhaps late December. If you're in a rush, I suggest training your own lemmatizer for now.

Another option is a rules-based approach to lemmatization. However, that might require more research into the exact lemmatization rules for Tagalog.
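
To give a rough idea of what a rules-based pass could look like, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the affix lists, the `MIN_STEM` threshold, and the function names `strip_affixes` and `lemma_frequencies` are hypothetical, and the rules are far from a complete or validated description of Tagalog morphology (reduplication, for example, is not handled). It also includes the frequency sort mentioned in the question.

```python
from collections import Counter

# Hypothetical, deliberately tiny affix lists. They are illustrative only
# and do NOT cover Tagalog morphology (no reduplication, no circumfixes).
PREFIXES = ("nag", "mag", "na", "ma")
INFIXES = ("um", "in")
SUFFIXES = ("han", "hin", "an", "in")
MIN_STEM = 3  # assumed minimum length of what remains after stripping
VOWELS = "aeiou"


def strip_affixes(token: str) -> str:
    """Strip at most one prefix, one infix, and one suffix from a token."""
    word = token.lower()

    # 1. Prefix: drop the first matching prefix if enough of the word remains.
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break

    # 2. Infix: look right after an initial consonant (k-um-ain -> kain).
    for infix in INFIXES:
        if (
            len(word) - len(infix) >= MIN_STEM
            and word[0] not in VOWELS
            and word[1:1 + len(infix)] == infix
        ):
            word = word[0] + word[1 + len(infix):]
            break

    # 3. Suffix: drop the first matching suffix if enough of the word remains.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            word = word[:-len(suffix)]
            break

    return word


def lemma_frequencies(sentences, lemmatize=strip_affixes):
    """Count lemmas over whitespace-split sentences, most frequent first."""
    counts = Counter()
    for sentence in sentences:
        for raw in sentence.split():
            token = raw.strip(".,!?;:\"'()")
            if token:
                counts[lemmatize(token)] += 1
    return counts.most_common()


if __name__ == "__main__":
    for lemma, count in lemma_frequencies(["Kumain ako.", "Naglaro ako."]):
        print(lemma, count)
```

Naive rules like these over-strip easily: with the lists above, "maganda" loses "mag" and comes out as "anda" instead of "ganda". That is the kind of case that needs the extra research into the actual rules, or a trained lemmatizer passed in through the `lemmatize` argument.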

wadid commented 11 months ago

Do you know this project? https://github.com/crlwingen/TagalogStemmerPython It lists an accuracy rate of 94.12%. How good is that?

ljvmiranda921 commented 10 months ago

Hi, thanks for this! I think 94.12% accuracy is decent, since Tagalog lemmatization rules can be complicated because of the agglutinative nature of the language. Right now, I'm trying to port both approaches into calamanCy: a rules-based lemmatizer using that stemmer and a neural one using spaCy's edit-tree lemmatizer.