sinaahmadi / klpt

The Kurdish Language Processing Toolkit
https://sinaahmadi.github.io/klpt/

initial implementation of analysis using ATT files #5

Closed ftyers closed 3 years ago

ftyers commented 3 years ago

This PR adds support for morphological analysis using ATT files.

$ cat test.py 
from klpt import analysis
a = analysis.Analysis("Kurmanji", "Latin")
print(a.analyse('dixwî'))
print(a.analyse('dengdanekê'))
print(a.analyse('xêzikine'))
$ python3 test.py 
('dixwî', [[]])
('dengdanekê', [[('@0@dengdan<n><f><sg><con><ind>', 12669), ('@0@dengdan<n><f><sg><obl><ind>', 12669)]])
('xêzikine', [[('@0@xêzik<n><f><pl><con><ind>', 12669)]])
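
For readers unfamiliar with ATT files: they are tab-separated listings of a finite-state transducer, where a transition line holds source state, target state, input symbol, output symbol and an optional weight, and a final-state line holds a state number and an optional weight. The `@0@` in the output above is the symbol conventionally used for epsilon in such dumps. The sketch below is not the code from this PR; it is a minimal illustration of lookup against an ATT listing, assuming single-character input symbols, surface forms on the input side, state 0 as the start state, and no epsilon cycles. The toy transducer and the names `parse_att` and `analyse` are made up for the example.

```python
from collections import defaultdict

EPSILON = "@0@"  # conventional epsilon symbol in ATT dumps

# Toy transducer (invented for illustration): "cat" -> cat<n><sg>, "cats" -> cat<n><pl>
TOY_ATT = "\n".join([
    "0\t1\tc\tc",
    "1\t2\ta\ta",
    "2\t3\tt\tt",
    "3\t4\ts\t<n>",
    "4\t6\t@0@\t<pl>",
    "3\t5\t@0@\t<n>",
    "5\t6\t@0@\t<sg>",
    "6",
])


def parse_att(text):
    """Parse ATT text into (transitions, finals)."""
    transitions = defaultdict(list)  # state -> [(input, output, target)]
    finals = set()
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 4:
            # Transition line; a fifth field (weight) is ignored here.
            src, dst, isym, osym = fields[:4]
            transitions[int(src)].append((isym, osym, int(dst)))
        elif fields[0].strip():
            # Final-state line, optionally followed by a weight.
            finals.add(int(fields[0]))
    return transitions, finals


def analyse(word, transitions, finals, start=0):
    """Collect every output string the transducer produces for `word`."""
    results = set()
    stack = [(start, 0, "")]  # (state, position in word, output so far)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(word) and state in finals:
            results.add(out)
        for isym, osym, dst in transitions.get(state, ()):
            emitted = "" if osym == EPSILON else osym  # never emit epsilon
            if isym == EPSILON:
                stack.append((dst, pos, out + emitted))
            elif pos < len(word) and word[pos] == isym:
                stack.append((dst, pos + 1, out + emitted))
    return sorted(results)


transitions, finals = parse_att(TOY_ATT)
print(analyse("cats", transitions, finals))
print(analyse("cat", transitions, finals))
```

Dropping epsilon on the output side, as `emitted` does here, is one way to keep markers such as `@0@` out of the returned analyses.
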

There are still a few bugs in the implementation, e.g. dixwî should get an analysis rather than an empty result:

$ echo dixwî | apertium -d . kmr-morph | tok.sh 
^dixwî/xwarin<vblex><tv><pri><p2><sg>$

But it's a good start. I'm happy to do some cleanup if you would like.
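
The Apertium output quoted above is in the Apertium stream format, where each unit is written as `^surface/reading1/reading2$` and tags appear in angle brackets. For comparing that reference output against what the toolkit returns, a small parser along the following lines could help. It is a sketch only: the function name `parse_stream` is hypothetical, and escaped characters and multiword readings are not handled.

```python
import re

UNIT = re.compile(r"\^(.*?)\$")   # one lexical unit: ^ ... $
TAG = re.compile(r"<([^<>]+)>")   # one tag: <...>


def parse_stream(text):
    """Split Apertium stream-format text into (surface, [(lemma, tags)])."""
    units = []
    for body in UNIT.findall(text):
        surface, *readings = body.split("/")
        analyses = []
        for reading in readings:
            if reading.startswith("*"):
                # A leading * marks a word the analyser does not know.
                continue
            lemma = reading.split("<", 1)[0]
            analyses.append((lemma, TAG.findall(reading)))
        units.append((surface, analyses))
    return units


print(parse_stream("^dixwî/xwarin<vblex><tv><pri><p2><sg>$"))
```
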

sinaahmadi commented 3 years ago

Thanks so much, Francis, for this interesting extension.

I reckon we should leave morphological analysis to the stem module, as we already do for Sorani, and import your analysis module within stem. That way, users will follow the same procedure for both dialects.

Do you want to modify your code before I merge your request? I can also take care of it if you would like.
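
The layout proposed in this comment, a single stem-level entry point that delegates to a dialect-specific analyser, could look roughly like the sketch below. Everything in it is hypothetical: the class, its method name, the constructor arguments and the two stub back ends are placeholders written for illustration and are not the toolkit's actual API.

```python
from typing import Callable, Dict, List, Optional, Tuple

Analyser = Callable[[str], List[str]]


def sorani_stub(word: str) -> List[str]:
    """Placeholder for the existing Sorani analysis (hypothetical)."""
    return [f"{word}<sorani-stub>"]


def kurmanji_att_stub(word: str) -> List[str]:
    """Placeholder for the ATT-based Kurmanji analysis (hypothetical)."""
    return [f"{word}<kurmanji-att-stub>"]


DEFAULT_BACKENDS: Dict[Tuple[str, str], Analyser] = {
    ("Sorani", "Arabic"): sorani_stub,
    ("Kurmanji", "Latin"): kurmanji_att_stub,
}


class Stem:
    """One interface for every dialect; the back end is chosen at construction."""

    def __init__(self, dialect: str, script: str,
                 backends: Optional[Dict[Tuple[str, str], Analyser]] = None):
        table = DEFAULT_BACKENDS if backends is None else backends
        key = (dialect, script)
        if key not in table:
            raise ValueError(f"no analyser registered for {dialect}/{script}")
        self._analyse = table[key]

    def analyze(self, word: str) -> List[str]:
        # Callers use the same method regardless of dialect.
        return self._analyse(word)


print(Stem("Kurmanji", "Latin").analyze("dengdanekê"))
```

With a table like this, adding another dialect later means registering one more back end rather than changing what callers do.
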

ftyers commented 3 years ago

I also added a ckb mode using apertium-ckb, but I am fine with whichever layout. There are still a few bugs, but I wanted to start the PR to get early feedback. I'm happy to have it merged so you can try it out, but if you want me to fix some stuff first I can also do that.

Perhaps we could do it like this, actually: merge this and organise it however you would like (I am definitely not an idiomatic writer of Python!). Then I can open another PR to fix some of the algorithmic bugs.

sinaahmadi commented 3 years ago

Sounds good to me 🙂 Regarding ckb, the current analyzer works pretty well. I think we should only add the Kurmanji analyzer, as a start. I'll merge it and try it locally. Thanks again so very much!