sinaahmadi / klpt

The Kurdish Language Processing Toolkit
https://sinaahmadi.github.io/klpt/

morphology for kurmanji #3

Closed ftyers closed 3 years ago

ftyers commented 3 years ago

The apertium project has a morphological analyser for Kurmanji: https://github.com/apertium/apertium-kmr

You could include it to get morphological analysis for Kurmanji. :)

It recognises 342870 forms and you can get a full form list using the lt-expand tool:

$ lt-expand apertium-kmr.kmr.dix  | wc -l
342870

sinaahmadi commented 3 years ago

Thanks, Francis. I was aware of it and will definitely include it. Thanks for your kind attention 😊

ftyers commented 3 years ago

If you would like help using the Apertium tools, please feel free to get in contact with us, either here or on IRC or on our mailing list. We know that our documentation isn't the best in the world!

I'm also happy to do some data conversion if it's more convenient for you.

sinaahmadi commented 3 years ago

So, my initial idea was to update the current Hunspell system for Kurmanji, which is outdated. A while ago, I was asked by the Open Office community to update it.

I think the best idea would be to convert the Apertium data into a Hunspell affix file: two birds with one stone! Otherwise, a wrapper that integrates Apertium directly into KLPT should do the job.

ftyers commented 3 years ago

Hmm, I think either of those methods could work. If I remember correctly, @flammie has some code for this. There is an apertium-python package, but it's very alpha, and you might be better off just writing a parser for the output of lt-expand.
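As a rough illustration of the parser idea, the Python sketch below assumes that each lt-expand line has the shape surface:lemma<tag1><tag2>..., optionally with :>: or :<: in place of the plain colon for direction-restricted entries. The names Entry and parse_expand_line and the sample line are hypothetical and not part of Apertium or KLPT. Surface forms that themselves contain a colon are not handled.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    surface: str
    lemma: str
    tags: Tuple[str, ...]
    direction: Optional[str]  # "LR", "RL", or None when unrestricted


TAG_RE = re.compile(r"<([^<>]+)>")


def parse_expand_line(line: str) -> Optional[Entry]:
    """Parse one line of (assumed) lt-expand output into an Entry."""
    line = line.rstrip("\n")
    direction = None
    if ":>:" in line:
        surface, analysis = line.split(":>:", 1)
        direction = "LR"
    elif ":<:" in line:
        surface, analysis = line.split(":<:", 1)
        direction = "RL"
    elif ":" in line:
        surface, analysis = line.split(":", 1)
    else:
        return None  # blank or unrecognised line
    # The lemma is everything before the first tag.
    first_tag = analysis.find("<")
    lemma = analysis if first_tag == -1 else analysis[:first_tag]
    tags = tuple(TAG_RE.findall(analysis))
    return Entry(surface, lemma, tags, direction)


# Illustrative line only, modelled on the tag names seen in this thread.
sample = "bibêje:gotin<vblex><tv><imp><p2><sg>"
print(parse_expand_line(sample))
```

Piping the real lt-expand output through such a function line by line would give a full-form list that could then be converted to whatever format KLPT or Hunspell needs.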

ftyers commented 3 years ago

Another option is simply to add support for ATT-format files, which describe transducers in the following format:

$ lt-print kmr.automorf.bin  | head
0   1   y   y   0.000000    
0   2   ê   ê   0.000000    
0   2   e   ê   0.000000    
0   3   b   b   0.000000    
0   4   d   d   0.000000    
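
To show how such a file could be used for analysis, here is a minimal Python sketch. It assumes tab-separated columns (source state, target state, input symbol, output symbol, optional weight), final states listed on shorter lines, state 0 as the start state, and a single transducer section. The function names load_att and analyze, the epsilon spellings, and the toy transducer are all assumptions for illustration; weights are ignored, and this is not the code from any Apertium or KLPT release.

```python
from collections import defaultdict

# Spellings of the empty symbol that different tools may use (assumption).
EPSILON = {"ε", "@0@", "@_EPSILON_SYMBOL_@", ""}


def load_att(lines):
    """Read ATT-format lines into (transitions, finals)."""
    transitions = defaultdict(list)  # state -> [(in_sym, out_sym, target)]
    finals = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4:
            src, dst, in_sym, out_sym = fields[:4]
            transitions[src].append((in_sym, out_sym, dst))
        elif fields[0].strip():
            # A line with only a state (and maybe a weight) marks a final state.
            finals.add(fields[0].strip())
    return transitions, finals


def analyze(word, transitions, finals, start="0", max_output=200):
    """Return every output string the transducer produces for `word`."""
    results = set()
    seen = set()
    stack = [(start, 0, "")]  # (state, position in word, output so far)
    while stack:
        item = stack.pop()
        if item in seen:
            continue
        seen.add(item)
        state, pos, out = item
        if len(out) > max_output:
            continue  # guard against epsilon loops that keep emitting output
        if pos == len(word) and state in finals:
            results.add(out)
        for in_sym, out_sym, dst in transitions.get(state, ()):
            emitted = "" if out_sym in EPSILON else out_sym
            if in_sym in EPSILON:
                stack.append((dst, pos, out + emitted))
            elif word.startswith(in_sym, pos):
                stack.append((dst, pos + len(in_sym), out + emitted))
    return sorted(results)


# Toy transducer (hypothetical): maps "ba" to "ba<n>".
TOY_ATT = [
    "0\t1\tb\tb\t0.000000\t",
    "1\t2\ta\ta\t0.000000\t",
    "2\t3\tε\t<n>\t0.000000\t",
    "3\t0.000000",
]
toy_transitions, toy_finals = load_att(TOY_ATT)
print(analyze("ba", toy_transitions, toy_finals))  # ['ba<n>']
```

A depth-first search over (state, position, output) triples is the simplest way to collect all analyses of an ambiguous form; it favours clarity over speed.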

ftyers commented 3 years ago

I made a basic PR in #5. The code is fairly pedagogical, so it isn't super optimised. Måns Huldén has some code for processing ATT files too; you can find it here.

Btw, I had the idea after reading your release notes. As you might be able to tell, I am a huge fan of automata as well! :D

sinaahmadi commented 3 years ago

Yeah 😁 Automata are really fun!

Thank you very much again, Francis. This was so quick and efficient. Your contribution should be available in the next release and can be used in exactly the same way as for Sorani:

from klpt.stem import Stem
morph_analyzer = Stem("Kurmanji", "Latin")
print(morph_analyzer.analyze("bibêje"))
[{'base': 'gotin', 'description': 'vblex_tv_prs_p3_sg', 'pos': '', 'terminal_suffix': '', 'formation': ''}, {'base': 'gotin', 'description': 'vblex_tv_imp_p2_sg', 'pos': '', 'terminal_suffix': '', 'formation': ''}, {'base': 'gotin', 'description': 'vblex_tv_fut_p3_sg', 'pos': '', 'terminal_suffix': '', 'formation': ''}]

There are some delicate details that I'll take care of later, particularly structuring the output of the ATT analyzer. Here you can see how it is integrated in the stem module.
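
One possible way to structure that output, sketched here purely as an illustration: split the underscore-joined description string into a list of tags. The helper name structure_analysis and the output keys are hypothetical and do not reflect how KLPT actually structures its results; a tag that itself contains an underscore would be split incorrectly.

```python
def structure_analysis(analysis):
    """Turn one analysis dict into a lemma plus a list of tags."""
    description = analysis.get("description", "")
    tags = description.split("_") if description else []
    return {"lemma": analysis.get("base", ""), "tags": tags}


# One of the analyses shown above for the Kurmanji example.
sample = {
    "base": "gotin",
    "description": "vblex_tv_imp_p2_sg",
    "pos": "",
    "terminal_suffix": "",
    "formation": "",
}
print(structure_analysis(sample))
# {'lemma': 'gotin', 'tags': ['vblex', 'tv', 'imp', 'p2', 'sg']}
```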

You should also appear in the contributors section in the README soon ;-)