sinaahmadi / klpt

The Kurdish Language Processing Toolkit
https://sinaahmadi.github.io/klpt/

-istan #12

Closed rojvv closed 3 years ago

rojvv commented 3 years ago

In preprocess.py, you say something like this:

...but to normalize the text in terms of the encoding and common writing rules.

But using the "îy" suffix as an "amraza diyarkirinê/ئامڕازی دانەپاڵ" on a word that ends in "î" has never been common in Kurmanji (Latin script), though it may be in Sorani (Arabic script). And you should have noticed this long ago.

So please, don't replace "iy" with "îy" and "istan" with "îstan" because they have never been common.

sinaahmadi commented 3 years ago

Hi @rojserbest, thanks for raising this issue. My reference for Kurmanji normalization is Rêbera Rastnivîsînê, and based on it the current replacement seems to make sense.

That said, I am well aware that Kurdish writing is yet to be standardized, and I am therefore open to any suggestions.

rojvv commented 3 years ago

Well, have you ever seen "Kurdîstan/کوردیستان" written anywhere?

rojvv commented 3 years ago

Oh my god, you got it wrong @sinaahmadi, it says not to write -îstan: (image)

rojvv commented 3 years ago

OK, I am changing the PR so that it leaves "îy" alone and only fixes "istan". That way it matches your reference.

rojvv commented 3 years ago

Done, so now it only fixes the -istan.
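To illustrate the behaviour described above, here is a minimal sketch of a rule that rewrites "-îstan" to "-istan" and leaves "îy" untouched. It is not KLPT's actual code: the function name `normalize_istan` and the pattern are hypothetical, and the sketch assumes lowercase suffixes and that any "îstan" preceded by a letter is the suffix in question.

```python
import re
import unicodedata

# Hypothetical rule, not taken from KLPT's preprocess.py:
# match "îstan" only when a word character precedes it, so inflected
# forms such as "Kurdîstanê" are covered as well.
ISTAN_RULE = re.compile(r"(?<=\w)îstan")


def normalize_istan(text: str) -> str:
    """Rewrite "-îstan" as "-istan"; "îy" sequences are left as they are."""
    # NFC composes "i" + combining circumflex into a single "î" so the
    # pattern sees one consistent form of the letter.
    text = unicodedata.normalize("NFC", text)
    return ISTAN_RULE.sub("istan", text)


print(normalize_istan("Kurdîstan"))  # Kurdistan
print(normalize_istan("azadîya"))    # azadîya (unchanged)
```

Uppercase text and words where "îstan" is not a suffix would need extra handling that this sketch does not attempt.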

sinaahmadi commented 3 years ago

Thanks. I will fix this in the upcoming version. Thanks for raising the issue.

rojvv commented 12 months ago

Hi @sinaahmadi, sorry to bother you. Did you know that many people, including but not limited to members of Komxebata Kurmancî, have complained about the book that is used as a reference for Kurmanji affixes, Komxebata Kurmancî -- Rêbera Rastnivîsînê? You will find a lot of these complaints here: https://zimannas.wordpress.com/category/rastnivisin/

I recommend starting with https://zimannas.wordpress.com/2020/02/15/bersivek-bo-erisen-bahoz-barani/.

Regarding “iy”:

Even Mîr Celadet, the founder of the Kurdish Latin alphabet, asserted that the letter “y” cannot come after the letter “î”, while Rêbera Rastnivîsînê says otherwise. “îy” is very uncommon even in sources that have a lot of other spelling mistakes. You will hardly find it anywhere.

I think it is better not to follow Rêbera Rastnivîsînê so that people are not misled. I am not enforcing it in any way, I just wanted to let you know since you were open to suggestions :)