linuxscout / pyarabic

GNU General Public License v3.0

normalize_alef converts YEHLIKE to Alef #39

Status: Closed (thexaib closed this issue 4 years ago)

thexaib commented 4 years ago

Hi, I am an Urdu speaker. I am trying to reduce Arabic words to a simplified (normalized) form, and I ran into the behavior described in the title: normalize_alef converts a final YEH-like letter (ى) to Alef. Please correct me if I am wrong; this may be intended or correct behavior. If it is, how can I keep the final YEH (small YEH, etc.) from being changed to Alef?

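import pyarabic.araby as araby

words = ['إِلَّا','إِلَىٰ','بِى','بِٱلْهُدَىٰ','بِٱلَّذِىٓ','بِلِقَآئِ','بُغِىَ','ٱلْمَأْوَىٰ']
for w in words:
    simple = w
    print(f'word : {simple}')

    # Step 1: remove the diacritics
    simple = araby.strip_diacritics(simple)
    print(f'strip_diacritics : {simple}')

    # Step 2: normalize Alef-like letters (this is where the final ى becomes ا)
    simple = araby.normalize_alef(simple)
    print(f'normalize_alef : {simple}')

    # Step 3: normalize the Hamza forms
    simple = araby.normalize_hamza(simple)
    print(f'normalize_hamza : {simple}')

    # Separator between words
    print('_' * 10)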

Output:

word : إِلَّا
strip_diacritics : إلا
normalize_alef : الا
normalize_hamza : الا
__________
word : إِلَىٰ
strip_diacritics : إلى
normalize_alef : الا
normalize_hamza : الا
__________
word : بِى
strip_diacritics : بى
normalize_alef : با
normalize_hamza : با
__________
word : بِٱلْهُدَىٰ
strip_diacritics : بٱلهدى
normalize_alef : بالهدا
normalize_hamza : بالهدا
__________
word : بِٱلَّذِىٓ
strip_diacritics : بٱلذى
normalize_alef : بالذا
normalize_hamza : بالذا
__________
word : بِلِقَآئِ
strip_diacritics : بلقائ
normalize_alef : بلقائ
normalize_hamza : بلقاء
__________
word : بُغِىَ
strip_diacritics : بغى
normalize_alef : بغا
normalize_hamza : بغا
__________
word : ٱلْمَأْوَىٰ
strip_diacritics : ٱلمأوى
normalize_alef : الماوا
normalize_hamza : الماوا

linuxscout commented 4 years ago

Salam, I will work on it. I think we can provide more detail about how Alef-like letters are converted.

Thanks
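
In the meantime, one possible workaround is to normalize only the Alef variants and leave a final ى untouched. The sketch below uses only the Python standard library. The names `ALEF_VARIANTS`, `normalize_alef_keep_yeh`, and `describe` are hypothetical helpers, not part of pyarabic, and the choice of which letters count as Alef variants is an assumption that may not match what the library does internally. The `describe` helper lists the Unicode name of each character, which helps check whether a final letter is ALEF MAKSURA (U+0649), YEH (U+064A), or FARSI YEH (U+06CC).

```python
import unicodedata

ALEF = "\u0627"  # ARABIC LETTER ALEF

# Assumption: only these four letters are folded into a bare Alef.
# ALEF MAKSURA (U+0649) is deliberately left out so a final ى survives.
ALEF_VARIANTS = {
    "\u0622": ALEF,  # ALEF WITH MADDA ABOVE
    "\u0623": ALEF,  # ALEF WITH HAMZA ABOVE
    "\u0625": ALEF,  # ALEF WITH HAMZA BELOW
    "\u0671": ALEF,  # ALEF WASLA
}
_ALEF_TABLE = str.maketrans(ALEF_VARIANTS)


def normalize_alef_keep_yeh(word: str) -> str:
    """Replace Alef variants with a bare Alef; every other letter is kept."""
    return word.translate(_ALEF_TABLE)


def describe(word: str) -> list[str]:
    """Return the code point and Unicode name of each character in word."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in word]


if __name__ == "__main__":
    # The words are assumed to have had their diacritics stripped already.
    for w in ["\u0625\u0644\u0649", "\u0628\u0649"]:
        print(w, "->", normalize_alef_keep_yeh(w), describe(w)[-1])
```

Because the table is applied with `str.translate`, the helper can be run after `araby.strip_diacritics` in place of `araby.normalize_alef` without changing the rest of the loop.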

linuxscout commented 4 years ago

I have added new features to the normalization functions:

>>> import pyarabic.araby as araby
>>> text1 = u"جاء سؤال الأئمة عن الإسلام آجلا"
>>> araby.normalize_hamza(text1)
'جاء سءال الءءمة عن الءسلام ءءجلا'
>>> araby.normalize_hamza(text1, method="tasheel")
'جاء سوال الايمة عن الاسلام اجلا'
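
To make the difference between the two outputs easier to see, here is a rough standard-library sketch of a mapping that behaves like the "tasheel" result in this example: each Hamza carrier is replaced by its seat letter, and a standalone Hamza is left alone. The mapping is inferred only from the single output shown above, so it is an assumption and may differ from what pyarabic actually does for other inputs. `TASHEEL_LIKE` and `tasheel_like` are hypothetical names, not library API.

```python
# Assumed mapping, inferred from the one example output in this thread.
TASHEEL_LIKE = str.maketrans({
    "\u0622": "\u0627",  # ALEF WITH MADDA ABOVE -> ALEF
    "\u0623": "\u0627",  # ALEF WITH HAMZA ABOVE -> ALEF
    "\u0625": "\u0627",  # ALEF WITH HAMZA BELOW -> ALEF
    "\u0624": "\u0648",  # WAW WITH HAMZA ABOVE  -> WAW
    "\u0626": "\u064A",  # YEH WITH HAMZA ABOVE  -> YEH
})


def tasheel_like(text: str) -> str:
    """Replace each Hamza carrier with its seat letter; other letters are kept."""
    return text.translate(TASHEEL_LIKE)


if __name__ == "__main__":
    # WAW WITH HAMZA in the middle of a word becomes a plain WAW.
    print(tasheel_like("\u0633\u0624\u0627\u0644"))
```

A standalone Hamza (U+0621) is not in the table, so a word such as جاء passes through unchanged, matching the example.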