linuxscout / pyarabic

pyarabic
GNU General Public License v3.0

Tokenize words #52

Closed: aljbri closed this issue 3 years ago

aljbri commented 3 years ago

In the tokenize part, it doesn't separate the character و from the word when it is not part of the original word, as in this example:

>>> from pyarabic.araby import tokenize, is_arabicrange, strip_tashkeel
>>> text = u"ِاسمٌ الكلبِ في اللغةِ الإنجليزية Dog واسمُ الحمارِ Donky"
>>> tokenize(text, conditions=is_arabicrange, morphs=strip_tashkeel)
        ['اسم', 'الكلب', 'في', 'اللغة', 'الإنجليزية', 'واسم', 'الحمار']
linuxscout commented 3 years ago

Salam, thank you for your message. The tokenization process only separates words from text; it doesn't perform any analysis on the words themselves. If you want to get lemmas or stems from words, I suggest using the Qalsadi morphological analyzer, or you can use a stemmer such as Tashaphyne to extract stems.
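
To illustrate why tokenization alone cannot do this, here is a minimal sketch (not part of pyarabic) of a naive post-processing step that strips a leading waw from each token. The heuristic is deliberately simplistic: it will also strip a waw that belongs to the word's root, which is exactly why a morphological analyzer like Qalsadi is needed for correct results.

```python
# Naive illustration: treat a leading waw as a conjunction prefix and strip it.
# WARNING: this wrongly strips words whose root genuinely starts with waw,
# which is why real morphological analysis is required.
WAW = "\u0648"  # Arabic letter waw (و)

def strip_conjunction(token):
    # Only strip if something plausible remains after removing the prefix.
    if token.startswith(WAW) and len(token) > 2:
        return token[1:]
    return token

tokens = ['اسم', 'الكلب', 'في', 'اللغة', 'الإنجليزية', 'واسم', 'الحمار']
print([strip_conjunction(t) for t in tokens])
# → ['اسم', 'الكلب', 'في', 'اللغة', 'الإنجليزية', 'اسم', 'الحمار']
```

Note how 'واسم' becomes 'اسم' here, but the same rule would mangle a word whose first radical is waw; a stemmer or analyzer resolves that ambiguity using the word's morphology rather than its surface form.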