Closed curquiza closed 2 years ago
Hi Clémentine, I've been trying to work on this issue. After some testing, I came to the following conclusion:
v0.1.2 [Good]: Works as expected. Searching with ا displays words with both ا and أ. Searching with أ displays words with both أ and ا. This is the expected behavior when searching Arabic words.
v0.1.3 [Bad]: Introduced in d9ee1326fe9eca138f49b758bfa1c4bdb1aa4807. Searching with ا displays words with both ا and أ, but searching with أ displays neither.
v0.1.4 [Bad]: Same behavior as v0.1.3.
v0.2.0 till main [Bad]: Searching with ا displays words with ا only. Searching with أ displays words with أ only.
What do you suggest doing to fix this?
Hello @ahmedkrmn thanks for your interest! 😁
@ManyTheFish can help you with this when he has the time :)
Hello @ahmedkrmn, are you sure that deunicoding Arabic script is a good thing to do? The sentence
المتعة والمرح في تعلم العربية
would be deunicoded as
lmt`@ wlmrH fy t`lm l`rby@
🤔
I can't write Arabic script, so I don't know what the correct behavior should be.
Hello @ManyTheFish,
I believe that the characters أ ا إ آ should be processed in a way similar to "lowercasing".
So when a user searches for a query containing, for example, احمد, they should receive all of these variations: أحمد احمد إحمد آحمد.
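To make the suggestion above concrete, here is a minimal Rust sketch of folding the Alef variants to a bare Alef, assuming the variants are stored as the precomposed code points U+0622, U+0623 and U+0625. The function name `normalize_alef` is hypothetical and is not part of Meilisearch or its tokenizer; applying the same fold to both indexed words and queries is what would make the four spellings match each other.

```rust
/// Folds the Alef variants discussed in this thread to a bare Alef (U+0627).
/// Hypothetical helper for illustration only, not an existing Meilisearch API.
fn normalize_alef(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            // U+0622 Alef with Madda above, U+0623 Alef with Hamza above,
            // U+0625 Alef with Hamza below: all map to U+0627 Alef.
            '\u{0622}' | '\u{0623}' | '\u{0625}' => '\u{0627}',
            // Every other character is left untouched.
            other => other,
        })
        .collect()
}
```

With this fold, the four spellings of the name in the example all normalize to the same string, so a query typed with any of them would match a document written with any other.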
Hello @Reex11, I will investigate your case. 🤔 I checked whether Rust's lowercase function could help us, but it doesn't: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=92777875ea531819640dfedea3d42395
Is there a name for this process that is similar to "lowercasing"? This could help me find a library or a function that would do the job.
Thanks for your help 😁
Thank you for investigating this. Actually, it's not literally lowercasing 😅; it's basically a kind of normalization process. There is a recently released Arabic text processing library, Maha, which will be very useful as it has already done a lot of work on Arabic text normalization.
You need to know this first:
ا this letter is called Alef
ء this symbol is called Hamza
أ this letter is Alef with Hamza above
Now, this library calls this process normalization, which I believe is right.
Here you can find Alef Variations
And here you can find what is called Alef Variations Normalization
I'll dig around to see if there's anything else to consider.
Hi again, I found the following:
َ
ِ
ُ
- should be totally ignored. So, removing them should be part of the normalization process.
ة should be normalized to the Ha' letter ه.
و is usually a stop word (it means "and"), but there may be an issue here because the letter Waw is not always a stop word.
For example, in سماء وأرض the Waw letter is a stop word (translation: Earth and Sky).
But in other cases it's not a stop word. For example, in كتاب وليد the Waw letter is part of an actual word (translation: Waleed's Book).
I'll look for a solution for the Waw stop word, and I already have some workarounds in mind. I understand that you may face difficulties in understanding some parts of the language, so let me know if you need any help.
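The first two rules in this comment can be sketched in Rust as follows. This is only an illustration under assumptions: the function name is hypothetical, and the diacritic range U+064B to U+0652 is wider than the three marks listed above (it also covers the tanween forms, Shadda and Sukun), which is a guess that they should be treated the same way. The Waw case is deliberately left out, because a character-level rule cannot tell a conjunction from the first letter of a word.

```rust
/// Removes short-vowel diacritics and folds Teh Marbuta (U+0629) to Heh (U+0647).
/// Hypothetical helper for illustration only, not an existing Meilisearch API.
fn strip_diacritics_and_fold_teh_marbuta(input: &str) -> String {
    input
        .chars()
        // Assumption: drop the whole range U+064B..=U+0652, which contains
        // Fatha (U+064E), Damma (U+064F) and Kasra (U+0650) among others.
        .filter(|c| !matches!(*c, '\u{064B}'..='\u{0652}'))
        // Teh Marbuta becomes Heh; all other characters pass through.
        .map(|c| if c == '\u{0629}' { '\u{0647}' } else { c })
        .collect()
}
```

Folding Teh Marbuta to Heh loses information, so it is better suited to a search-time normalizer applied to both documents and queries than to anything that rewrites stored text.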
Hello @Reex11! Thanks for your help. We have to design or find a specialized normalizer for this. I have a question about tokenization: are words only space-separated?
Hi @ManyTheFish. First, you should know that I have basic knowledge of NLP.
I think that there are a lot of cases that are not space-separated.
But it's OK to start with space separation (and I believe that this is the general case in the Arabic-supporting tokenizers I have seen).
However, there are some important and common cases that need to be considered to improve the search results, such as And => و and The => الـ
Example:
الشجرة => The Tree is a combination of الـ and شجرة
الـ is equivalent to The, and it's always connected (not space-separated) to the next word.
I found a great Arabic NLP library; I think it's the best so far. It's called CAMeL Tools.
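As a rough sketch of the "start with space separation" idea plus the attached definite article, the following Rust splits on whitespace and strips a leading Alef-Lam prefix. Both function names are hypothetical, and the rule that at least two letters must remain after the prefix is an arbitrary assumption. It is a naive heuristic: words that merely begin with these two letters would be wrongly shortened, which is the kind of case a proper segmenter such as the libraries mentioned in this thread would need to handle.

```rust
/// Strips the attached definite article (Alef + Lam, U+0627 U+0644) from a word.
/// Hypothetical helper; assumes at least two letters must remain after the prefix.
fn strip_definite_article(word: &str) -> &str {
    match word.strip_prefix("\u{0627}\u{0644}") {
        Some(rest) if rest.chars().count() >= 2 => rest,
        // No article, or too little left over: keep the word unchanged.
        _ => word,
    }
}

/// Splits text on whitespace, then removes the definite article from each word.
/// Hypothetical helper for illustration only, not an existing Meilisearch API.
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|word| strip_definite_article(word).to_string())
        .collect()
}
```

The same prefix-stripping approach does not carry over safely to the conjunction Waw, since removing a leading Waw would also damage words where it is the first letter of the word itself.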
Closed in favor of https://github.com/meilisearch/product/discussions/139. Any contribution to add an Arabic normalizer and segmenter is welcome!
Related to https://github.com/meilisearch/MeiliSearch/issues/1331