Wrong matching for Arabic

curquiza commented 3 years ago

ahmedkrmn commented 2 years ago

Hi Clémentine, I've been trying to work on this issue. After some testing, I came to the following conclusion:

v0.1.2 [Good]: Works as expected. Searching with ا displays words with both ا and أ. Searching with أ displays words with both أ and ا. This is the expected behavior when searching Arabic words.
v0.1.3 [Bad]: Introduced in d9ee1326fe9eca138f49b758bfa1c4bdb1aa4807. Searching with ا displays words with both ا and أ, but searching with أ displays neither.
v0.1.4 [Bad]: Same behavior as v0.1.3.
v0.2.0 through main [Bad]: Searching with ا displays words with ا only. Searching with أ displays words with أ only.

What do you suggest doing to fix this?

curquiza commented 2 years ago

Hello @ahmedkrmn, thanks for your interest! 😁

@ManyTheFish can help you with this when he has the time :)

ManyTheFish commented 2 years ago

Hello @ahmedkrmn, are you sure that deunicoding Arabic script is a good thing to do? The sentence

المتعة والمرح في تعلم العربية

would be deunicoded as

lmt`@ wlmrH fy t`lm l`rby@

🤔

I can't write Arabic script, so I don't know what the correct behavior should be.

Reex11 commented 2 years ago

Hello @ManyTheFish, I believe that the characters أ ا إ آ should be processed in a way similar to "lowercasing". So when a user searches for a query containing, for example, احمد, they should receive all of these variations: أحمد احمد إحمد آحمد.
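A minimal sketch of the folding described above, assuming the simplest possible rule: the three hamza/madda-carrying Alef variants are mapped to bare Alef before comparison. The function names `normalize_alef` and `matches` are hypothetical and are not part of charabia's API.

```rust
/// Map the Alef variants آ (U+0622), أ (U+0623) and إ (U+0625)
/// to bare Alef ا (U+0627). All other characters pass through unchanged.
/// Hypothetical helper for illustration only.
fn normalize_alef(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{0622}' | '\u{0623}' | '\u{0625}' => '\u{0627}',
            other => other,
        })
        .collect()
}

/// Two strings are considered a match when they are equal after folding.
/// Hypothetical helper for illustration only.
fn matches(query: &str, candidate: &str) -> bool {
    normalize_alef(query) == normalize_alef(candidate)
}
```

With this rule, a query written with any of the four Alef forms compares equal to a document word written with any other, which is the v0.1.2 behavior reported earlier in the thread.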

ManyTheFish commented 2 years ago

Hello @Reex11, I will investigate your case. 🤔 I checked whether Rust's lowercase function could help us, but it doesn't: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=92777875ea531819640dfedea3d42395
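The gist itself is not reproduced here. As an independent sketch of the same observation: Arabic script has no letter case, so `str::to_lowercase` leaves it unchanged and cannot fold أ into ا. The helper name `lowercase_changes` is hypothetical.

```rust
/// Returns true when Rust's Unicode-aware lowercasing alters the string.
/// Hypothetical helper for illustration only.
fn lowercase_changes(s: &str) -> bool {
    s.to_lowercase() != s
}
```

Calling it on a Latin word such as "Ahmed" returns true, while calling it on أحمد returns false, so the Alef variants stay distinct after lowercasing.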

Is there a name for this process that is similar to "lowercasing"? That could help me find a library or a function that would do the job.

Thanks for your help 😁

Reex11 commented 2 years ago

Thank you for investigating this. Actually, it's not literally lowercasing 😅; it's basically a kind of normalization process. A new Arabic text processing library, Maha, was released recently. It should be very useful, as it has already done a lot of work on Arabic text normalization.

You need to know this first: the letter ا is called Alef, the symbol ء is called Hamza, and the letter أ is Alef with Hamza above.

Now, this library calls this process normalization, which I believe is right. Here you can find Alef Variations, and here you can find what is called Alef Variations Normalization.

I'll dig around to see if there's anything else to consider.

Reex11 commented 2 years ago

Hi again, I found the following:

Harakat (like these: َ ِ ُ) should be ignored entirely, so removing them should be part of the normalization process.
The Ta' Marbota letter ة should be normalized to the Ha' letter ه.
The Waw letter و is usually a stop word (it means "and"), but there may be an issue here because Waw is not always a stop word. For example, in سماء وأرض the Waw is a stop word (translation: Sky and Earth). In other cases it is not a stop word: in كتاب وليد the Waw is part of an actual word (translation: Waleed's Book).

I'll look for a solution to the Waw stop word issue; I already have some workarounds in mind. I understand that you may have difficulty understanding some parts of the language, so let me know if you need any help.
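The first two rules above can be sketched as follows. This assumes that "Harakat" covers the combining marks U+064B through U+0652 (the tanween forms, Fatha, Damma, Kasra, Shadda and Sukun); that range is an assumption of this sketch, not something specified in the thread. The function names are hypothetical. The Waw rule is left out on purpose, since it cannot be decided character by character.

```rust
/// True for the Arabic combining marks U+064B..=U+0652.
/// Assumption: this range is what "Harakat" refers to here.
fn is_haraka(c: char) -> bool {
    ('\u{064B}'..='\u{0652}').contains(&c)
}

/// Drop Harakat and map Ta' Marbota ة (U+0629) to Ha' ه (U+0647).
/// Hypothetical helper for illustration only.
fn strip_harakat_and_fold_ta_marbota(input: &str) -> String {
    input
        .chars()
        .filter(|c| !is_haraka(*c))
        .map(|c| match c {
            '\u{0629}' => '\u{0647}',
            other => other,
        })
        .collect()
}
```

For example, the vocalized word مَدْرَسَة would come out as مدرسه, so a query typed without Harakat or with ه in place of ة would compare equal to it after both sides are normalized.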

ManyTheFish commented 2 years ago

Hello @Reex11! Thanks for your help. We have to design or find a specialized normalizer for this. I have a question about tokenization: are words only space-separated?

Reex11 commented 2 years ago

Hi @ManyTheFish. First, you should know that I have basic knowledge of NLP.

I think there are a lot of cases that are not space-separated, but it's OK to start with space separation (I believe this is the general case in the Arabic-supporting tokenizers I have seen). However, there are some important and common cases that need to be considered to improve the search results, such as And => و and The => الـ.

Example: الشجرة => The Tree is a combination of الـ and شجرة. الـ is equivalent to The, and it's always connected (not space-separated) to the next word.

I found a great Arabic NLP library; I think it's the best so far. It's called CAMeL Tools.
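A deliberately naive sketch of that idea: split on whitespace first, then peel a leading ال off each word. The function name `segment` and the minimum-length guard are assumptions of this sketch. It will wrongly split words in which ال belongs to the word itself, and it does not touch the Waw prefix because of the ambiguity described earlier; a real segmenter would need a dictionary or a morphological analyzer.

```rust
/// The definite article, Alef + Lam (U+0627 U+0644). The form الـ used in
/// the discussion carries a Tatweel (U+0640) for display purposes only.
const AL: &str = "\u{0627}\u{0644}";

/// Split on whitespace, then separate a leading definite article from the
/// rest of the word when at least two characters remain.
/// Hypothetical helper for illustration only; the length guard is arbitrary.
fn segment(text: &str) -> Vec<&str> {
    let mut tokens = Vec::new();
    for word in text.split_whitespace() {
        match word.strip_prefix(AL) {
            Some(rest) if rest.chars().count() >= 2 => {
                tokens.push(AL);
                tokens.push(rest);
            }
            _ => tokens.push(word),
        }
    }
    tokens
}
```

With this sketch, الشجرة yields the two tokens ال and شجرة, while كتاب وليد yields the two space-separated words unchanged.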

curquiza commented 2 years ago

Closed in favor of https://github.com/meilisearch/product/discussions/139. Any contribution to add an Arabic normalizer and segmenter is welcome!

meilisearch / charabia

Wrong matching for Arabic #36