meilisearch / charabia

Library used by Meilisearch to tokenize queries and documents
MIT License

Wrong matching for Arabic #36

curquiza closed 2 years ago

curquiza commented 3 years ago

Related to https://github.com/meilisearch/MeiliSearch/issues/1331

ahmedkrmn commented 2 years ago

Hi Clémentine, I've been trying to work on this issue. After some testing, I came to the following conclusion:

What do you suggest doing to fix this?

curquiza commented 2 years ago

Hello @ahmedkrmn thanks for your interest! 😁

@ManyTheFish can help you with this when he has the time :)

ManyTheFish commented 2 years ago

Hello @ahmedkrmn, are you sure that deunicoding Arabic script is a good thing to do? The sentence

المتعة والمرح في تعلم العربية

would be deunicoded as

lmt`@ wlmrH fy t`lm l`rby@

🤔

I can't write Arabic script, so I don't know what the correct behavior should be.

Reex11 commented 2 years ago

Hello @ManyTheFish, I believe that the characters أ ا إ آ should be processed in a way similar to "lowercasing". So when a user searches for a query containing, for example, احمد, they should receive all of these variations: أحمد احمد إحمد آحمد.

ManyTheFish commented 2 years ago

Hello @Reex11, I will investigate your case. 🤔 I checked whether Rust's lowercase function could help us, but it does not: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=92777875ea531819640dfedea3d42395
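For readers who cannot open the playground link, here is a minimal sketch of that kind of check. It is not the content of the linked gist; the helper name `distinct_after_lowercase` is hypothetical. It relies only on the fact that Arabic letters have no case mappings, so `str::to_lowercase` leaves them unchanged.

```rust
use std::collections::HashSet;

/// Counts how many distinct strings remain after lowercasing each input.
/// (Hypothetical helper, for illustration only.)
fn distinct_after_lowercase(words: &[&str]) -> usize {
    words
        .iter()
        .map(|w| w.to_lowercase())
        .collect::<HashSet<String>>()
        .len()
}
```

Called on the four spellings أحمد, احمد, إحمد and آحمد, this still reports four distinct strings, whereas "Ahmed", "ahmed" and "AHMED" collapse to one. Lowercasing alone therefore cannot merge the Alef variants.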

Is there a name for this process that is similar to "lowercasing"? This could help me find a library or a function that would do the job.

Thanks for your help 😁

Reex11 commented 2 years ago

Thank you for investigating this. Actually, it's not literally lowercasing 😅, it's basically a kind of normalization process. Maha, a recently released Arabic text processing library, will be very useful here, as it has already done a lot of work on Arabic text normalization.

You need to know this first: the letter ا is called Alef; the symbol ء is called Hamza; the letter أ is Alef with Hamza above.

Now, this library calls this process normalization, which I believe is right. Here you can find Alef Variations, and here you can find what is called Alef Variations Normalization.
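A minimal sketch of what such an Alef normalization could look like in Rust. This is not Maha's implementation and not charabia's API; the function name `normalize_alef` is hypothetical, and it assumes the input uses the precomposed code points (Alef followed by a separate combining Hamza is not handled).

```rust
/// Maps the Alef variants mentioned above to bare Alef (U+0627).
/// All other characters are returned unchanged.
/// (Hypothetical helper, for illustration only.)
fn normalize_alef(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{0622}'      // آ Alef with Madda above
            | '\u{0623}'    // أ Alef with Hamza above
            | '\u{0625}'    // إ Alef with Hamza below
                => '\u{0627}', // ا Alef
            other => other,
        })
        .collect()
}
```

If both the indexed documents and the query go through the same function, the spellings أحمد, احمد, إحمد and آحمد all become احمد and can match each other.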

I'll dig around to see if there's anything else to consider.

Reex11 commented 2 years ago

Hi again, I found the following:

I'll look for a solution for the Waw stopword, and I already have some workarounds in mind. I understand that you may have difficulty understanding some parts of the language, so let me know if you need any help.

ManyTheFish commented 2 years ago

Hello @Reex11! Thanks for your help. We have to design or find a specialized normalizer for this. I have a question about tokenization: are words only space-separated?

Reex11 commented 2 years ago

Hi @ManyTheFish, first, you should know that I have basic knowledge of NLP.

I think there are many cases that are not space-separated, but it's OK to start with space separation (I believe this is the general case in the Arabic-supporting tokenizers I have seen). However, there are some important and common cases that need to be considered to improve the search results, such as And => و and The => الـ.

Example: الشجرة (The Tree) is a combination of الـ and شجرة. الـ is equivalent to The, and it's always connected (not space-separated) to the next word.

I found a great Arabic NLP library; I think it's the best so far. It's called CAMeL Tools.
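A deliberately naive sketch of splitting that attached article from a token, in Rust. The function name `split_definite_article` and the minimum-length rule are assumptions made for illustration; a real segmenter needs a dictionary or morphological analysis, because some words simply begin with the letters ال without carrying the article.

```rust
/// Naively splits a leading definite article "ال" (Alef + Lam) from a token.
/// Returns (Some(article), rest) on a split, or (None, token) otherwise.
/// (Hypothetical helper, for illustration only.)
fn split_definite_article(token: &str) -> (Option<&str>, &str) {
    const AL: &str = "\u{0627}\u{0644}"; // ال
    match token.strip_prefix(AL) {
        // Assumed heuristic: keep very short tokens whole by requiring
        // at least two characters after the prefix.
        Some(rest) if rest.chars().count() >= 2 => (Some(AL), rest),
        _ => (None, token),
    }
}
```

With this sketch, الشجرة splits into ال and شجرة, while شجرة on its own is returned unchanged.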

curquiza commented 2 years ago

Closed in favor of https://github.com/meilisearch/product/discussions/139

Any contribution to add an Arabic normalizer and segmenter is welcome!