meilisearch / charabia

Library used by Meilisearch to tokenize queries and documents
MIT License

Normalization Issue for Turkish Characters in Charabia #294

Closed · niyazialpay closed this issue 3 months ago

niyazialpay commented 5 months ago

Hello everyone,

There is a normalization issue in Charabia when processing Turkish characters. Turkish has several unique characters, such as "ç", "ğ", "ı", "İ", "ö", "ş", and "ü", which need to be normalized correctly for accurate text processing and search indexing. Currently, these characters are not normalized correctly, which leads to inaccuracies in search results and tokenization.

Steps to Reproduce:

1. Use Charabia to tokenize and normalize a text containing Turkish characters.
2. Compare the results with the expected normalized form of the Turkish characters.

Example Text:

Original Text: "çalışma, günlük, İstanbul, İstasyon, ömür, şarkı, ütü"
Expected Normalized Form: "calisma, gunluk or ğunluk, istanbul, istasyon, omur, sarki, utu"

Current Behavior:

The Turkish characters are not normalized to their correct forms, leading to inconsistencies in search results.

Expected Behavior:

Turkish characters should be normalized as follows:

"ç" -> "c"
"ğ" -> "g"
"ı" -> "i"
"I" -> "ı"
"İ" -> "i"
"İ" -> "I"
"ö" -> "o"
"ş" -> "s"
"ü" -> "u"

Impact:

This issue affects the accuracy of search results and the effectiveness of tokenization for Turkish text. It is crucial for Charabia to handle these characters correctly to support Turkish language text processing adequately.

Proposed Solution:

Implement a normalization rule for Turkish characters in Charabia. Ensure that the normalization process correctly transforms Turkish characters to their expected forms.
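
As a rough illustration of the rule (plain Rust, not charabia's actual normalizer interface; the function name is hypothetical, and here all i-variants are folded to a plain "i" for search purposes, which simplifies the case-mapping rows in the table above):

```rust
/// Hypothetical mapping of Turkish characters to their normalized forms,
/// following the table in this issue. Illustrative only; this is not
/// charabia's actual normalizer API.
fn normalize_turkish_char(c: char) -> char {
    match c {
        'ç' | 'Ç' => 'c',
        'ğ' | 'Ğ' => 'g',
        // Dotless 'ı', dotted capital 'İ', and ASCII 'I' all fold to "i"
        // so that queries match regardless of Turkish casing rules.
        'ı' | 'İ' | 'I' => 'i',
        'ö' | 'Ö' => 'o',
        'ş' | 'Ş' => 's',
        'ü' | 'Ü' => 'u',
        other => other,
    }
}

fn main() {
    let normalized: String = "çalışma günlük İstanbul"
        .chars()
        .map(normalize_turkish_char)
        .collect();
    assert_eq!(normalized, "calisma gunluk istanbul");
    println!("{normalized}");
}
```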

References:

[Screenshots attached in the original issue showing the behavior]

Thank you for addressing this issue. Accurate normalization for Turkish characters will significantly improve the performance and reliability of Charabia for Turkish language text processing.

ManyTheFish commented 3 months ago

Hello @niyazialpay, @tkhshtsh0917 made a PR to fix this issue: Add Turkish normalizer. Do you think the changes are sufficient to close this issue?

Thanks!

niyazialpay commented 3 months ago

Thank you. Have these changes been included in the current 1.10.0 version, or should I wait for the next update? When I test it now, the behavior still looks the same as before.

ManyTheFish commented 3 months ago

The changes will be integrated into the next Meilisearch version, v1.11.0. 😃 So there is no change in the current version for now.

niyazialpay commented 1 month ago

Hello,

I've been waiting for version 1.11 for a while, but it hasn't been released yet. When I pulled the current state of the repository from GitHub and ran it with Docker to test, I saw that the issue from the first screenshots I sent still persists. Could you please check it again? If you want, I can provide the relevant data dump.

https://depo.niyazialpay.com/20240827-141437507.dump

ManyTheFish commented 1 month ago

Hello @niyazialpay, the normalizer should be part of the v1.11 release in two weeks. I've tried your dump with v1.11, and below are the results:

[Two screenshots taken 2024-10-15 showing search results on v1.11 with the provided dump]

It looks good to me, am I wrong?

You can try the pre-release with the following docker image:

getmeili/meilisearch:v1.11.0-rc.1

Let me know if it doesn't fit your expectations.

niyazialpay commented 1 month ago

Hello,

When I downloaded the current GitHub repository using git clone and built it with docker build, the result I got while testing version 1.11 unfortunately looks the same. However, when I tested with the Docker image getmeili/meilisearch:v1.11.0-rc.1 as you mentioned, the issue doesn't appear. So, what is the difference between these two? I see the version number 1.11 in both cases.

ManyTheFish commented 1 month ago

@niyazialpay, on which commit are you building Meilisearch?