meilisearch / charabia

Library used by Meilisearch to tokenize queries and documents
MIT License

Normalization Issue for Turkish Characters in Charabia #316

Open niyazialpay opened 5 days ago

niyazialpay commented 5 days ago

Hello everyone,

I previously opened an issue about this problem. It was reported as fixed by this pull request: https://github.com/meilisearch/charabia/pull/305#issuecomment-2410540235 . I waited for a release containing the fix, but since none arrived, I downloaded the code from GitHub and tested it by running it with Docker. However, the problem has not been resolved.

There is a normalization issue in Charabia when processing Turkish characters. Turkish has several unique characters such as "ç", "ğ", "ı", "İ", "ö", "ş", "ü" which need to be normalized correctly for accurate text processing and search indexing. Currently, these characters are not being normalized correctly, which leads to inaccuracies in search results and tokenization.

Steps to Reproduce:

Use Charabia to tokenize and normalize a text containing Turkish characters.
Compare the results with the expected normalized form of Turkish characters.

Example Text:

Original Text: "çalışma, günlük, İstanbul, İstasyon, ömür, şarkı, ütü"
Expected Normalized Form: "calisma, gunluk or ğunluk, istanbul, istasyon, omur, sarki, utu"

Current Behavior:

The Turkish characters are not normalized to their correct forms, leading to inconsistencies in search results.

Expected Behavior:

Turkish characters should be normalized as follows:

"ç" -> "c"
"ğ" -> "g"
"ı" -> "i"
"I" -> "ı"
"İ" -> "i"
"İ" -> "I"
"ö" -> "o"
"ş" -> "s"
"ü" -> "u"

Impact:

This issue affects the accuracy of search results and the effectiveness of tokenization for Turkish text. It is crucial for Charabia to handle these characters correctly to support Turkish language text processing adequately.

Proposed Solution:

Implement a normalization rule for Turkish characters in Charabia. Ensure that the normalization process correctly transforms Turkish characters to their expected forms.
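To make the expected result concrete, here is a minimal Python sketch of the folding listed under Expected Behavior. It is only an illustration of the desired output, not Charabia's implementation (Charabia is written in Rust); the names `TURKISH_FOLD` and `fold_turkish` are hypothetical. It assumes the search-oriented reading of the list, where "İ" folds to "i" as in the expected "istanbul", and leaves out the case-only entries ("I" -> "ı", "İ" -> "I").

```python
# Hypothetical illustration only: this is NOT Charabia's normalizer.
# The table contains the unambiguous pairs from the Expected Behavior list.
TURKISH_FOLD = str.maketrans({
    "ç": "c",
    "ğ": "g",
    "ı": "i",
    "İ": "i",
    "ö": "o",
    "ş": "s",
    "ü": "u",
})


def fold_turkish(text: str) -> str:
    """Fold the Turkish letters listed in the report to ASCII, then lowercase."""
    # Translate before lowercasing: Python's str.lower() turns "İ" into
    # "i" followed by U+0307 (combining dot above), which would not equal "i".
    return text.translate(TURKISH_FOLD).lower()


original = "çalışma, günlük, İstanbul, İstasyon, ömür, şarkı, ütü"
print(fold_turkish(original))
# prints: calisma, gunluk, istanbul, istasyon, omur, sarki, utu
```

With this folding, a query such as "istanbul" and a document containing "İstanbul" end up with the same normalized form, which is the behavior requested above.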

To assist you better, I'm also sharing the dump of the data I'm using. https://depo.niyazialpay.com/20240827-141437507.dump

References:

[4 images attached]

Thank you for addressing this issue. Accurate normalization for Turkish characters will significantly improve the performance and reliability of Charabia for Turkish language text processing.

ManyTheFish commented 4 days ago

Hello @niyazialpay, I see you are using v1.8.0, which is prior to the fix. Could you retry with v1.11.0-rc.1?

I suggested trying it in the following comments: https://github.com/meilisearch/charabia/issues/294#issuecomment-2413752147

Did you have any issues with it?

niyazialpay commented 4 days ago

The images look like this because I copied them from an old issue. I cloned the repository from GitHub, built version 1.11 locally using docker build, ran it, and after importing the data I saw that it behaved the same way. I've updated the images with current ones.