meilisearch / charabia

Library used by Meilisearch to tokenize queries and documents
MIT License

Persian language support for normalization and segmentation #304

Open Ja7ad opened 2 months ago

Ja7ad commented 2 months ago

Hello

Thank you for your continuous efforts in maintaining and improving Charabia. I’m writing to request support for the Persian language in your normalization and segmentation modules, similar to the existing support for Arabic.

Background

Persian (Farsi) is a widely spoken language, using the same script as Arabic with some additional letters. Although Persian shares many similarities with Arabic, there are important differences in orthography, morphology, and syntax that require distinct handling for proper text processing, especially in tasks like tokenization, normalization, and segmentation.

Feature Request

I would like to request the addition of Persian language support for:

  1. Normalization:

    • Handling Persian-specific characters, such as "گ", "چ", "پ", "ژ".
    • Differentiating between Arabic and Persian diacritics and letters where applicable (e.g., "ی" vs. "ي", "ک" vs. "ك").
    • Normalizing Persian numerals (۰-۹) and ensuring compatibility with Arabic numerals where necessary.
  2. Segmentation:

    • Properly segmenting Persian text based on its unique grammatical structure.
    • Handling word boundaries and tokenization in the context of Persian, considering the language's syntax and morphology.
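
As a rough illustration of the normalization points above, here is a minimal Rust sketch. It is not Charabia code: `normalize_persian_char` and `normalize_persian` are hypothetical names, and both the direction of the letter folding (Arabic forms into Persian forms) and the choice to map both digit sets to ASCII are assumptions made for the example.

```rust
/// Hypothetical helper, not part of Charabia: folds one character into an
/// assumed canonical form for Persian text.
fn normalize_persian_char(c: char) -> char {
    match c {
        // Arabic yeh (U+064A) -> Farsi yeh (U+06CC); direction is an assumption
        '\u{064A}' => '\u{06CC}',
        // Arabic kaf (U+0643) -> keheh (U+06A9); direction is an assumption
        '\u{0643}' => '\u{06A9}',
        // Arabic-Indic digits (U+0660..=U+0669) -> ASCII digits
        '\u{0660}'..='\u{0669}' => char::from(b'0' + (c as u32 - 0x0660) as u8),
        // Extended Arabic-Indic (Persian) digits (U+06F0..=U+06F9) -> ASCII digits
        '\u{06F0}'..='\u{06F9}' => char::from(b'0' + (c as u32 - 0x06F0) as u8),
        // Everything else, including Persian-specific letters, is left untouched
        _ => c,
    }
}

/// Hypothetical helper: applies the per-character folding to a whole string.
fn normalize_persian(text: &str) -> String {
    text.chars().map(normalize_persian_char).collect()
}
```

With this sketch, `normalize_persian("۱۲۳")` yields `"123"`, and the Persian-specific letters "گ", "چ", "پ", "ژ" pass through unchanged.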

[Two screenshots attached, dated 2024-08-12]

References

To aid in this implementation, here are the links to the current normalization and segmentation implementations for Arabic, which can serve as a starting point for Persian:

Conclusion

Implementing Persian language support would greatly benefit users who need to process Persian text accurately. Persian is distinct enough from Arabic that this feature would significantly improve text processing capabilities for Persian-speaking users. I’m happy to contribute in any way I can to support this effort.

Ja7ad commented 2 months ago

@curquiza @Kerollmops I have run into an issue with the implementation: whatlang doesn't support Persian script.

Persian uses many Unicode characters that Arabic does not. For example:

https://www.unicode.org/charts/PDF/U0600.pdf

[Four images attached]

I can't get the normalization tests to pass because whatlang doesn't support Persian script.

The whatlang repo is old and has had no activity on adding Persian script.

https://github.com/Ja7ad/charabia/commit/f9b58e02e4e701a7f1323a1bd5caa0c65f9974a9

https://github.com/Ja7ad/charabia/commit/029423a1a92b18cf9d5f724abdccbc84230a9826

I think it would be better for Meilisearch to fork whatlang and update the crate.

ManyTheFish commented 2 months ago

Hello @Ja7ad, WhatLang doesn't support Persian script, but which script is assigned to Persian instead? Arabic? If I understand correctly, Arabic and Persian share a lot of characters; if that's the case, I'd like to consider them as the same script for Charabia. This would avoid splitting words into parts. If everything were considered Arabic, would it be relevant to apply this normalization to any Arabic language?

Thank you for all the details!

Ja7ad commented 2 months ago

Some Persian characters are not supported in Arabic; please see the attached screenshots.

ManyTheFish commented 2 months ago

Yes, I understood that. However, the technical approach of Charabia is a simplification of the real linguistic state of languages. For instance, the characters you listed before are considered Arabic by Charabia even if that's not exactly true. But considering Persian and Arabic as the same script is convenient if they share a lot of common characters.

For instance, Chinese and Japanese are completely different but share some characters, the kanji. This forces Charabia to have a "virtual" script Cj containing both scripts, which avoids splitting a word in two because it contains different scripts.

The real question on my side is: are there Persian characters, also used in the Arabic language, that we should normalize in Persian text but that shouldn't be normalized in an Arabic context?

If yes, is Persian a Language or a Script? If no, normalizing your characters anyway should work.
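
To make the "same script" idea above concrete, here is a minimal Rust sketch. It is an illustration only, not how Charabia or whatlang classify characters: the `Script` enum, `script_of`, and `split_by_script` are hypothetical names, and classifying by the single Unicode block U+0600 to U+06FF is a simplifying assumption. Because the Persian-specific letters live in that same block, a Persian word stays in one run instead of being split.

```rust
/// Hypothetical, simplified script classes (not Charabia's `Script` type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Script {
    Arabic,
    Latin,
    Other,
}

/// Assumption: anything in the Arabic block U+0600..=U+06FF counts as Arabic,
/// which also covers Persian-specific letters such as U+067E and U+06AF.
fn script_of(c: char) -> Script {
    match c {
        '\u{0600}'..='\u{06FF}' => Script::Arabic,
        'a'..='z' | 'A'..='Z' => Script::Latin,
        _ => Script::Other,
    }
}

/// Splits text into consecutive runs of characters sharing the same script.
fn split_by_script(text: &str) -> Vec<(Script, String)> {
    let mut runs: Vec<(Script, String)> = Vec::new();
    for c in text.chars() {
        let script = script_of(c);
        // Extend the current run when the script is unchanged
        if let Some((last, buf)) = runs.last_mut() {
            if *last == script {
                buf.push(c);
                continue;
            }
        }
        // Otherwise start a new run
        runs.push((script, c.to_string()));
    }
    runs
}
```

Under these assumptions, a word mixing shared letters and Persian-specific letters (for example U+0633 followed by U+06AF) comes back as a single `Script::Arabic` run.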

Ja7ad commented 2 months ago

Yes, Persian is a language. Even the segmentation is different.

kamiyn commented 2 months ago

I agree with this issue.

Arabic and Persian use many of the same letters, but they are quite different languages. They belong to different language families ( https://en.wikipedia.org/wiki/Language_family ) and their grammar is completely different.

Persian has grammar that is closer to European languages than Arabic.

As a Japanese person, I feel that the difference between Persian and Arabic is similar to the one between Japanese and Chinese.