Open Ja7ad opened 3 months ago
@curquiza @Kerollmops I have an issue with the implementation: whatlang doesn't support the Persian script.
Persian uses many Unicode characters that the Arabic alphabet does not cover. For example, see:
https://www.unicode.org/charts/PDF/U0600.pdf
Because of this, I can't pass the normalization test.
The whatlang repo is old and has had no activity toward adding Persian script support.
https://github.com/Ja7ad/charabia/commit/f9b58e02e4e701a7f1323a1bd5caa0c65f9974a9
https://github.com/Ja7ad/charabia/commit/029423a1a92b18cf9d5f724abdccbc84230a9826
I think it would be better for Meilisearch to fork whatlang and update the crate.
Hello @Ja7ad, WhatLang doesn't support Persian script, but which script is assigned to Persian instead? Arabic? If I understand well, Arabic and Persian share a lot of characters; if that's the case, I'd like to consider them as the same script for Charabia. This would avoid splitting words into parts. If everything were considered Arabic, would it be relevant to apply this normalization to any Arabic Language?
Thank you for all the precision!
Some characters in Persian are not supported in Arabic. Please see the attached screenshot.
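To make the point above concrete, here is a minimal Rust sketch that uses only the standard library. It lists a few letters that Persian uses but the standard Arabic alphabet does not, all of which sit inside the Unicode Arabic block (U+0600 to U+06FF). The function names are hypothetical and the letter list is illustrative rather than exhaustive; none of it is taken from whatlang or Charabia.

```rust
/// Returns true if `c` lies in the Unicode "Arabic" block (U+0600..=U+06FF).
fn in_arabic_block(c: char) -> bool {
    ('\u{0600}'..='\u{06FF}').contains(&c)
}

/// Letters used in Persian but not in the standard Arabic alphabet.
/// Illustrative list (an assumption of this sketch), not exhaustive.
fn is_persian_specific(c: char) -> bool {
    matches!(
        c,
        '\u{067E}'   // ARABIC LETTER PEH
        | '\u{0686}' // ARABIC LETTER TCHEH
        | '\u{0698}' // ARABIC LETTER JEH
        | '\u{06AF}' // ARABIC LETTER GAF
        | '\u{06A9}' // ARABIC LETTER KEHEH
        | '\u{06CC}' // ARABIC LETTER FARSI YEH
    )
}

/// Collects the Persian-specific letters found in `text`, in order of appearance.
fn persian_specific_chars(text: &str) -> Vec<char> {
    text.chars().filter(|c| is_persian_specific(*c)).collect()
}
```

Every letter in the list also satisfies `in_arabic_block`, which is why a detector that works at the level of Unicode blocks would report such text as Arabic script.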
Yes, I understood that. However, the technical approach of Charabia is a simplification of the real linguistic state of languages. For instance, the characters you listed before are considered Arabic by Charabia even if that's not exactly true. But considering Persian and Arabic as the same script is convenient if they share a lot of common characters.
For instance, Chinese and Japanese are completely different but share some characters, the Kanji. This forces Charabia to have a "virtual" script, Cj, containing both scripts, which avoids splitting a word in two because it contains different scripts.
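The idea of a "virtual" script can be sketched as follows. This is not Charabia's actual code: the `Script` enum, the code point ranges, and `split_by_script` are simplified assumptions made for illustration. The sketch shows that when several Unicode ranges map to one script value, a word mixing them stays in a single run.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Script {
    Arabic, // in this sketch, also covers the Persian letters of the Arabic block
    Cj,     // "virtual" script grouping Han, Hiragana and Katakana
    Latin,
    Other,
}

/// Maps a character to a (simplified, hypothetical) script.
fn script_of(c: char) -> Script {
    match c {
        '\u{0600}'..='\u{06FF}' => Script::Arabic,
        '\u{3040}'..='\u{309F}'     // Hiragana
        | '\u{30A0}'..='\u{30FF}'   // Katakana
        | '\u{4E00}'..='\u{9FFF}' => Script::Cj, // CJK Unified Ideographs
        'a'..='z' | 'A'..='Z' => Script::Latin,
        _ => Script::Other,
    }
}

/// Splits `text` into maximal runs of characters sharing the same script.
fn split_by_script(text: &str) -> Vec<(Script, String)> {
    let mut runs: Vec<(Script, String)> = Vec::new();
    for c in text.chars() {
        let script = script_of(c);
        if let Some((last, run)) = runs.last_mut() {
            if *last == script {
                // Same script as the previous character: extend the current run.
                run.push(c);
                continue;
            }
        }
        // Script changed (or first character): start a new run.
        runs.push((script, c.to_string()));
    }
    runs
}
```

Under these assumptions, a string mixing Kanji and Hiragana yields one `Cj` run, and a Persian word containing a letter such as U+06AF yields one `Arabic` run instead of being split.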
The real question on my side is, should we normalize some Persian characters that are used in Arabic Language that shouldn't be normalized if they were used in an Arabic context?
If yes, is Persian a language or a script? If no, normalizing your characters anyway should work.
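The two options in the question above can be sketched side by side. The mapping used here (Arabic yeh and kaf folded to their Persian forms) is only an assumption for illustration; it is not taken from the linked commits, and other normalizers fold in the opposite direction. `Lang`, `normalize_persian`, and `normalize_for` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lang {
    Arabic,
    Persian,
}

/// Folds one character to its Persian form (assumed mapping, for illustration).
fn normalize_char(c: char) -> char {
    match c {
        '\u{064A}' => '\u{06CC}', // ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
        '\u{0643}' => '\u{06A9}', // ARABIC LETTER KAF -> ARABIC LETTER KEHEH
        other => other,
    }
}

/// Option "no": apply the folding to any text in the Arabic script,
/// regardless of the detected language.
fn normalize_persian(text: &str) -> String {
    text.chars().map(normalize_char).collect()
}

/// Option "yes": apply the folding only when the text is known to be Persian,
/// leaving Arabic text untouched.
fn normalize_for(text: &str, lang: Lang) -> String {
    match lang {
        Lang::Persian => normalize_persian(text),
        Lang::Arabic => text.to_string(),
    }
}
```

The second variant needs language detection to tell Persian from Arabic, which is why the language-or-script question matters for the design.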
Yes, it's the Persian language. Even the segmentation is different.
I agree with this issue.
Arabic and Persian use many of the same letters, but they are quite different languages. They belong to different language families ( https://en.wikipedia.org/wiki/Language_family ) and their grammar is completely different.
Persian has grammar that is closer to European languages than Arabic.
As a Japanese person, I feel that the difference between Persian and Arabic is similar to the one between Japanese and Chinese.
Hello
Thank you for your continuous efforts in maintaining and improving Charabia. I’m writing to request support for the Persian language in your normalization and segmentation modules, similar to the existing support for Arabic.
Background
Persian (Farsi) is a widely spoken language, using the same script as Arabic with some additional letters. Although Persian shares many similarities with Arabic, there are important differences in orthography, morphology, and syntax that require distinct handling for proper text processing, especially in tasks like tokenization, normalization, and segmentation.
per, fa
Feature Request
I would like to request the addition of Persian language support for:
Normalization:
Segmentation:
References
To aid in this implementation, here are the links to the current normalization and segmentation implementations for Arabic, which can serve as a starting point for Persian:
Conclusion
Implementing Persian language support would greatly benefit users who need to process Persian text accurately. Persian is distinct enough from Arabic that this feature would significantly improve text processing capabilities for Persian-speaking users. I’m happy to contribute in any way I can to support this effort.