Closed Shreeshrii closed 7 years ago
https://github.com/tesseract-ocr/tesseract/issues/894
The rightmost column in the image has two-digit numbers, but most of the time only one digit seems to be recognized.
I've added them to my copy of desired_characters. I'll push them to github after testing. Anyone know which digits are needed for the other Arabic languages? kur_ara, pus, uig
Kurdish in Arabic script (kur) uses Arabic-Indic digits (١٢٣٤٥٦٧٨٩). Pashto (pus) uses either the same digits as Persian (۱۲۳۴۵۶۷۸۹) or Western Arabic digits (a.k.a. European, 123456789). Uighur (uig) uses European digits.
You can also check for yourself which digits a language uses: open your browser console and enter these lines, one at a time. The browser expects BCP 47 language tags, which are usually two-letter codes, not the three-letter codes Tesseract uses:
(123456.789).toLocaleString('ckb') // ١٢٣٬٤٥٦٫٧٨٩ (Arabic-Indic)
(123456.789).toLocaleString('ug') // 123,456.789
(123456.789).toLocaleString('ps') // Safari gives "۱۲۳٬۴۵۶٫۷۸۹" (Extended Arabic-Indic, as in Persian), but Chrome gives "123,456.789"
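To make the comparison above less dependent on reading the glyphs by eye, here is a minimal sketch that classifies the digits a browser returns by their Unicode range. The function names `digitSystem` and `localeDigits` are hypothetical helpers, not part of any library, and the result for each tag depends on the locale data shipped with the JavaScript engine, so no output is shown.

```javascript
// Classify the first digit found in a string by its Unicode range:
// U+0030-0039 European, U+0660-0669 Arabic-Indic,
// U+06F0-06F9 Extended Arabic-Indic (used by Persian and Urdu).
function digitSystem(text) {
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp >= 0x0660 && cp <= 0x0669) return 'Arabic-Indic';
    if (cp >= 0x06F0 && cp <= 0x06F9) return 'Extended Arabic-Indic';
    if (cp >= 0x0030 && cp <= 0x0039) return 'European';
  }
  return 'none';
}

// Hypothetical helper: report which digits this engine uses for a language tag.
// Tags the engine has no data for fall back to its default locale.
function localeDigits(tag) {
  return digitSystem((123456.789).toLocaleString(tag));
}

for (const tag of ['ckb', 'ug', 'ps', 'fa', 'ur']) {
  console.log(tag, localeDigits(tag));
}
```

Because Safari and Chrome already disagree on 'ps', it is probably best to treat the output as a hint and confirm it against real text in the language.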
Please note that Urdu text may use digits with the same Unicode code points as Persian but a different appearance (although European-style digits seem to be used more often with Urdu nowadays). Open this in your browser (Urdu appearance of the Extended Arabic-Indic digits):
data:text/html;charset=utf8,<div lang="ur" style="font-family: Arial; font-size: 400%">۱۲۳۴۵۶۷۸۹
and compare it with this (the default, Persian appearance of the Extended Arabic-Indic digits):
data:text/html;charset=utf8,<div style="font-family: Arial; font-size: 400%">۱۲۳۴۵۶۷۸۹
Same Unicode, but different appearance. OpenType, or more accurately a font that supports the OpenType language tag feature, handles this magic. Pango, which is used to create the training dataset for Tesseract, can handle this for you if the language code is passed correctly.
In Persian, zero to nine are listed correctly; also, "," is used for digit separation...
Thank you all for your helpful input.
+1 I've updated the desired_characters and the next training will use the correct digits. I'm implementing the same solution for vowels/points as Hebrew, so it should improve recognition of words with them. The difficulty is that Arabic seems a lot more complex than Hebrew because there are many languages that use different variants of the script with different characters, as well as the different display styles. I'm not sure about how that affects the use of point/vowels, or whether there are vowels that are unique to the different languages.
On Tue, Aug 8, 2017 at 8:27 PM, Shreeshrii notifications@github.com wrote:
Thank you all for your helpful input.
-- Ray.
@theraysmith 1- All Arabic-family characters are listed here. I checked the table; besides the digits, there are some other similar-looking characters that have different Unicode code points:
ۀ = \u06C0 ۂ = \u06C2 هٔ = \u0647 + \u0654
إ = \u0625 ٳ = \u0673
ٲ = \u0672 أ = \u0623 ٵ = \u0675
، = \u060C ٬ = \u066C ٫ = \u066B
064E 0659
ڼ = \u06BC ڹ = \u06B9
06EC 06E0 06F0 0660 06DF 06EB 06EA . = (dot)
0674 0655 0654 065F 0621
٭ =\u066D
You can check their Unicode code points here.
2- At http://collation-charts.org/icu442/ there is a list of many languages and their official characters, each listed separately (you can find Persian, Pashto, Arabic, ...).
3- Vowels (main vowels: [\u064B-\u0650\u0652\u0670]) have the same Unicode code points for all members of the Arabic family.
The Uyghur (Uighur) language uses the digits 0123456789.
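Since several of the characters listed above look nearly identical on screen, a small script can help tell them apart. The sketch below lists the code points of a string and tests for the main vowel marks using the range quoted in point 3. The names `codePoints`, `MAIN_VOWELS` and `hasMainVowel` are hypothetical and chosen only for this illustration.

```javascript
// List the code points of a string as U+XXXX labels.
function codePoints(text) {
  return Array.from(text, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

// Main vowel marks, using the character class quoted in this thread.
const MAIN_VOWELS = /[\u064B-\u0650\u0652\u0670]/;

// True if the text contains at least one of the main vowel marks.
function hasMainVowel(text) {
  return MAIN_VOWELS.test(text);
}

console.log(codePoints('\u06C0'));             // precomposed letter: [ 'U+06C0' ]
console.log(codePoints('\u0647\u0654'));       // base + combining mark: [ 'U+0647', 'U+0654' ]
console.log(codePoints('\u060C\u066C\u066B')); // the three comma-like separators
console.log(hasMainVowel('\u0628\u064E'));     // beh + fatha: true
```

Pasting a character from a training text into `codePoints` shows whether it is the precomposed form or a base letter plus a combining mark, which matters when deciding what belongs in desired_characters.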
This issue should be re-opened.
Add 0-9 and the Perso-Arabic variant ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹ for Persian, Urdu and Sindhi.
Please see https://github.com/tesseract-ocr/tesseract/issues/858
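As a rough illustration of the request above, the digit sets can be generated from the code point of each zero instead of being typed by hand, which avoids mixing up the look-alike Arabic-Indic and Extended Arabic-Indic digits. The per-language table is an assumption drawn from the comments in this thread, not the actual contents of langdata, and the names `digitRun`, `DIGITS` and `DIGITS_BY_LANG` are hypothetical.

```javascript
// Build a ten-digit string starting from the code point of its zero.
function digitRun(zero) {
  return Array.from({ length: 10 }, (_, i) => String.fromCodePoint(zero + i)).join('');
}

const DIGITS = {
  european: digitRun(0x0030),            // 0123456789
  arabicIndic: digitRun(0x0660),         // U+0660..U+0669
  extendedArabicIndic: digitRun(0x06F0), // U+06F0..U+06F9, the Perso-Arabic variant
};

// Assumed selection per Tesseract language code, based only on this thread.
const DIGITS_BY_LANG = {
  fas: DIGITS.european + DIGITS.extendedArabicIndic, // Persian
  urd: DIGITS.european + DIGITS.extendedArabicIndic, // Urdu
  snd: DIGITS.european + DIGITS.extendedArabicIndic, // Sindhi
  uig: DIGITS.european,                              // Uighur
};

for (const [lang, digits] of Object.entries(DIGITS_BY_LANG)) {
  console.log(lang, Array.from(digits).join(' '));
}
```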