tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Add Extended Arabic-Indic Digits to Persian, Urdu and Sindhi #72

Closed Shreeshrii closed 7 years ago

Shreeshrii commented 7 years ago

Add 0-9 and

Perso-Arabic variant ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹

for Persian, Urdu and Sindhi

Please see https://github.com/tesseract-ocr/tesseract/issues/858

Shreeshrii commented 7 years ago

https://github.com/tesseract-ocr/tesseract/issues/894

The rightmost column in the image has two-digit numbers, but most of the time only one digit seems to be recognized.

theraysmith commented 7 years ago

I've added them to my copy of desired_characters. I'll push them to github after testing. Anyone know which digits are needed for the other Arabic languages? kur_ara, pus, uig

reza1615 commented 7 years ago

@theraysmith https://en.wikipedia.org/wiki/Modern_Arabic_mathematical_notation#Variations and https://en.wikipedia.org/wiki/Eastern_Arabic_numerals#Numerals

ebraminio commented 7 years ago

Kurdish with Arabic script (kur) uses Arabic-Indic digits (١٢٣٤٥٦٧٨٩); Pashto (pus) uses either the same digits as Persian (۱۲۳۴۵۶۷۸۹) or Western Arabic digits (a.k.a. European, 123456789); and Uighur (uig) uses European digits.

You can also check for yourself which digits a language uses: open your browser console and enter the following lines one at a time (this takes browser locale codes, usually two letters, rather than the three-letter codes Tesseract uses):

(123456.789).toLocaleString('ckb') // ١٢٣٬٤٥٦٫٧٨٩ (Arabic-Indic)
(123456.789).toLocaleString('ug') // 123,456.789
(123456.789).toLocaleString('ps') // Interesting that Safari gives "۱۲۳٬۴۵۶٫۷۸۹" (Extended Arabic-Indic similar to Persian) but Chrome "123,456.789"
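
To make the comparison above less dependent on eyeballing similar-looking glyphs, the formatted output can be classified by code point range. This is only a minimal sketch: `digitSets` is a hypothetical helper name, not part of any browser or Tesseract API, and the locale tags in the loop are just examples. What `toLocaleString` returns depends on the locale data shipped with the browser or runtime, so no expected output is given for that part.

```javascript
// Hypothetical helper: report which decimal digit sets appear in a string.
function digitSets(text) {
  const found = new Set();
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp >= 0x0030 && cp <= 0x0039) {
      found.add("european"); // 0-9
    } else if (cp >= 0x0660 && cp <= 0x0669) {
      found.add("arabic-indic"); // U+0660..U+0669
    } else if (cp >= 0x06f0 && cp <= 0x06f9) {
      found.add("extended-arabic-indic"); // U+06F0..U+06F9
    }
  }
  return [...found].sort();
}

// Results vary between browsers and runtimes (as noted for 'ps' above).
for (const tag of ["ckb", "ug", "ps", "fa", "ur"]) {
  const formatted = (123456.789).toLocaleString(tag);
  console.log(tag, formatted, digitSets(formatted));
}
```

A string that mixes digit sets returns more than one label, which may help spot training text that is inconsistent.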

Please note that Urdu text may use digits with the same Unicode code points as Persian but with a different appearance (although European-style digits seem to be used more often with Urdu nowadays). Open this in your browser (Urdu appearance of the Extended Arabic-Indic digits):

data:text/html;charset=utf8,<div lang="ur" style="font-family: Arial; font-size: 400%">۱۲۳۴۵۶۷۸۹

and compare it with this (the default, Persian appearance of the Extended Arabic-Indic digits):

data:text/html;charset=utf8,<div style="font-family: Arial; font-size: 400%">۱۲۳۴۵۶۷۸۹

Same Unicode, but different appearance. OpenType, or more precisely a font that supports the OpenType language tag feature, handles this magic. Pango, which you use to create the training dataset for Tesseract, can handle this for you if the language code is passed correctly.
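
To confirm that the two data URLs above carry the same encoded text and differ only in the `lang` attribute, the code points can be printed in the browser console. This is a small sketch; `codePoints` is a hypothetical helper name, and the digits are written as escapes so the listing does not depend on how the glyphs render.

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX labels.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Extended Arabic-Indic digits one to nine, as used in both data URLs.
const digits = "\u06F1\u06F2\u06F3\u06F4\u06F5\u06F6\u06F7\u06F8\u06F9";
console.log(codePoints(digits).join(" "));
```

The list is identical whichever language tag is applied, because the glyph choice is made by the font and shaping engine, not by the text itself.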

roozgar commented 7 years ago

In Persian, zero to nine is listed correctly; also, "," is used for digit separation...

Shreeshrii commented 7 years ago

Thank you all for your helpful input.

theraysmith commented 7 years ago

+1 I've updated the desired_characters and the next training will use the correct digits. I'm implementing the same solution for vowels/points as Hebrew, so it should improve recognition of words with them. The difficulty is that Arabic seems a lot more complex than Hebrew because there are many languages that use different variants of the script with different characters, as well as the different display styles. I'm not sure about how that affects the use of point/vowels, or whether there are vowels that are unique to the different languages.


reza1615 commented 7 years ago

@theraysmith 1- All Arabic-family characters are listed here. I checked the table; besides the numbers, there are some other similar-looking characters that have different Unicode code points:

ۀ = \u06C0 ۂ =\u06C2 هٔ = \u0647 + \u0654

إ =\u0625 ٳ =\u0673

ٲ =\u0672 أ =\u0623 ٵ =\u0675

، =\u060C ٬ =\u066C ٫ =\u066B

064E 0659

ڼ =\u06BC ڹ=\u06B9

06EC 06E0 06F0 0660 06DF 06EB 06EA . = (dot)

0674 0655 0654 065F 0621

٭ =\u066D

You can check their Unicode here.

2- At http://collation-charts.org/icu442/ there is a list of many languages and their official characters (you can find Persian, Pashto, Arabic, ... separately).

3- The vowels (main vowels: [\u064B-\u0650\u0652\u0670]) have the same Unicode code points for all members of the Arabic family.
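
The vowel character class quoted above can be used directly as a JavaScript regular expression, for example to count or strip vowel marks when comparing pointed and unpointed text. This is a minimal sketch: the class is copied from the comment as given, the function names are hypothetical, and whether the class covers every mark relevant to a particular language is not confirmed here.

```javascript
// Character class copied from the comment above (main vowel marks).
const MAIN_VOWELS = /[\u064B-\u0650\u0652\u0670]/g;

// Hypothetical helpers for inspecting pointed text.
function countVowelMarks(text) {
  return (text.match(MAIN_VOWELS) || []).length;
}

function stripVowelMarks(text) {
  return text.replace(MAIN_VOWELS, "");
}

// beh + fatha (U+064E) + shadda (U+0651)
const sample = "\u0628\u064E\u0651";
console.log(countVowelMarks(sample)); // 1
console.log(stripVowelMarks(sample) === "\u0628\u0651"); // true
```

Shadda (U+0651) is left in place because the quoted class skips it.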

gheyret commented 7 years ago

The Uyghur (Uighur) language uses the digits 0123456789.

amitdo commented 3 years ago

This issue should be re-opened.