tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Arabic-Indic numerals #858

Closed ibr123 closed 7 years ago

ibr123 commented 7 years ago

Hi,

I'm using Tesseract 4.00alpha with Leptonica 1.74.1 on Ubuntu 14 to create LSTM files for multiple Arabic fonts. Some of them use the common numerals (1 2 3 4 ...), but others use a different set of numerals that is more common in Arabic script ( ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩). This second set is not recognized as numbers but as symbols, for example ! instead of ١. Are these numerals not included in Tesseract? Thanks

Shreeshrii commented 7 years ago

Ref:

The Arabic numeral glyphs 0–9 are encoded in ASCII and Unicode at positions 0x30 to 0x39, matching up with the second hexadecimal digit for convenience (for example, the digit 5 is at 0x35).

The Eastern Arabic numerals (also called Arabic–Indic numerals and Arabic Eastern numerals) are the symbols used to represent the Hindu–Arabic numeral system, in conjunction with the Arabic alphabet.

Each numeral in the Persian variant has a different Unicode point even if it looks identical to the Eastern Arabic numeral counterpart. However the variants used with Urdu, Sindhi, and other South Asian languages are not encoded separately from the Persian variants.

See U+0660 through U+0669 and U+06F0 through U+06F9.

So, basically, there are three Unicode ranges with numerals used in Arabic, Persian, etc.: 0x30 to 0x39, U+0660 to U+0669, and U+06F0 to U+06F9.
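As a rough illustration of these three ranges, the Python sketch below labels a character by the range it falls in. The function name `digit_range` and the labels are made up for this example; nothing here is part of Tesseract.

```python
# Hypothetical helper: report which of the three digit ranges a character is in.
RANGES = {
    "ascii": (0x0030, 0x0039),                  # 0 to 9
    "arabic-indic": (0x0660, 0x0669),           # Eastern Arabic digits
    "extended-arabic-indic": (0x06F0, 0x06F9),  # Persian/Urdu digits
}

def digit_range(ch):
    """Return the label of the range containing ch, or None if it is in none."""
    cp = ord(ch)
    for label, (lo, hi) in RANGES.items():
        if lo <= cp <= hi:
            return label
    return None

if __name__ == "__main__":
    # The digit "one" from each of the three ranges.
    for ch in "1\u0661\u06F1":
        print(f"U+{ord(ch):04X}", digit_range(ch))
```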

If the fonts map the Eastern Arabic numeral glyphs (U+0660 through U+0669) to the Arabic numerals range of 0x30 to 0x39, that would cause confusion during training.

https://github.com/tesseract-ocr/langdata/blob/master/ara/ara.training_text has the 'Arabic numerals' range of 0x30 to 0x39. You can check whether it has ( ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩) and add them if you want to include them in training.
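A quick way to do that check is sketched below, assuming you have a local UTF-8 copy of the training text. The function names and the idea of passing a local path are assumptions for illustration only.

```python
# The ten Eastern Arabic (Arabic-Indic) digits, U+0660 through U+0669.
ARABIC_INDIC_DIGITS = [chr(cp) for cp in range(0x0660, 0x066A)]

def missing_arabic_indic_digits(text):
    """Return the Arabic-Indic digits that never occur in text."""
    present = set(text)
    return [d for d in ARABIC_INDIC_DIGITS if d not in present]

def check_file(path):
    """Print the code points of Arabic-Indic digits absent from a UTF-8 file."""
    with open(path, encoding="utf-8") as f:
        missing = missing_arabic_indic_digits(f.read())
    print("missing:", " ".join(f"U+{ord(d):04X}" for d in missing) or "none")
```

For example, `check_file("ara.training_text")` would list the digits that still need to be added, where the path is whatever local copy you downloaded.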

amitdo commented 7 years ago

If those numerals are indeed missing from the official traineddata, I suggest opening a new issue in the langdata repo.

aboelmor commented 7 years ago

Did anyone fix this problem? I am not on Unix, so I cannot train Tesseract on new data, but I need to use the Eastern Arabic numerals. If someone has fixed it and has the traineddata file, please share it with us.

Thanks

reza1615 commented 7 years ago

The shapes of Persian numbers are mostly the same as the Arabic ones, but their Unicode code points are different!

Persian numbers = ۹ ۸ ۷ ۶ ۵ ۴ ۳ ۲ ۱ ۰
Arabic numbers = ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
Persian numbers' Unicode = \u06F9 \u06F8 \u06F7 \u06F6 \u06F5 \u06F4 \u06F3 \u06F2 \u06F1 \u06F0
Arabic numbers' Unicode = \u0660 \u0661 \u0662 \u0663 \u0664 \u0665 \u0666 \u0667 \u0668 \u0669

You can check them here.

Shreeshrii commented 7 years ago

@reza1615

Are these being recognized by the best traineddata? Are they being recognized as Arabic Unicode numbers?

reza1615 commented 7 years ago

Yes, it mixed Persian and Arabic numbers (Unicode). For example, the image had the numbers ۱-۲, and it recognized ۱ as a Persian number and ۲ as an Arabic number. Their shapes are the same, but for searching and in Unicode they are different. On the other hand, the shapes of 3, 4, 5 and 6 are not the same; see below:

6 5 4 3
۶ ۵ ۴ ۳ > Persian
٣ ٤ ٥ ٦ > Arabic

You can check it here with the output txt file.

reza1615 commented 7 years ago

For more information, see the Unicode 'Number, Decimal Digit' category.
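As a small illustration of that category, Python's standard `unicodedata` module reports all three kinds of digit as `Nd` with the same numeric value, even though their code points and names differ. The helper name `describe` is made up for this sketch.

```python
import unicodedata

def describe(ch):
    """Return (code point, general category, digit value, Unicode name) for ch."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),  # "Nd" means Number, Decimal Digit
        unicodedata.digit(ch),     # numeric value of the digit
        unicodedata.name(ch),
    )

if __name__ == "__main__":
    # ASCII three, Arabic-Indic three, Extended Arabic-Indic (Persian) three.
    for ch in "3\u0663\u06F3":
        print(describe(ch))
```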

Shreeshrii commented 7 years ago

@theraysmith Please update the desired characters for Persian (fas) to use the Persian Unicode range of numbers and ignore the Arabic Unicode number range, as mentioned above. Thanks!

reza1615 commented 7 years ago

People often use a non-standard keyboard (an Arabic keyboard for typing Persian text), so there are many scanned images of Persian text that contain Arabic numbers like ٣ ٤ ٥ ٦, but the OCR should convert them to Persian Unicode.
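Until the traineddata does this, one possible workaround is to normalize the OCR output afterwards. The sketch below is a post-processing step on the recognized text, not something Tesseract does itself, and the names `ARABIC_TO_PERSIAN` and `to_persian_digits` are made up for this example.

```python
# Map each Arabic-Indic digit (U+0660..U+0669) to the Persian digit
# with the same value (U+06F0..U+06F9).
ARABIC_TO_PERSIAN = {0x0660 + i: 0x06F0 + i for i in range(10)}

def to_persian_digits(text):
    """Replace Arabic-Indic digits with Persian digits; leave everything else as is."""
    return text.translate(ARABIC_TO_PERSIAN)

if __name__ == "__main__":
    mixed = "\u06F1-\u0662"  # Persian one, hyphen, Arabic two
    fixed = to_persian_digits(mixed)
    print([f"U+{ord(c):04X}" for c in fixed])
```

ASCII digits and characters that are already Persian digits pass through unchanged, so the function can be applied to the whole output text.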

Shreeshrii commented 7 years ago

Question from Ray in tesseract-ocr/langdata#72

Anyone know which digits are needed for the other Arabic languages? kur_ara, pus, uig

amitdo commented 7 years ago

@zdenop, please close this issue.

The issue is related to the trained data, not the code.

As said, the right place for this issue is the langdata repo. See https://github.com/tesseract-ocr/langdata/issues/71, https://github.com/tesseract-ocr/langdata/issues/72