Closed ibr123 closed 7 years ago
Ref:
The Arabic numeral glyphs 0–9 are encoded in ASCII and Unicode at positions 0x30 to 0x39, so the second hexadecimal digit of each code matches the numeral it represents.
The Eastern Arabic numerals (also called Arabic–Indic numerals and Arabic Eastern numerals) are the symbols used to represent the Hindu–Arabic numeral system, in conjunction with the Arabic alphabet.
Each numeral in the Persian variant has a different Unicode point even if it looks identical to the Eastern Arabic numeral counterpart. However the variants used with Urdu, Sindhi, and other South Asian languages are not encoded separately from the Persian variants.
See U+0660 through U+0669 and U+06F0 through U+06F9.
So, basically, there are three Unicode ranges with numerals used in Arabic, Persian, etc.: 0x30 through 0x39, U+0660 through U+0669, and U+06F0 through U+06F9.
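To make the three ranges concrete, here is a minimal Python sketch that lists each code point with its Unicode name, general category, and digit value, using only the standard `unicodedata` module. The `RANGES` labels and the `describe` helper are names chosen for this illustration, not anything from tesseract or langdata.

```python
import unicodedata

# The three digit ranges discussed above (labels are illustrative).
RANGES = {
    "ASCII digits": range(0x30, 0x3A),
    "Arabic-Indic digits": range(0x0660, 0x066A),
    "Extended Arabic-Indic digits": range(0x06F0, 0x06FA),
}

def describe(cp):
    """Return (code point, Unicode name, general category, digit value)."""
    ch = chr(cp)
    return (
        f"U+{cp:04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),  # "Nd" = Number, Decimal Digit
        unicodedata.digit(ch),
    )

for label, code_points in RANGES.items():
    print(label)
    for cp in code_points:
        print("  ", *describe(cp))
```

All thirty characters are in category `Nd`, so they carry the same numeric values; only the code points (and, for some digits, the glyph shapes) differ.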
If the fonts are putting the Eastern Arabic numerals (U+0660 through U+0669) in the Arabic numerals range of 0x30 to 0x39, that would cause confusion during training.
https://github.com/tesseract-ocr/langdata/blob/master/ara/ara.training_text has the 'Arabic numerals' range of 0x30 to 0x39. You can check whether it has the Eastern Arabic numerals ( ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩) and add them if you want to include them in training.
If those numerals are indeed missing from the official traineddata, I suggest opening a new issue in the langdata repo.
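One way to run that check is to count how many digits from each range a text contains. The sketch below is only an illustration: `digit_range` and `count_digit_ranges` are hypothetical helper names, and the sample string stands in for the training text, which you would need to download and read yourself.

```python
from collections import Counter

def digit_range(ch):
    """Return a label for the digit range of ch, or None if it is not one of the three."""
    cp = ord(ch)
    if 0x30 <= cp <= 0x39:
        return "ascii"
    if 0x0660 <= cp <= 0x0669:
        return "arabic_indic"
    if 0x06F0 <= cp <= 0x06F9:
        return "extended_arabic_indic"
    return None

def count_digit_ranges(text):
    """Count the digits in text, grouped by Unicode range."""
    counts = Counter()
    for ch in text:
        label = digit_range(ch)
        if label is not None:
            counts[label] += 1
    return counts

# Stand-in sample; for a real check, read a local copy of the file instead, e.g.
#   text = open("ara.training_text", encoding="utf-8").read()
sample = "12 \u0663\u0664 \u06F5"
print(count_digit_ranges(sample))
```

If the `arabic_indic` count for the training text is zero, the Eastern Arabic numerals are absent from it.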
Did anyone fix this problem? I am not using Unix, so I cannot train tesseract on new data, but I need to use the Eastern Arabic numerals. If someone has fixed it and has the traineddata file, please share it with us.
Thanks
The shapes of the Persian numbers are mostly the same as the Arabic ones, but their Unicode code points are different!
Persian numbers = ۹ ۸ ۷ ۶ ۵ ۴ ۳ ۲ ۱ ۰
Arabic numbers = ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
Persian numbers' Unicode = \u06F9 \u06F8 \u06F7 \u06F6 \u06F5 \u06F4 \u06F3 \u06F2 \u06F1 \u06F0
Arabic numbers' Unicode = \u0660 \u0661 \u0662 \u0663 \u0664 \u0665 \u0666 \u0667 \u0668 \u0669
You can check them here
@reza1615
Are these getting recognized in the best traineddata? Are they being recognized as Arabic Unicode numbers?
Yes, it mixed Persian and Arabic numbers (Unicode). For example, the image had the numbers ۱-۲, and it recognized ۱ as a Persian number and ۲ as an Arabic number. Their shapes are the same, but for searching and in Unicode they are different. On the other hand, the shapes of 3, 4, 5, and 6 are not the same; see below:
6 5 4 3
۶ ۵ ۴ ۳ > Persian
٣ ٤ ٥ ٦ > Arabic
You can check it here with the output txt file
For more information, see the Unicode 'Number, Decimal Digit' category.
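The effect of such mixing on search can be shown with a small Python sketch. The `ocr_output` string below is a made-up example of mixed output, not actual tesseract output, and `has_mixed_digits` is a hypothetical helper.

```python
import re

ARABIC_INDIC = re.compile("[\u0660-\u0669]")  # Arabic digits
EXTENDED = re.compile("[\u06F0-\u06F9]")      # Persian digits

def has_mixed_digits(text):
    """True if text contains digits from both the Arabic and the Persian range."""
    return bool(ARABIC_INDIC.search(text)) and bool(EXTENDED.search(text))

expected = "\u06F1-\u06F2"    # Persian 1, Persian 2
ocr_output = "\u06F1-\u0662"  # Persian 1, Arabic 2 (hypothetical mixed result)

print(expected == ocr_output)                  # the strings differ, so a search fails
print(int("\u06F2") == int("\u0662"))          # yet both digits have the value 2
print(has_mixed_digits(ocr_output))
```

A check like this could be run over OCR output files to flag lines where the two ranges were mixed.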
@theraysmith Please update the desired characters for Persian (fas) to use the Persian Unicode range of numbers and ignore the Arabic Unicode number range, as mentioned above. Thanks!
People often use a non-standard keyboard (an Arabic keyboard for typing Persian text), so there are many scanned images of Persian text that contain Arabic numbers like ٣ ٤ ٥ ٦, but the OCR should convert them to Persian Unicode.
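Until the traineddata handles this, the conversion can also be done as a post-processing step on the OCR output. The following is a minimal sketch, assuming plain Python strings; `to_persian_digits` is a hypothetical helper, not part of tesseract.

```python
# Map each Arabic digit (U+0660..U+0669) to the Persian digit
# with the same value (U+06F0..U+06F9).
ARABIC_TO_PERSIAN = {0x0660 + i: 0x06F0 + i for i in range(10)}

def to_persian_digits(text):
    """Replace Arabic digits with Persian digits; leave everything else unchanged."""
    return text.translate(ARABIC_TO_PERSIAN)

mixed = "\u06F1-\u0662"  # Persian 1, Arabic 2
print(to_persian_digits(mixed) == "\u06F1-\u06F2")
```

ASCII digits are deliberately left alone here; whether they should also be mapped depends on the text being processed.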
Question from Ray in tesseract-ocr/langdata#72
Does anyone know which digits are needed for the other Arabic-script languages? kur_ara, pus, uig
@zdenop, please close this issue.
The issue is related to the trained data, not the code.
As said, the right place for this issue is the langdata repo. See https://github.com/tesseract-ocr/langdata/issues/71, https://github.com/tesseract-ocr/langdata/issues/72
Hi,
I'm using tesseract 4.00alpha with leptonica 1.74.1 on Ubuntu 14 to create LSTM files for multiple Arabic fonts. Some of them use the common numerical system (1 2 3 4 ...), but some of these fonts contain a different numerical system that is more common in Arabic scripts: ( ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩). The latter set of numbers was not recognized as digits but as symbols, such as ! instead of ١. Are these numbers not integrated in tesseract? Thanks