kcobra / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Indic - include 0-9 numbers and general punctuation #1358

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

Suggested for 3.04

Please ensure that all Indic languages have support for numbers 0-9 in addition 
to numerals in their script.

Also ensure support for general punctuation such as 
, . ? ! - _ : ; " '  etc
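
One way to check this request is to compare a language's character list against the required ASCII set. The following is only a minimal sketch and not part of Tesseract; the function name `missing_ascii` and the sample Devanagari character list are hypothetical, chosen for illustration.

```python
import string

# Punctuation listed in this issue (the "etc" is left open-ended).
REQUIRED_PUNCTUATION = ",.?!-_:;\"'"
REQUIRED = set(string.digits) | set(REQUIRED_PUNCTUATION)


def missing_ascii(chars):
    """Return the required ASCII digits/punctuation absent from chars, sorted."""
    return sorted(REQUIRED - set(chars))


# Hypothetical character list: Devanagari digits plus a little punctuation.
sample = "०१२३४५६७८९" + ",.?"
print(missing_ascii(sample))
```

An empty list would mean the character set already covers everything named above.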

Original issue reported on code.google.com by shreeshrii on 30 Oct 2014 at 8:10

GoogleCodeExporter commented 9 years ago
Done for all except Sinhala. It doesn't seem to have its own digits, apart from 
sets of archaic digits that were only just introduced into Unicode.
Do you know if it now uses digits from any of the other scripts, or just ASCII?

Original comment by theraysm...@gmail.com on 4 Nov 2014 at 11:26

GoogleCodeExporter commented 9 years ago
You are right regarding Sinhala. It currently seems to use only the ASCII 
digits.

Even though a set of digits was defined as part of 
http://www.unicode.org/charts/PDF/U0D80.pdf, they do NOT seem to be used. 
The numbers in http://www.unicode.org/charts/PDF/U111E0.pdf were defined in 
Unicode 7, but there is no font support yet.

More info at http://en.wikipedia.org/wiki/Sinhala_numerals
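
For anyone who wants to inspect those digits, here is a rough sketch using Python's standard `unicodedata` module. It assumes a Python build whose Unicode database is version 7.0 or later (Python 3.5+); the helper `to_ascii_digits` is a hypothetical name and is not something Tesseract provides.

```python
import unicodedata

# Sinhala Lith digits, U+0DE6..U+0DEF (in the Sinhala block, U0D80 chart).
SINHALA_LITH_DIGITS = [chr(cp) for cp in range(0x0DE6, 0x0DF0)]


def to_ascii_digits(text):
    """Replace any Unicode decimal digit in text with its ASCII equivalent."""
    return "".join(
        str(unicodedata.digit(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )


# Show each code point, its Unicode name, and its numeric value.
for ch in SINHALA_LITH_DIGITS:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {unicodedata.digit(ch)}")
```

Since current Sinhala text appears to use ASCII digits, a mapping like this would matter only for material that actually contains the Lith digits.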

Original comment by shreeshrii on 5 Nov 2014 at 2:38

GoogleCodeExporter commented 9 years ago
This seems to be fixed in the langdata repo on GitHub, except for Sinhala. If 
there's still an issue, please file it on GitHub. (Also, as I've synced with 
your version of the langdata repo, if you mention which branch to look at, I'll 
merge it into a pull request.)

Original comment by joregan on 14 May 2015 at 12:23