itwood / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Tamil - support classical orthography #1359

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Suggestion for 3.04

"The Tamil script has a classical orthography in which the vowel sign -aa ligates with the consonants ண NNA ன NNNA ற RRA and the vowel sign -ai ligates with the consonants ண NNA ன NNNA ல LA ள LLA.

Thus ணா னா றா (also seen in the right part of ணொ னொ றொ ணோ னோ றோ) and ணை னை லை ளை would be presented differently."
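
For anyone preparing training text, here is a minimal sketch that enumerates the consonant plus vowel-sign sequences named in the quote above, so they can be checked for coverage when rendered in both fonts. The function name `affected_syllables` and the constants are illustrative only and are not part of Tesseract or its training tools. The sketch assumes the -o and -oo forms are encoded with the single code points U+0BCA and U+0BCB.

```python
import unicodedata

# Consonants listed in the quoted description (Unicode Tamil block).
AA_CONSONANTS = "\u0BA3\u0BA9\u0BB1"        # NNA, NNNA, RRA
AI_CONSONANTS = "\u0BA3\u0BA9\u0BB2\u0BB3"  # NNA, NNNA, LA, LLA

# Vowel signs whose rendering includes the -aa shape: AA, O, OO.
AA_SIGNS = "\u0BBE\u0BCA\u0BCB"
AI_SIGN = "\u0BC8"                          # vowel sign AI


def affected_syllables():
    """Return the syllables that the classical orthography draws differently."""
    aa_forms = [c + s for s in AA_SIGNS for c in AA_CONSONANTS]
    ai_forms = [c + AI_SIGN for c in AI_CONSONANTS]
    return aa_forms + ai_forms


# Print each syllable with the Unicode names of its code points.
for syllable in affected_syllables():
    names = " + ".join(unicodedata.name(ch) for ch in syllable)
    print(syllable, "=", names)
```

The code points are the same in both orthographies; only the glyphs differ between fonts, which is why the request is about adding a font to the training set rather than adding characters.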

ref: https://bugzilla.redhat.com/show_bug.cgi?id=795327

Please include the 'Lohit Tamil Classical' font, in addition to the 'Lohit Tamil' font, in the training set.

https://fedorahosted.org/releases/l/o/lohit/lohit-tamil-classical-ttf-2.5.3.tar.gz

https://fedorahosted.org/releases/l/o/lohit/lohit-tamil-ttf-2.5.3.tar.gz

Original issue reported on code.google.com by shreeshrii on 30 Oct 2014 at 8:16

GoogleCodeExporter commented 9 years ago
Also see
https://groups.google.com/forum/#!topic/mintamil/ff_DwxkAKGw
and
https://docs.google.com/viewer?a=v&pid=forums&srcid=MDQ3MTE4NTA5MTY2NjkwMTk0NTQBMDc5MTgzMzQ2NDcxNDE4MTAyMTABVmlyMVgtTThTX1VKATAuMwEBdjI&authuser=0

about 'Training Tesseract OCR; Lohit. e-Tamil OTC and code2000 fonts issue; kazhagam wordlist'

Original comment by shreeshrii on 30 Oct 2014 at 8:21