Errors in Hindi OCR

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Run tesseract-ocr for Hindi on the attached files.
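
For reference, the step above corresponds roughly to a plain Tesseract command-line run. The sketch below only builds that command; it is not the exact invocation VietOCR.net performs. The file names are placeholders, and `build_tesseract_command` is a hypothetical helper name. Passing more than one language code assumes the matching traineddata files are installed.

```python
import shlex


def build_tesseract_command(image_path, output_base, languages=("hin",)):
    """Return the argument list for a Tesseract command-line run."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Several language codes are joined with '+', for example "hin+eng".
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


if __name__ == "__main__":
    # "page1.tif" and "page1" are hypothetical file names.
    cmd = build_tesseract_command("page1.tif", "page1")
    print(" ".join(shlex.quote(part) for part in cmd))
    # To run it, with tesseract on PATH:
    #   import subprocess; subprocess.run(cmd, check=True)
```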

What is the expected output? What do you see instead?

The expected output is a correct OCR of the text. The actual output is not 
fully correct. The expected output is enclosed for a couple of the files.

What version of the product are you using? On what operating system?

tesseract-ocr-setup-3.02.02.exe
on Windows 7 

Please provide any additional information below.

I used VietOCR.net 3.4 as the front-end GUI.

The input text is in mixed languages, Hindi and in some cases Sanskrit, both in 
Devanagari script.

Let me know whether there is a way to extend the dictionary with additional 
terms, and whether we can help improve the trained data in any way.

Original issue reported on code.google.com by shreeshrii on 15 Mar 2013 at 8:29

GoogleCodeExporter commented 9 years ago

http://research.ijcaonline.org/volume39/number6/pxc3877076.pdf

Shirorekha Chopping Integrated Tesseract OCR 
Engine for Enhanced Hindi Language Recognition

Is this approach already included in version 3.02?

Original comment by shreeshrii on 19 Mar 2013 at 4:25

GoogleCodeExporter commented 9 years ago

http://eutypon.gr/eutypon/pdf/e2012-29/e29-a01.pdf
Training Tesseract for Ancient Greek OCR

Useful information for training.

Original comment by shreeshrii on 20 Mar 2013 at 6:12

GoogleCodeExporter commented 9 years ago

OCR results are better when using text with words rather than just 
letter-combinations. See attached image file and corrected box file. 

Here are some of the errors and changes that had to be made in the box file:

जा 835 3147 869 3170 0 - आ 
ध 1851 3147 1871 3170 0 - घ
या 215 3059 242 3082 0 - षा
जा 1584 2971 1617 2994 0 - आ 
फा 2063 2971 2095 2994 0 - फ़ा
ज 2148 2971 2171 2994 0 - ज़
श्या 1156 2907 1174 2917 0 - maatraa from next line - deleted
५ 1244 2907 1257 2916 0 - maatraa from next line - deleted
पु 100 2875 133 2906 0 - सु - box size will require change
त 132 2883 148 2906 0 - ख -  box size will require change
दु 158 2875 183 2906 0 - दुः - box size will change
र 186 2883 200 2906 0
व 196 2883 212 2906 0 - ख - combine with line above
...
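
Each box file entry has the form `glyph left bottom right top page`; the text after the ` - ` in the list above is my annotation, not part of the file. As a rough sketch of the "combine with line above" correction, the snippet below parses two entries and merges them into one box that covers both. The helper names (`Box`, `parse_box_line`, `merge_boxes`, `format_box`) are made up for illustration and are not part of Tesseract.

```python
from typing import NamedTuple


class Box(NamedTuple):
    glyph: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one 'glyph left bottom right top page' entry."""
    # Split from the right so the glyph field is kept intact.
    glyph, left, bottom, right, top, page = line.strip().rsplit(None, 5)
    return Box(glyph, int(left), int(bottom), int(right), int(top), int(page))


def merge_boxes(a: Box, b: Box, glyph: str) -> Box:
    """Return a box labelled `glyph` that covers both a and b."""
    return Box(
        glyph,
        min(a.left, b.left),
        min(a.bottom, b.bottom),
        max(a.right, b.right),
        max(a.top, b.top),
        a.page,
    )


def format_box(box: Box) -> str:
    """Format a box back into a box file entry."""
    return f"{box.glyph} {box.left} {box.bottom} {box.right} {box.top} {box.page}"


if __name__ == "__main__":
    first = parse_box_line("र 186 2883 200 2906 0")
    second = parse_box_line("व 196 2883 212 2906 0")
    print(format_box(merge_boxes(first, second, "ख")))
    # prints: ख 186 2883 212 2906 0
```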

Original comment by shreeshrii on 26 Mar 2013 at 6:13

GoogleCodeExporter commented 9 years ago

I tried training for Hindi using the sanskrit2003 font. However, when using the 
generated hin.traineddata, tesseract crashes with a cube error.

The box/tiff pair files are attached.

The training files are too big to attach here.

Original comment by shreeshrii on 5 Apr 2013 at 1:21

Attachments:

hin.sanskrit2003.boxtif.zip

GoogleCodeExporter commented 9 years ago

I tried to use the Lohit font box/tif pairs provided in the parichit project 
for Hindi.

The files had to be renamed to hin.lohit.exp0.tif and hin.lohit.exp0.box 
(instead of hin.lohit.tif and hin.lohit.box); otherwise there was a font-id 
error related to the font_properties file.
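
The rename follows the `lang.fontname.expN.ext` pattern that the training tools expect. A minimal sketch of doing it in bulk is below; `training_name` and `rename_pairs` are hypothetical helper names, and the code assumes the original names have exactly the form `lang.fontname.ext`. Names in any other form are left untouched.

```python
from pathlib import Path


def training_name(filename: str, exp_num: int = 0) -> str:
    """Map 'hin.lohit.tif' to 'hin.lohit.exp0.tif'; leave other names unchanged."""
    parts = filename.split(".")
    if len(parts) != 3:
        # Already has an exp field, or does not follow lang.fontname.ext.
        return filename
    lang, font, ext = parts
    return f"{lang}.{font}.exp{exp_num}.{ext}"


def rename_pairs(directory: str, exp_num: int = 0) -> list:
    """Rename the .tif and .box files in `directory`; return the new names."""
    renamed = []
    for path in sorted(Path(directory).iterdir()):
        if path.suffix not in {".tif", ".box"}:
            continue
        new_name = training_name(path.name, exp_num)
        if new_name != path.name:
            path.rename(path.with_name(new_name))
            renamed.append(new_name)
    return renamed
```

For example, `training_name("hin.lohit.box")` gives `hin.lohit.exp0.box`.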

Once that hurdle was passed, the files failed with the following error during 
shapeclustering.

4272: Distance = 0.024631: Distance = 0.024896: Stopped with 88 merged, min 
dist 0.025000
Master shape_table:Number of shapes = 2083 max unichars = 11 number with 
multiple unichars
Read shape table shapetable of 2083 shapes
Reading traindata\san.lohit.exp000.tr ...
Clustering error: Matrix inverse failed with error 1.80869
Clustering error: Matrix inverse failed with error 3.86638
Done!
Reading traindata\san.lohit.exp000.tr ...

Original comment by shreeshrii on 15 Apr 2013 at 12:00

GoogleCodeExporter commented 9 years ago

Generalization of Hindi OCR Using Adaptive
Segmentation and Font Files
Mudit Agrawal, Huanfeng Ma, and David Doermann
http://lampsrv02.umiacs.umd.edu/pubs/Papers/muditagrawal-09/muditagrawal-09.pdf

Original comment by shreeshrii on 4 Oct 2014 at 9:45

GoogleCodeExporter commented 9 years ago

Issue 1425 has been merged into this issue.

Original comment by zde...@gmail.com on 22 Feb 2015 at 9:35

mmoghimi / tesseract-ocr

Errors in Hindi OCR #871