mithilesh1125 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Training Tesseract 3.01 Result Problem #652

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run combine_tessdata max.

What is the expected output? What do you see instead?
I produced all the files correctly (I think), but the results from Tesseract 
after training are worse than with the default eng.traineddata.
Why? I have trained twice and still can't fix the problem.
Can you help me?

What version of the product are you using? On what operating system?
I'm using Tesseract OCR 3.01 on Windows

Please provide any additional information below.
I have attached the files I produced.
I renamed them with the "max" language prefix so that they combine into max.traineddata.

Original issue reported on code.google.com by maximili...@gmail.com on 15 Mar 2012 at 11:36

GoogleCodeExporter commented 9 years ago
Have you read the first steps on the wiki page?

----8<------8<---
Make sure there are a minimum number of samples of each character. 10 is good, 
but 5 is OK for rare characters.

There should be more samples of the more frequent characters - at least 20. 
----8<------8<---
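
One way to check a training set against these thresholds is to count the entries per character in the box file. The following is a minimal sketch, assuming the 3.0x box format of one character per line followed by its coordinates and page number; the function names and the sample data are hypothetical, not part of Tesseract.

```python
from collections import Counter


def count_box_samples(box_text):
    """Count how many boxes each character has in box-file text."""
    counts = Counter()
    for line in box_text.splitlines():
        parts = line.split()
        # Assumed line layout: <char> <left> <bottom> <right> <top> <page>
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        counts[parts[0]] += 1
    return counts


def undersampled(counts, minimum=5):
    """Return the characters that have fewer than `minimum` samples."""
    return sorted(ch for ch, n in counts.items() if n < minimum)


# Hypothetical box-file content. For a real file, read it first, e.g.
# open("max.police.exp0.box", encoding="utf-8").read() (file name is made up).
SAMPLE = """\
A 10 10 30 40 0
A 40 10 60 40 0
B 70 10 90 40 0
"""

counts = count_box_samples(SAMPLE)
print(counts)
print(undersampled(counts, minimum=2))
```

Running it with minimum=5 for rare characters and minimum=20 for frequent ones gives a quick list of what still needs more samples.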

Original comment by pe...@hhoefling.de on 15 Mar 2012 at 1:52

GoogleCodeExporter commented 9 years ago
I will try to write a text with 20 samples of each character.

Original comment by maximili...@gmail.com on 15 Mar 2012 at 2:25

GoogleCodeExporter commented 9 years ago
Please upload the text file used to generate the tif file, for further testing. 
The output for the tif file does not come out in the correct order even though 
there are no misspellings, so I suspect something is wrong with the tif file 
itself.

If the text file is uploaded, I will generate a tif and its box file using the 
Arial font for testing purposes.

Original comment by withbles...@gmail.com on 15 Mar 2012 at 4:27

GoogleCodeExporter commented 9 years ago
One issue is that the layout recognition phase of tesseract currently 
returns 8 columns for the alphabet area (and skips the last "I R" column). It 
therefore decides that the text is running top-to-bottom, rather than 
left-to-right.
See attached image: "police-Text Lines (RIL_TEXTLINE).png".

Unfortunately, using -psm 4 (PSM_SINGLE_COLUMN) crashes tesseract (see Issue 
653).

-psm 6 (PSM_SINGLE_BLOCK) does cause text rows to be used (see "police-Text 
Lines (RIL_TEXTLINE) PSM-6.png") and with the following OCR results:

   ABCIJEFGHI
   JKLMNUPUR
   STUVWXYZ
   123455789

This might be the result of the tif saying it's 96 DPI and therefore 16.67 sq 
inches? That's pretty big. However, changing the DPI to 300 DPI or 600 DPI 
doesn't seem to fix things.
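
The size arithmetic here is just pixels divided by DPI. A small sketch, assuming a hypothetical 1600 x 1600 pixel image (the real pixel dimensions are not stated in this thread; that value is picked only because it gives 16.67 inches per side at 96 DPI):

```python
def physical_size_inches(width_px, height_px, dpi):
    """Return (width, height) in inches for an image of the given pixel size."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return width_px / dpi, height_px / dpi


# Hypothetical pixel dimensions, an assumption and not taken from the attached tif.
WIDTH_PX, HEIGHT_PX = 1600, 1600

for dpi in (96, 300, 600):
    w, h = physical_size_inches(WIDTH_PX, HEIGHT_PX, dpi)
    print(f"{dpi} DPI: {w:.2f} x {h:.2f} inches")
```

Changing only the DPI tag rescales the declared physical size; the pixel data stays the same.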

The layout is correctly finding the characters (see "police-Connected 
Components (RIL_SYMBOL) PSM-6.png"). I'm not sure why it decides to split the D 
into I & J.

Possibly the relatively poor OCR is because these aren't "words" but single, 
separated letters. You might have to use PSM_SINGLE_CHAR mode with each of the 
boxes returned by TessBaseAPI::GetConnectedComponents().
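
For anyone trying the same modes from the command line instead of the API, here is a rough sketch that only builds the argument list. It assumes the 3.0x single-dash -psm flag (later releases use --psm), the "max" language from this issue, and hypothetical file names; single-character mode would also need each character cropped to its own image first, which is not shown.

```python
# Page segmentation mode numbers as used by the tesseract command line.
PSM_SINGLE_COLUMN = 4
PSM_SINGLE_BLOCK = 6
PSM_SINGLE_CHAR = 10


def tesseract_command(image, output_base, lang, psm):
    """Build the argument list for a tesseract 3.0x run with a given PSM."""
    return ["tesseract", image, output_base, "-l", lang, "-psm", str(psm)]


# "police.tif" and "police" are hypothetical names used for illustration.
cmd = tesseract_command("police.tif", "police", "max", PSM_SINGLE_BLOCK)
print(" ".join(cmd))

# To actually run it, pass the list to subprocess.run(cmd, check=True).
```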

Original comment by tomp2...@gmail.com on 16 Mar 2012 at 5:03

GoogleCodeExporter commented 9 years ago
Hello,
I tried again with a new .tif file containing many more characters (plaque.tif).
The result is better but still not very good (source image for testing: test.tif; 
result: test.txt).
Can I do something to improve it?

Thank you,

Original comment by maximili...@gmail.com on 21 Mar 2012 at 9:18

GoogleCodeExporter commented 9 years ago
No solution for my problem?

Original comment by maximili...@gmail.com on 25 Mar 2012 at 10:01

GoogleCodeExporter commented 9 years ago
Have you tried 3.02? Can you post the plaque.box file?

Original comment by zde...@gmail.com on 3 Jan 2013 at 10:14

GoogleCodeExporter commented 9 years ago
Closed because of missing input from the issue reporter.

Original comment by zde...@gmail.com on 20 Dec 2013 at 10:57