jacklicn / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Result after combine_tessdata on CentOS. How to check if Tesseract was trained? #520

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
I am running all the steps for training mode:

1. tesseract eng.couriernew.exp12.tif eng.couriernew.exp12 batch.nochop makebox
2. tesseract eng.couriernew.exp12.tif eng.couriernew.exp12 nobatch box.train
3. unicharset_extractor eng.couriernew.exp12.box
4. cp unicharset eng.unicharset
5. echo couriernew 0 0 0 0 0 > font_properties
6. mftraining -F font_properties -U eng.unicharset eng.couriernew.exp12.tr
7. mftraining -F font_properties -U eng.unicharset -O eng.unicharset eng.couriernew.exp12.tr
8. cntraining eng.couriernew.exp12.tr
9. mv Microfeat eng.Microfeat
10. mv normproto eng.normproto
11. mv pffmtable eng.pffmtable
12. mv mfunicharset eng.mfunicharset
13. mv inttemp eng.inttemp
14. combine_tessdata eng. 
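
Before running step 14, it can help to confirm that the renamed files are actually in place. The following is only a sketch: it assumes that combine_tessdata needs at least the unicharset, inttemp, pffmtable and normproto files for the given prefix, and the function name missing_training_files is made up for illustration.

```shell
# Print every expected training file that is missing for a given prefix.
# The list of four components is an assumption based on the files produced
# in steps 4 to 13 above; it is not read from combine_tessdata itself.
# Usage: missing_training_files eng.   (note the trailing dot, as in step 14)
missing_training_files() {
    for part in unicharset inttemp pffmtable normproto; do
        [ -f "$1$part" ] || echo "$1$part"
    done
}
```

If `missing_training_files eng.` prints nothing, all four files exist in the current directory.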

What is the expected output? What do you see instead?
The first box file, generated by the first command (step 1), contained some 
wrong characters. I corrected them with Cowboxer and then went on to the second 
step, the training command (step 2). After running all the remaining commands 
(steps 2 to 14), I checked the output with this command:
tesseract eng.couriernew.exp12.tif eng.couriernew.exp12_sample3 -l eng

But I got the same result, with the same wrong characters.

Notice:

After the "combine_tessdata eng." command, this appears on the screen:

Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is 1426
Offset for type 4 is 308300
Offset for type 5 is 308487
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
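
One way to read this listing: an offset of -1 appears to mean that no file was found for that component type, while any other value is the position of the component inside eng.traineddata. The sketch below filters the output down to the types that were written. The function name list_present_types is hypothetical. Which file each type number stands for is an assumption here: in the 3.0x sources the numbering appears to follow the TessdataType enum, where 1, 3, 4 and 5 would be unicharset, inttemp, pffmtable and normproto.

```shell
# Read combine_tessdata output on stdin and print the type numbers whose
# offset is not -1, i.e. the components that ended up in the combined file.
list_present_types() {
    # Fields: "Offset for type <N> is <offset>" -> $4 is N, $NF is the offset.
    awk '/^Offset for type/ && $NF != -1 { print $4 }'
}
```

Usage: `combine_tessdata eng. | list_present_types`. For the output above this would leave types 1, 3, 4 and 5.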

How can I check whether Tesseract was actually trained?
I'm using Tesseract 3.01 on CentOS.

Please use labels and text to provide additional information.
I have attached some files and the images to this issue, but in JPEG format 
because the TIF files were too large.

Thank you in advance!

Original issue reported on code.google.com by simion.zafiu on 18 Jul 2011 at 12:25


GoogleCodeExporter commented 9 years ago
"combine_tessdata eng." will create eng.traineddata. Put this file to your 
tessdata directory. If you do not want to replace the original eng.traineddata, 
then rename the new eng.traineddata to something else (e.g. eng1.traineddata).

Then you can test your training:
tesseract eurotext.tif eurotext -l eng1
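
The copy-and-rename step could be wrapped in a small helper. This is only a sketch: the function name install_traineddata is made up, and the tessdata location in the usage line is an assumption that depends on how Tesseract was installed.

```shell
# Copy a freshly built traineddata file into a tessdata directory under a
# new language name, so the stock eng.traineddata is left untouched.
# Usage: install_traineddata SRC NEW_LANG TESSDATA_DIR
install_traineddata() {
    # Refuse to continue if the source file or target directory is missing.
    [ -f "$1" ] || { echo "missing file: $1" >&2; return 1; }
    [ -d "$3" ] || { echo "missing directory: $3" >&2; return 1; }
    cp "$1" "$3/$2.traineddata"
}
```

Usage, assuming tessdata lives in /usr/local/share/tessdata: `install_traineddata eng.traineddata eng1 /usr/local/share/tessdata && tesseract eurotext.tif eurotext -l eng1`.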

If you have further questions, please use the Tesseract forum:
http://code.google.com/p/tesseract-ocr/wiki/ReadMe#Support

Original comment by zde...@gmail.com on 11 Aug 2011 at 1:27