Closed GoogleCodeExporter closed 9 years ago
Here's the final product: zzz.traineddata
and the sample file containing "0..9" in OCRA font: zzz.ocra.exp0.png
Original comment by jlpool...@gmail.com
on 19 Feb 2012 at 4:11
Here are the intermediary files
Original comment by jlpool...@gmail.com
on 19 Feb 2012 at 4:26
The error seems to be with the pffmtable that was generated:
00000000 0b 00 00 00 00 00 48 00 41 00 50 00 4e 00 3e 00 |......H.A.P.N.>.|
00000010 4f 00 45 00 3e 00 50 00 47 00 4e 55 4c 4c 20 30 |O.E.>.P.G.NULL 0|
00000020 0a 30 20 37 32 0a 6c 20 36 35 0a 32 20 38 30 0a |.0 72.l 65.2 80.|
00000030 33 20 37 38 0a 34 20 36 32 0a 35 20 37 39 0a 36 |3 78.4 62.5 79.6|
00000040 20 36 39 0a 37 20 36 32 0a 38 20 38 30 0a 39 20 | 69.7 62.8 80.9 |
00000050 37 31 0a |71.|
Tesseract fails when trying to read it:
(gdb) bt
#0 tesseract::Classify::ReadNewCutoffs (this=0x7ffff730a020,
CutoffFile=0x7ffff74802a0, swap=false, end_offset=135401,
Cutoffs=0x7ffff7314020)
at third_party/tesseract/classify/cutoffs.cpp:76
#1 0x0000000000541ee4 in tesseract::Classify::InitAdaptiveClassifier
(this=0x7ffff730a020, load_pre_trained_templates=true)
at third_party/tesseract/classify/adaptmatch.cpp:571
#2 0x0000000000526ec8 in tesseract::Wordrec::program_editup
(this=0x7ffff730a020, textbase=0x7ffff7306668 "out", init_classifier=true,
init_dict=true)
at third_party/tesseract/wordrec/tface.cpp:56
(gdb) p Class
$4 =
"\000\000\000\000\000H\000A\000P\000N\000>\000O\000E\000>\000P\000G\000NULL\000"
Will need to dive in further.
Original comment by david.e...@gmail.com
on 26 Feb 2012 at 8:09
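A possible reading of the hex dump above, offered as an observation about these particular bytes rather than as documentation of Tesseract's file format: the first four bytes decode as the little-endian integer 11, and the following 22 bytes decode as eleven 16-bit values that equal the cutoffs in the text lines after them. The Python sketch below checks that arithmetic. The layout (a 32-bit count followed by 16-bit values) and the name `decode_dump` are assumptions made for illustration.

```python
import struct

# Bytes transcribed from the hex dump in the comment above (83 bytes in total).
DUMP = bytes.fromhex(
    "0b 00 00 00 00 00 48 00 41 00 50 00 4e 00 3e 00 "
    "4f 00 45 00 3e 00 50 00 47 00 4e 55 4c 4c 20 30 "
    "0a 30 20 37 32 0a 6c 20 36 35 0a 32 20 38 30 0a "
    "33 20 37 38 0a 34 20 36 32 0a 35 20 37 39 0a 36 "
    "20 36 39 0a 37 20 36 32 0a 38 20 38 30 0a 39 20 "
    "37 31 0a"
)


def decode_dump(data):
    """Split the dump into an assumed binary header and the text lines after it."""
    # Assumption: a little-endian 32-bit count, then that many 16-bit values.
    (count,) = struct.unpack_from("<I", data, 0)
    header_values = list(struct.unpack_from(f"<{count}H", data, 4))
    text_start = 4 + 2 * count
    text_lines = data[text_start:].decode("ascii").splitlines()
    # Each text line is "<label> <cutoff>"; keep only the numeric cutoff.
    text_values = [int(line.split()[1]) for line in text_lines]
    return count, header_values, text_values
```

Calling `decode_dump(DUMP)` and comparing the second and third return values shows whether the leading bytes agree with the text entries. For these bytes both lists hold the same eleven numbers, which suggests the leading bytes may be structured data rather than corruption.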
I downloaded zzz.ocra.exp0.png and tested it under version 3.02. I also
generated zzz.traineddata under version 3.02; please see the attached files.
The output text file "testocra.txt" correctly shows the output "0123456789".
I have also attached the zzz.unicharset file for reference. Tested on
Windows XP (SP3).
Original comment by withbles...@gmail.com
on 26 Feb 2012 at 10:05
I updated my Subversion pull of Tesseract
(/usr/local/src/tesseract-ocr-read-only) to build 681.
I ran in /usr/local/src/tesseract-ocr-read-only
make clean
./autogen.sh
./configure
make
make install
ldconfig
Then I built a fresh workspace, "samples_b681" and went through the steps of
building the training file in the new workspace:
/home/jlpoole/work/tess/samples_b681.
My test resulted in:
jlpoole@themis ~/work/tess/samples_b681 $ tesseract num.ocra.exp0.png output -l num
tesseract: unicharmap.cpp:105: bool UNICHARMAP::contains(const char*) const: Assertion `*unichar_repr != '\0'' failed.
Aborted
jlpoole@themis ~/work/tess/samples_b681 $
I compared the zzz.unicharset from Comment #3 with num.unicharset and found an
entry that looks malformed:
jlpoole@themis ~/work/tess/samples_b681 $ cat num.unicharset
11
NULL 0 NULL 0
0 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 1 0 0 # # 0 [30 ]0
l 3 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # # l [6c ]a
2 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 3 0 0 # # 2 [32 ]0
3 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 4 0 0 # # 3 [33 ]0
4 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 5 0 0 # # 4 [34 ]0
5 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 6 0 0 # # 5 [35 ]0
6 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 7 0 0 # # 6 [36 ]0
7 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 8 0 0 # # 7 [37 ]0
8 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 9 0 0 # # 8 [38 ]0
9 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 10 0 0 # # 9 [39 ]0
jlpoole@themis ~/work/tess/samples_b681 $
Moreover, the pffmtable looks to have the same problem referenced in Comment
#2. Here's what I generated under build 681:
jlpoole@themis ~/work/tess/samples_b681 $ cat num.pffmtable
HAPN>OE>PGNULL 0
0 72
l 65
2 80
3 78
4 62
5 79
6 69
7 62
8 80
9 71
jlpoole@themis ~/work/tess/samples_b681 $
I am attaching my workspace that has all the files.
The only thing that possibly could be affecting my outcome is a cache of other
tesseract builds I installed previously; I've been trying several versions of
tesseract to see if the problem I had is related to a version change. I am
assuming that the "make install" for build 681 would cleanly install Build 681
overwriting anything that it needs to.
I've attached a tar of my testing workbench as well as an HTML file, a
work-in-progress that documents how to train for a font (this should be helpful
to others similarly situated and not intimately familiar with Tesseract).
Original comment by jlpool...@gmail.com
on 26 Feb 2012 at 7:32
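The comparison of unicharset files described above can be partly automated. The following Python sketch assumes that the first line of a unicharset file is the entry count and that each later line begins with its character label, as in the listing above. It reports labels that fall outside an expected character set. The function name `check_labels` and the choice to skip the `NULL` entry are illustrative assumptions, not part of Tesseract.

```python
def check_labels(unicharset_text, expected_chars):
    """Compare the labels in a unicharset listing with the characters expected."""
    lines = [line for line in unicharset_text.splitlines() if line.strip()]
    declared_count = int(lines[0])
    # The label is the first whitespace-separated field of each entry line.
    labels = [line.split()[0] for line in lines[1:]]
    # Assumption: the "NULL" entry is a placeholder rather than a real character.
    real_labels = [label for label in labels if label != "NULL"]
    expected = set(expected_chars)
    return {
        "declared": declared_count,
        "found": len(labels),
        "unexpected": [label for label in real_labels if label not in expected],
        "missing": sorted(expected - set(real_labels)),
    }
```

Applied to the num.unicharset listing above with the expected set "0123456789", this reports `l` as unexpected and `1` as missing, which matches the entry that looks malformed.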
Remembering that my goal here is to OCR some already-printed text, I tried
Comment #4's training file and confirmed that I obtained the correct output,
0123456789, against the sample PNG. I'm hoping that training on the 10 digits
will suffice, and now I'll move on to testing against some real-life samples.
I just wanted to say thank you for making your traineddata available; I
(hopefully) can now move forward with the main project.
Original comment by jlpool...@gmail.com
on 26 Feb 2012 at 8:16
An aside:
For anyone interested in comparing Tesseract with gocr, here's a comparison of
tesseract (trained with OCRA per comment #4 above) with gocr for 30 samples
very similar to the PNG posted above: small image files containing only a
sequence of numbers.
 #  Tesseract  gocr    gocr match
 0  105884     105884  t
 1  67110      67_10   f
 2  7524       7524    t
 3  6212       6212    t
 4  84326      _432_   f
 5  84325      8432S   f
 6  221064     221D64  f
 7  219704     2_9704  f
 8  111544     _11S44  f
 9  231636     231636  t
10  124780     1247_0  f
11  41452      41452   t
12  41456      4_4S6   f
13  123298     123298  t
14  123299     1232__  f
15  123283     123283  t
16  42065      _2065   f
17  127916     _279_6  f
18  231638     23_638  f
19  1620       1620    t
20  29242      29242   t
21  29239      29239   t
22  123253     _232S3  f
23  114153     1_4_53  f
24  20318      203_8   f
25  102950     102950  t
26  79625      79_25   f
27  75994      75994   t
28  215187     2_5187  f
29  51165      511_5   f
    100.00%    40.00%
Original comment by jlpool...@gmail.com
on 26 Feb 2012 at 10:58
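The 40.00% figure can be rechecked from the table itself. The Python sketch below counts the rows where the gocr output equals the Tesseract output. It assumes the Tesseract column serves as the reference (the table reports it at 100.00%), and `match_rate` is a name chosen for illustration.

```python
# (Tesseract output, gocr output) pairs copied from the table above.
ROWS = [
    ("105884", "105884"), ("67110", "67_10"), ("7524", "7524"),
    ("6212", "6212"), ("84326", "_432_"), ("84325", "8432S"),
    ("221064", "221D64"), ("219704", "2_9704"), ("111544", "_11S44"),
    ("231636", "231636"), ("124780", "1247_0"), ("41452", "41452"),
    ("41456", "4_4S6"), ("123298", "123298"), ("123299", "1232__"),
    ("123283", "123283"), ("42065", "_2065"), ("127916", "_279_6"),
    ("231638", "23_638"), ("1620", "1620"), ("29242", "29242"),
    ("29239", "29239"), ("123253", "_232S3"), ("114153", "1_4_53"),
    ("20318", "203_8"), ("102950", "102950"), ("79625", "79_25"),
    ("75994", "75994"), ("215187", "2_5187"), ("51165", "511_5"),
]


def match_rate(rows):
    """Return (number of exact matches, fraction of rows that match)."""
    # Assumption: the first value in each pair is treated as the reference.
    matches = sum(1 for reference, candidate in rows if reference == candidate)
    return matches, matches / len(rows)
```

For the 30 rows above, `match_rate(ROWS)` gives 12 matching rows, which is the 40% shown in the last line of the table.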
I tried to reproduce the problem on openSUSE 12.1 64-bit (with tesseract 3.02).
When I used your zzz.traineddata I got the same error as you.
So I ran training with these steps (with your png+box):
$ tesseract zzz.ocra.exp0.png zzz.ocra.exp0 nobatch box.train
$ unicharset_extractor zzz.ocra.exp0.box
$ echo "ocra 0 0 1 0 0" >font_properties
$ shapeclustering -F font_properties -U unicharset zzz.ocra.exp0.tr
$ mftraining -F font_properties -U unicharset -O zzz.unicharset zzz.ocra.exp0.tr
$ cntraining zzz.ocra.exp0.tr
$ cp normproto zzz.normproto
$ cp inttemp zzz.inttemp
$ cp pffmtable zzz.pffmtable
$ cp shapetable zzz.shapetable
$ combine_tessdata zzz.
$ sudo cp zzz.traineddata /usr/local/share/tessdata/
$ tesseract zzz.ocra.exp0.png output -l zzz
and it works!
So I unpacked your traineddata and mine and compared them with kdiff3 (it is
meant for comparing text files, but it also shows whether binary files differ),
with this result:
1. your traineddata is missing the shapetable file (but I do not think this is the problem)
2. there is a difference in the inttemp file.
The other files are the same, so I would expect the problem to be in the inttemp file.
My zzz.traineddata is in attachment.
Original comment by zde...@gmail.com
on 27 Feb 2012 at 9:43
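The comparison above found a component file missing from one traineddata. A quick check before combining the files may catch that earlier. This Python sketch assumes the five component names that appear in the steps above (unicharset, inttemp, pffmtable, normproto, shapetable) and the `<lang>.<component>` naming used there. `missing_components` is a hypothetical helper, not a Tesseract tool.

```python
from pathlib import Path

# Assumption: these are the component files expected for one language,
# taken from the training steps in the comment above.
COMPONENTS = ("unicharset", "inttemp", "pffmtable", "normproto", "shapetable")


def missing_components(workdir, lang):
    """Return the component names with no '<lang>.<component>' file in workdir."""
    base = Path(workdir)
    return [
        name for name in COMPONENTS
        if not (base / f"{lang}.{name}").is_file()
    ]
```

For example, `missing_components(".", "zzz")` run in the training directory returns an empty list when all five files are present, and lists any that are absent.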
I had not been creating the shapetable; that was a problem.
I updated tesseract to version 684 (today's Subversion high watermark) and
successfully trained tesseract for the OCRA font. I have a Perl script which
facilitates the training process and will introduce it in a separate issue to
be filed herein.
Comment #8 helped me very much, thank you.
This issue may be closed or marked solved.
Original comment by jlpool...@gmail.com
on 28 Feb 2012 at 6:12
Original comment by zde...@gmail.com
on 28 Feb 2012 at 9:06
Original issue reported on code.google.com by
jlpool...@gmail.com
on 19 Feb 2012 at 3:59