michaelethompson / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Training Fails: unicharmap.cpp:105 Assertion `*unichar_repr != '\0'' failed. #629

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. create training file for a font for numbers 0..9
2. run tesseract against training sample using newly created "traineddata"

What is the expected output? 
Expected: successful termination of tesseract when applied against sample file.

What do you see instead?
Actual: (2 separate machines)

jlpoole@themis ~/work/tess/samples $ tesseract zzz.ocra.exp0.png output -l zzz  
tesseract: unicharmap.cpp:105: bool UNICHARMAP::contains(const char*) const: 
Assertion `*unichar_repr != '\0'' failed.
Aborted
jlpoole@themis ~/work/tess/samples $

jlpoole@hermes ~/work/tess/samples $ tesseract eng.ocra.exp1.png output
tesseract: unicharmap.cpp:105: bool UNICHARMAP::contains(const char*) const: 
Assertion `*unichar_repr != '\0'' failed.
Aborted
jlpoole@hermes ~/work/tess/samples $

What version of the product are you using? On what operating system?
On two separate machines (x86 & x86_64), using the same revision:
Revision: 675 (2/14/2012) 

jlpoole@themis /usr/local/src $ uname -a
Linux themis 2.6.39-gentoo-r3 #1 SMP Sat Sep 17 08:58:44 PDT 2011 i686 Intel(R) 
Core(TM) i5 CPU M 460 @ 2.53GHz GenuineIntel GNU/Linux
jlpoole@themis /usr/local/src $

jlpoole@hermes ~/work/tess/samples $ uname -a
Linux hermes 2.6.39.2 #7 SMP Wed Aug 10 05:51:40 PDT 2011 x86_64 AMD Phenom(tm) 
II X4 940 Processor AuthenticAMD GNU/Linux
jlpoole@hermes ~/work/tess/samples $

Please provide any additional information below.

See concurrent Issue 627.  Suspecting the problem I was running into might be 
related to a 64-bit operating system, I also went through the steps of 
creating a training file on a 32-bit box.  On the second attempt, I used a 
language name of "zzz" instead of "eng" as I did on the first.   Here's a 
summary of my attempt to create a training file on the 32-bit system.

It may be that I missed something in the instructions.  If so, the process of 
creating a training file does not alert me to a problem.  It is only at the 
final stage, running tesseract against the installed training file, that it 
becomes apparent something is amiss.

=================================================================

tesseract zzz.ocra.exp0.png zzz.ocra.exp0 batch.nochop makebox

    jlpoole@themis ~/work/tess/samples $ cat zzz.ocra.exp0.box
    D 35 53 61 94 0
    l 74 53 100 94 0
    E 114 53 140 94 0
    3 153 53 179 94 0
    H 194 53 218 94 0
    5 233 53 259 94 0
    E 273 53 299 94 0
    ? 312 53 338 94 0
    B 352 53 378 94 0
    H 392 53 418 94 0
    jlpoole@themis ~/work/tess/samples 
    jlpoole@themis ~/work/tess/samples $ nano zzz.ocra.exp0.box
    jlpoole@themis ~/work/tess/samples $ cat zzz.ocra.exp0.box
    0 35 53 61 94 0
    l 74 53 100 94 0
    2 114 53 140 94 0
    3 153 53 179 94 0
    4 194 53 218 94 0
    5 233 53 259 94 0
    6 273 53 299 94 0
    7 312 53 338 94 0
    8 352 53 378 94 0
    9 392 53 418 94 0
    jlpoole@themis ~/work/tess/samples $
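
A quick check of the edited box file can be sketched as follows. This is only an illustration, assuming the intended character set is the digits 0..9 as stated in the reproduction steps; the names `EXPECTED`, `parse_box_lines` and `unexpected_symbols` are hypothetical helpers, not part of Tesseract. Whether an unexpected symbol contributes to the assertion is not established in this thread.

```python
# Sketch: flag box-file symbols that fall outside the intended character set.
# EXPECTED is an assumption taken from the stated goal (digits 0..9).
EXPECTED = set("0123456789")

# The edited box file exactly as listed above.
BOX_TEXT = """\
0 35 53 61 94 0
l 74 53 100 94 0
2 114 53 140 94 0
3 153 53 179 94 0
4 194 53 218 94 0
5 233 53 259 94 0
6 273 53 299 94 0
7 312 53 338 94 0
8 352 53 378 94 0
9 392 53 418 94 0
"""


def parse_box_lines(text):
    """Return (symbol, left, bottom, right, top, page) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        symbol, *numbers = line.split()
        if len(numbers) != 5:
            raise ValueError("unexpected box line: %r" % line)
        rows.append((symbol, *map(int, numbers)))
    return rows


def unexpected_symbols(rows, expected=EXPECTED):
    """Return the symbols that are not in the expected character set."""
    return [row[0] for row in rows if row[0] not in expected]


rows = parse_box_lines(BOX_TEXT)
print(unexpected_symbols(rows))
```

On the listing above this reports the second entry, which is a lowercase letter "l" rather than the digit "1".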

tesseract zzz.ocra.exp0.png zzz.ocra.exp0 nobatch box.train

    jlpoole@themis ~/work/tess/samples $ tesseract zzz.ocra.exp0.png zzz.ocra.exp0 nobatch box.train
    Tesseract Open Source OCR Engine v3.02 with Leptonica
    APPLY_BOXES:
       Boxes read from boxfile:      10
       Found 10 good blobs.
    TRAINING ... Font name = ocra
    Generated training data for 1 words
    jlpoole@themis ~/work/tess/samples $

unicharset_extractor zzz.ocra.exp0.box

    jlpoole@themis ~/work/tess/samples $ unicharset_extractor zzz.ocra.exp0.box
    Extracting unicharset from zzz.ocra.exp0.box
    Wrote unicharset file ./unicharset.
    jlpoole@themis ~/work/tess/samples $

# created font_properties
    jlpoole@themis ~/work/tess/samples $ cat > font_properties
    ocra 0 0 1 0 0
    jlpoole@themis ~/work/tess/samples $
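
For readers unfamiliar with the font_properties line, here is a small sketch of how it can be read. It assumes the field order documented for Tesseract 3.x training (font name, then italic, bold, fixed, serif, fraktur as 0/1 flags); `FontProperties` and `parse_font_properties` are hypothetical names used only for illustration.

```python
from collections import namedtuple

# Field order assumed from the Tesseract 3.x training instructions.
FontProperties = namedtuple(
    "FontProperties", "name italic bold fixed serif fraktur")


def parse_font_properties(line):
    """Parse one font_properties line into named boolean flags."""
    name, *flags = line.split()
    if len(flags) != 5 or any(flag not in ("0", "1") for flag in flags):
        raise ValueError("expected a name and five 0/1 flags: %r" % line)
    return FontProperties(name, *(flag == "1" for flag in flags))


props = parse_font_properties("ocra 0 0 1 0 0")
print(props)
```

Under that assumption, the line used here declares "ocra" as a fixed-pitch font with no other properties set.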

mftraining -F font_properties -U unicharset -O zzz.unicharset zzz.ocra.exp0.tr 

    jlpoole@themis ~/work/tess/samples $ mftraining -F font_properties -U unicharset -O zzz.unicharset zzz.ocra.exp0.tr
    Warning: No shape table file present: shapetable
    Reading zzz.ocra.exp0.tr ...
    Flat shape table summary: Number of shapes = 10 max unichars = 1 number with multiple unichars = 0
    Done!
    jlpoole@themis ~/work/tess/samples $

cntraining zzz.ocra.exp0.tr

    jlpoole@themis ~/work/tess/samples $ cntraining zzz.ocra.exp0.tr
    Reading zzz.ocra.exp0.tr ...
    Clustering ...

    Writing normproto ...
    jlpoole@themis ~/work/tess/samples $

cp normproto zzz.normproto
cp inttemp zzz.inttemp
cp pffmtable zzz.pffmtable

    jlpoole@themis ~/work/tess/samples $ cp normproto zzz.normproto
    jlpoole@themis ~/work/tess/samples $ cp inttemp zzz.inttemp
    jlpoole@themis ~/work/tess/samples $ cp pffmtable zzz.pffmtable
    jlpoole@themis ~/work/tess/samples $

combine_tessdata zzz.

    jlpoole@themis ~/work/tess/samples $ combine_tessdata zzz.
    Combining tessdata files
    TessdataManager combined tesseract data files.
    Offset for type 0 is -1
    Offset for type 1 is 140
    Offset for type 2 is -1
    Offset for type 3 is 799
    Offset for type 4 is 135319
    Offset for type 5 is 135402
    Offset for type 6 is -1
    Offset for type 7 is -1
    Offset for type 8 is -1
    Offset for type 9 is -1
    Offset for type 10 is -1
    Offset for type 11 is -1
    Offset for type 12 is -1
    Offset for type 13 is -1
    Offset for type 14 is -1
    Offset for type 15 is -1
    Offset for type 16 is -1
    jlpoole@themis ~/work/tess/samples $
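
The offset listing can be summarised with a short sketch like the one below. It only relies on what the log shows: an offset of -1 is treated as "component not present", and the size of a component is taken as the distance to the next present offset. The helper name `present_sections` is hypothetical, and the meaning of each type number is deliberately not assumed here.

```python
import re

# The combine_tessdata output exactly as listed above.
LOG = """\
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 799
Offset for type 4 is 135319
Offset for type 5 is 135402
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is -1
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
"""


def present_sections(log):
    """Return (present type ids, sizes of all but the last present type)."""
    pairs = re.findall(r"Offset for type (\d+) is (-?\d+)", log)
    present = sorted((int(off), int(typ)) for typ, off in pairs if int(off) >= 0)
    sizes = {}
    for (offset, type_id), (next_offset, _) in zip(present, present[1:]):
        sizes[type_id] = next_offset - offset
    return [type_id for _, type_id in present], sizes


types, sizes = present_sections(LOG)
print(types, sizes)
```

Only four of the seventeen slots are filled. The size computed for type 4 is 83 bytes, which matches the length of the pffmtable hex dump posted later in this thread (it ends at offset 0x53).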

jlpoole@themis ~/work/tess/samples $ ls
OCRA_numbers_variety.png            output.txt         zzz.ocra.exp0.png
OCRA_numbers_variety.tif            pffmtable          zzz.ocra.exp0.tr
font_properties                     shapetable         zzz.ocra.exp0.txt
inttemp                             unicharset         zzz.pffmtable
normproto                           zzz.inttemp        zzz.traineddata
ocr_sample_numbersonly_cropped.png  zzz.normproto      zzz.unicharset
ocr_sample_numbersonly_cropped.tif  zzz.ocra.exp0.box
jlpoole@themis ~/work/tess/samples $

jlpoole@themis ~/work/tess/samples $ su
Password:
themis samples # cp zzz.traineddata /usr/local/share/tessdata/
themis samples # exit

jlpoole@themis ~/work/tess/samples $ tesseract zzz.ocra.exp0.png output -l zzz  
tesseract: unicharmap.cpp:105: bool UNICHARMAP::contains(const char*) const: 
Assertion `*unichar_repr != '\0'' failed.
Aborted
jlpoole@themis ~/work/tess/samples $

Original issue reported on code.google.com by jlpool...@gmail.com on 19 Feb 2012 at 3:59

GoogleCodeExporter commented 9 years ago
Here's the final product: zzz.traineddata
and the sample file containing "0..9" in OCRA font: zzz.ocra.exp0.png

Original comment by jlpool...@gmail.com on 19 Feb 2012 at 4:11

Attachments:

GoogleCodeExporter commented 9 years ago
Here are the intermediary files

Original comment by jlpool...@gmail.com on 19 Feb 2012 at 4:26

Attachments:

GoogleCodeExporter commented 9 years ago
The error seems to be with the pffmtable that was generated:

00000000  0b 00 00 00 00 00 48 00  41 00 50 00 4e 00 3e 00  |......H.A.P.N.>.|
00000010  4f 00 45 00 3e 00 50 00  47 00 4e 55 4c 4c 20 30  |O.E.>.P.G.NULL 0|
00000020  0a 30 20 37 32 0a 6c 20  36 35 0a 32 20 38 30 0a  |.0 72.l 65.2 80.|
00000030  33 20 37 38 0a 34 20 36  32 0a 35 20 37 39 0a 36  |3 78.4 62.5 79.6|
00000040  20 36 39 0a 37 20 36 32  0a 38 20 38 30 0a 39 20  | 69.7 62.8 80.9 |
00000050  37 31 0a                                          |71.|

Tesseract fails when trying to read it:

(gdb) bt
#0  tesseract::Classify::ReadNewCutoffs (this=0x7ffff730a020, 
CutoffFile=0x7ffff74802a0, swap=false, end_offset=135401, 
Cutoffs=0x7ffff7314020)
    at third_party/tesseract/classify/cutoffs.cpp:76
#1  0x0000000000541ee4 in tesseract::Classify::InitAdaptiveClassifier 
(this=0x7ffff730a020, load_pre_trained_templates=true)
    at third_party/tesseract/classify/adaptmatch.cpp:571
#2  0x0000000000526ec8 in tesseract::Wordrec::program_editup 
(this=0x7ffff730a020, textbase=0x7ffff7306668 "out", init_classifier=true, 
init_dict=true)
    at third_party/tesseract/wordrec/tface.cpp:56

(gdb) p Class
$4 = 
"\000\000\000\000\000H\000A\000P\000N\000>\000O\000E\000>\000P\000G\000NULL\000"

Will need to dive in further.
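
One way to read the dump above, offered as an observation rather than a confirmed diagnosis: the file appears to hold a little-endian 32-bit count (11) followed by eleven 16-bit values, and only then the text lines. The sketch below decodes the posted bytes under that assumption; `split_pffmtable` is a hypothetical helper, not a Tesseract function.

```python
import struct

# The 83 bytes from the hex dump above.
DUMP_HEX = (
    "0b 00 00 00 00 00 48 00 41 00 50 00 4e 00 3e 00 "
    "4f 00 45 00 3e 00 50 00 47 00 4e 55 4c 4c 20 30 "
    "0a 30 20 37 32 0a 6c 20 36 35 0a 32 20 38 30 0a "
    "33 20 37 38 0a 34 20 36 32 0a 35 20 37 39 0a 36 "
    "20 36 39 0a 37 20 36 32 0a 38 20 38 30 0a 39 20 "
    "37 31 0a"
)
data = bytes.fromhex(DUMP_HEX)


def split_pffmtable(blob):
    """Split the blob into (count, binary values, text (name, value) pairs).

    Assumes: int32 count, then `count` uint16 values, then ASCII text lines.
    """
    (count,) = struct.unpack_from("<i", blob, 0)
    binary = list(struct.unpack_from("<%dH" % count, blob, 4))
    text = blob[4 + 2 * count:].decode("ascii")
    pairs = [line.rsplit(" ", 1) for line in text.splitlines()]
    return count, binary, [(name, int(value)) for name, value in pairs]


count, binary, text_pairs = split_pffmtable(data)
print(count, binary)
print(text_pairs)
```

Under this reading the binary values repeat the cutoffs from the text lines exactly. A reader expecting plain text from the first byte would start on NUL bytes, which is consistent with the `Class` buffer shown in gdb and with an assertion about an empty string.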

Original comment by david.e...@gmail.com on 26 Feb 2012 at 8:09

GoogleCodeExporter commented 9 years ago
I downloaded zzz.ocra.exp0.png and generated zzz.traineddata under version 
3.02. Please see the attached output text file "testocra.txt", in which the 
output is correctly displayed as "0123456789".
I have also attached the zzz.unicharset file for reference. Tested on Windows XP (SP3).

Original comment by withbles...@gmail.com on 26 Feb 2012 at 10:05

Attachments:

GoogleCodeExporter commented 9 years ago
I updated my Subversion pull of Tesseract 
(/usr/local/src/tesseract-ocr-read-only) to build 681.

I ran in /usr/local/src/tesseract-ocr-read-only
make clean
./autogen.sh
./configure
make
make install
ldconfig

Then I built a fresh workspace, "samples_b681" and went through the steps of 
building the training file in the new workspace: 
/home/jlpoole/work/tess/samples_b681.

My test resulted in:

jlpoole@themis ~/work/tess/samples_b681 $ tesseract num.ocra.exp0.png output -l 
num
tesseract: unicharmap.cpp:105: bool UNICHARMAP::contains(const char*) const: 
Assertion `*unichar_repr != '\0'' failed.
Aborted
jlpoole@themis ~/work/tess/samples_b681 $

I compared the zzz.unicharset from Comment #4 with num.unicharset and found an 
entry that looks malformed:

jlpoole@themis ~/work/tess/samples_b681 $ cat num.unicharset
11
NULL 0 NULL 0
0 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 1 0 0 #    # 0 [30 ]0
l 3 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 #   # l [6c ]a
2 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 3 0 0 #    # 2 [32 ]0
3 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 4 0 0 #    # 3 [33 ]0
4 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 5 0 0 #    # 4 [34 ]0
5 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 6 0 0 #    # 5 [35 ]0
6 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 7 0 0 #    # 6 [36 ]0
7 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 8 0 0 #    # 7 [37 ]0
8 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 9 0 0 #    # 8 [38 ]0
9 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 10 0 0 #   # 9 [39 ]0
jlpoole@themis ~/work/tess/samples_b681 $
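
To see which entry stands out, the second field of each unicharset line can be decoded as a property bitmask. The sketch below assumes the mask values used in the Tesseract 3.x sources (alpha=1, lower=2, upper=4, digit=8, punctuation=16); `MASKS`, `decode` and `non_digit_entries` are hypothetical names for illustration.

```python
# Property bit masks, assumed from the Tesseract 3.x unicharset code.
MASKS = {"alpha": 1, "lower": 2, "upper": 4, "digit": 8, "punctuation": 16}

# num.unicharset as listed above (whitespace in the comment part normalised).
UNICHARSET = """\
11
NULL 0 NULL 0
0 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 1 0 0 # # 0 [30 ]0
l 3 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # # l [6c ]a
2 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 3 0 0 # # 2 [32 ]0
3 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 4 0 0 # # 3 [33 ]0
4 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 5 0 0 # # 4 [34 ]0
5 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 6 0 0 # # 5 [35 ]0
6 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 7 0 0 # # 6 [36 ]0
7 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 8 0 0 # # 7 [37 ]0
8 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 9 0 0 # # 8 [38 ]0
9 8 0,255,0,255,0,32767,0,32767,0,32767 NULL 10 0 0 # # 9 [39 ]0
"""


def decode(properties):
    """Return the names of the property bits that are set."""
    return [name for name, mask in MASKS.items() if properties & mask]


def non_digit_entries(text):
    """Return (unichar, property names) for entries without the digit bit."""
    found = []
    for line in text.splitlines()[1:]:  # first line is the entry count
        unichar, properties = line.split()[:2]
        if unichar == "NULL":  # placeholder entry, skipped here
            continue
        if not int(properties) & MASKS["digit"]:
            found.append((unichar, decode(int(properties))))
    return found


flagged = non_digit_entries(UNICHARSET)
print(flagged)
```

Under these assumptions the entry that differs is the lowercase letter "l" (hex 6c), classified as an alphabetic lowercase character rather than a digit; it comes from the box file, where "l" was entered instead of "1".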

Moreover, the pffmtable looks to have the same problem referenced in Comment 
#3.  Here's what I generated under build 681:

jlpoole@themis ~/work/tess/samples_b681 $ cat num.pffmtable

HAPN>OE>PGNULL 0
0 72
l 65
2 80
3 78
4 62
5 79
6 69
7 62
8 80
9 71
jlpoole@themis ~/work/tess/samples_b681 $

I am attaching my workspace that has all the files.

The only thing that possibly could be affecting my outcome is a cache of other 
tesseract builds I installed previously; I've been trying several versions of 
tesseract to see if the problem I had is related to a version change.   I am 
assuming that the "make install" for build 681 would cleanly install Build 681 
overwriting anything that it needs to.  

I'm attaching a tar of my testing workbench as well as an HTML file I have as a 
work-in-progress to document how to train for a font (this should be helpful to 
others similarly situated and not intimately familiar with Tesseract).

Original comment by jlpool...@gmail.com on 26 Feb 2012 at 7:32

Attachments:

GoogleCodeExporter commented 9 years ago
Remembering that my goal here is to OCR some already-printed text, I tried 
Comment #4's training file and confirmed that I obtained correct output of: 
0123456789 against the sample PNG.  I am hoping that the training of 10 digits 
will suffice; and now I'll move on to testing against some real life samples.  I 
just wanted to say thank you for making your traineddata available -- I 
(hopefully) can now move forward with the main project.

Original comment by jlpool...@gmail.com on 26 Feb 2012 at 8:16

GoogleCodeExporter commented 9 years ago
An aside:

For anyone interested in comparing Tesseract with gocr, here's a comparison of 
tesseract (trained with OCRA per comment #4 above) with gocr for 30 samples 
very similar to the PNG posted above: small image files containing only a 
sequence of numbers.

#    Tesseract    gocr    gocr match
0    105884      105884    t
1    67110        67_10    f
2    7524          7524    t
3    6212          6212    t
4    84326        _432_    f
5    84325        8432S    f
6    221064      221D64    f
7    219704      2_9704    f
8    111544      _11S44    f
9    231636      231636    t
10    124780     1247_0    f
11    41452       41452    t
12    41456       4_4S6    f
13    123298     123298    t
14    123299     1232__    f
15    123283     123283    t
16    42065       _2065    f
17    127916     _279_6    f
18    231638     23_638    f
19    1620         1620    t
20    29242       29242    t
21    29239       29239    t
22    123253     _232S3    f
23    114153     1_4_53    f
24    20318       203_8    f
25    102950     102950    t
26    79625       79_25    f
27    75994       75994    t
28    215187     2_5187    f
29    51165       511_5    f

    100.00%      40.00%    
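
The 40.00% figure can be recomputed from the table with a few lines. This sketch treats the Tesseract column as the reference string, as the "gocr match" column does; the variable names are illustrative only.

```python
# (tesseract output, gocr output) pairs copied from the table above.
SAMPLES = [
    ("105884", "105884"), ("67110", "67_10"), ("7524", "7524"),
    ("6212", "6212"), ("84326", "_432_"), ("84325", "8432S"),
    ("221064", "221D64"), ("219704", "2_9704"), ("111544", "_11S44"),
    ("231636", "231636"), ("124780", "1247_0"), ("41452", "41452"),
    ("41456", "4_4S6"), ("123298", "123298"), ("123299", "1232__"),
    ("123283", "123283"), ("42065", "_2065"), ("127916", "_279_6"),
    ("231638", "23_638"), ("1620", "1620"), ("29242", "29242"),
    ("29239", "29239"), ("123253", "_232S3"), ("114153", "1_4_53"),
    ("20318", "203_8"), ("102950", "102950"), ("79625", "79_25"),
    ("75994", "75994"), ("215187", "2_5187"), ("51165", "511_5"),
]

# A sample counts as a match only if the whole string is identical.
matches = sum(1 for tess, gocr in SAMPLES if tess == gocr)
rate = matches / len(SAMPLES)
print("%d of %d samples match (%.2f%%)" % (matches, len(SAMPLES), 100 * rate))
```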

Original comment by jlpool...@gmail.com on 26 Feb 2012 at 10:58

GoogleCodeExporter commented 9 years ago
I tried to reproduce the problem on openSUSE 12.1 64-bit (with tesseract 3.02).
When I used your zzz.traineddata I got the same error as you.
So I ran the training with these steps (with your png+box):
$ tesseract zzz.ocra.exp0.png zzz.ocra.exp0 nobatch box.train
$ unicharset_extractor zzz.ocra.exp0.box
$ echo "ocra 0 0 1 0 0" >font_properties
$ shapeclustering -F font_properties -U unicharset zzz.ocra.exp0.tr
$ mftraining -F font_properties -U unicharset -O zzz.unicharset zzz.ocra.exp0.tr
$ cntraining zzz.ocra.exp0.tr
$ cp normproto zzz.normproto
$ cp inttemp zzz.inttemp
$ cp pffmtable zzz.pffmtable
$ cp shapetable zzz.shapetable
$ sudo cp zzz.traineddata /usr/local/share/tessdata/
$ tesseract zzz.ocra.exp0.png output -l zzz
and it works!
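
The steps above can be written down as data, which makes it harder to skip one. The sketch below only builds the command lines and does not run anything; `training_steps` is a hypothetical helper, and it assumes a font_properties file already exists in the working directory. Note that the listing above does not show a `combine_tessdata zzz.` call, although zzz.traineddata has to come from somewhere; the sketch includes it as it appears in the original report.

```python
def training_steps(lang, font, exp=0):
    """Return (commands, file copies, combine command) for one training page."""
    base = "%s.%s.exp%d" % (lang, font, exp)
    tr_file = base + ".tr"
    commands = [
        ["tesseract", base + ".png", base, "nobatch", "box.train"],
        ["unicharset_extractor", base + ".box"],
        # The step that was missing from the original report:
        ["shapeclustering", "-F", "font_properties", "-U", "unicharset", tr_file],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", lang + ".unicharset", tr_file],
        ["cntraining", tr_file],
    ]
    # Each generated file is copied to a name carrying the language prefix.
    copies = [(name, "%s.%s" % (lang, name))
              for name in ("normproto", "inttemp", "pffmtable", "shapetable")]
    combine = ["combine_tessdata", lang + "."]
    return commands, copies, combine


commands, copies, combine = training_steps("zzz", "ocra")
for command in commands + [combine]:
    print(" ".join(command))
```

To actually execute the steps, each command list could be passed to `subprocess.run(command, check=True)` and each copy to `shutil.copy`, with the combine command run last.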

So I unpacked your traineddata and mine and compared them with kdiff3 (it is 
for comparing text files, but it also shows whether binary files differ), 
with this result:
1. you are missing the shapetable file (but I do not think this is a problem)
2. there is a difference in the inttemp file. 
The other files are the same. So I would expect the problem to be in the inttemp file.

My zzz.traineddata is in attachment.
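
The same comparison can be done without a GUI tool. The sketch below compares two directories of unpacked component files byte for byte; it assumes the components were already unpacked into separate directories (for example with `combine_tessdata -u`, which is an assumption about the method, not something stated above). `compare_components` is a hypothetical helper.

```python
import filecmp
import os


def compare_components(dir_a, dir_b):
    """Report files missing on either side and files whose bytes differ."""
    names_a = set(os.listdir(dir_a))
    names_b = set(os.listdir(dir_b))
    common = sorted(names_a & names_b)
    # shallow=False forces a content comparison instead of trusting os.stat.
    _, mismatch, errors = filecmp.cmpfiles(dir_a, dir_b, common, shallow=False)
    return {
        "only_in_a": sorted(names_a - names_b),
        "only_in_b": sorted(names_b - names_a),
        "differing": sorted(mismatch + errors),
    }
```

Called as `compare_components("theirs", "mine")` with two hypothetical directory names, a result like the one described above would list the shapetable file under `only_in_b` and the inttemp file under `differing`.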

Original comment by zde...@gmail.com on 27 Feb 2012 at 9:43

Attachments:

GoogleCodeExporter commented 9 years ago
I had not been creating the shapetable; that was a problem.

I updated tesseract to version 684 (today's Subversion high watermark) and 
successfully trained tesseract for the OCRA font.  I have a Perl script which 
facilitates the training process and will introduce it in a separate issue to 
be filed herein.

Comment #8 helped me very much, thank you.

This issue may be closed or marked solved.

Original comment by jlpool...@gmail.com on 28 Feb 2012 at 6:12

GoogleCodeExporter commented 9 years ago

Original comment by zde...@gmail.com on 28 Feb 2012 at 9:06