meego / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

mftraining crashes when using more than one tr file #537

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Create 2 sets of tif/box/tr files (in this case eng.sysd.exp0 and eng.sysd.exp1)
2. unicharset_extractor eng.sysd.exp0.box eng.sysd.exp1.box
3. mftraining -F eng.font_properties -U unicharset -O eng.unicharset eng.sysd.exp0.tr eng.sysd.exp1.tr

What is the expected output? What do you see instead?
mftraining crashes, outputting:
"Writing Merged Microfeat ...Class->NumConfigs == this->fontset_table_.get(Class->font_set_id).size:Error:Assert failed:in file .\intproto.cpp, line 1268"

What version of the product are you using? On what operating system?
3.01 on Windows 7 x64. I've also tried it on Windows XP 32-bit with the same 
result. I also tried compiling the latest revision of tesseract and got the 
same error (although the line number was different).

Please provide any additional information below.

I have tried this with multiple training sources and I seem to have this 
problem whenever I try to train with multiple tr files.  If I specify just one 
tr file, mftraining works properly.

Original issue reported on code.google.com by nickkeln...@gmail.com on 21 Aug 2011 at 7:19

GoogleCodeExporter commented 9 years ago
I'm having the same error, more info in a second

Original comment by sebek.m...@gmail.com on 21 Sep 2011 at 5:40

GoogleCodeExporter commented 9 years ago
mftraining.exe -F font_properties blue.test.exp1.tr blue.test.exp2.tr
Reading blue.test.exp1.tr...
Reading blue.test.exp2.tr...
Class->NumConfigs == this->fontset_table_.get(Class->font_set_id).size:Error:Assert failed:in file ..\\classify\intproto.cpp, line 1312

Original comment by sebek.m...@gmail.com on 21 Sep 2011 at 5:47

GoogleCodeExporter commented 9 years ago
And that is with the latest SVN revision on Windows 7 32-bit.

Original comment by sebek.m...@gmail.com on 21 Sep 2011 at 5:48

GoogleCodeExporter commented 9 years ago
Do you have any solution?
I was getting errors from mftraining with a multi-page TIFF, so I tried 
training with all the TIFF files as single pages. I have 68 TIFF files from 
which I generated the box files. Could the problem be the number of files?

Original comment by mervet2...@gmail.com on 29 Sep 2011 at 1:50

GoogleCodeExporter commented 9 years ago
What exactly was your error? Try to post more information so when someone comes 
along who knows what they're doing they can implement a fix. 

It seems like a bug that anyone with some decent amount of experience 
developing tesseract would be able to handle quickly, but I wasn't successful 
in acquainting myself with the program's structure in the time I had available. 

I was feeding multiple single-page tiffs into mftraining when it crashed, but 
again it worked when they were fed individually. 

Have you tried feeding in only two files to see whether mftraining still crashes?

Original comment by sebek.m...@gmail.com on 29 Sep 2011 at 9:16

GoogleCodeExporter commented 9 years ago
This is a "feature" but it will be fixed in 3.02.
Currently each tr file *must* represent a different font, as it will create a 
different config and the code assumes that there is only one config per font, 
hence the assert.
WORK-AROUND 1: Use a multi-page tiff for multiple images with the same font. 
They will go into a single tr file during the box.train phase.
WORK-AROUND 2: Cat together multiple tr files that represent the same font.
WORK-AROUND 3: Use a different font name and create a different entry for it in 
the font_properties file.

A future version, probably 3.02, will use the font name contained in the tr 
file instead of the file name, and sort the font data on reading the tr files, 
and this restriction will be lifted.

Original comment by theraysm...@gmail.com on 1 Oct 2011 at 4:29
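
Workarounds 2 and 3 above can be sketched in shell as follows. This is only an illustration under stated assumptions: the helper names merge_tr_files and add_font_entry are hypothetical and not part of Tesseract, the merged file is assumed to need the same lang.fontname.expN.tr naming pattern as the inputs (since 3.01 is said above to take the font from the file name), and the font_properties line is assumed to use the usual "fontname italic bold fixed serif fraktur" layout with all-zero flags as a placeholder.

```shell
#!/bin/sh
# Workaround 2 (sketch): concatenate the .tr files of ONE font into a single file.
# merge_tr_files is a hypothetical helper name, not a Tesseract tool.
# Usage: merge_tr_files OUTPUT.tr INPUT1.tr INPUT2.tr ...
merge_tr_files() {
    out="$1"
    shift
    # Write into a separate directory so the output never overwrites an input.
    mkdir -p "$(dirname "$out")"
    cat "$@" > "$out"
}

# Workaround 3 (sketch): register an extra font name in the font_properties file.
# Assumed line layout: <fontname> <italic> <bold> <fixed> <serif> <fraktur>, flags 0 or 1.
# The all-zero flags are a placeholder; set them to match the real font.
# Usage: add_font_entry FONTNAME FONT_PROPERTIES_FILE
add_font_entry() {
    printf '%s 0 0 0 0 0\n' "$1" >> "$2"
}

# Example with the file names from the original report (commented out, needs real data):
#   merge_tr_files merged/eng.sysd.exp0.tr eng.sysd.exp0.tr eng.sysd.exp1.tr
#   mftraining -F eng.font_properties -U unicharset -O eng.unicharset merged/eng.sysd.exp0.tr
#   add_font_entry sysd2 eng.font_properties
```

With workaround 2, mftraining then sees a single tr file for the font, which should avoid the one-config-per-font assert. With workaround 3, the second set of tif/box/tr files would also have to be renamed to carry the new font name (sysd2 in the example, an invented name).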

GoogleCodeExporter commented 9 years ago
Issue 578 has been merged into this issue.

Original comment by zde...@gmail.com on 18 Nov 2011 at 4:43

GoogleCodeExporter commented 9 years ago
Issue 587 has been merged into this issue.

Original comment by zde...@gmail.com on 24 Nov 2011 at 8:15

GoogleCodeExporter commented 9 years ago
Issue 562 has been merged into this issue.

Original comment by zde...@gmail.com on 23 Feb 2012 at 8:20

GoogleCodeExporter commented 9 years ago
Please test current svn code (3.02):

tesseract eng.sysd.exp0.tif eng.sysd.exp0 box.train
tesseract eng.sysd.exp1.tif eng.sysd.exp1 box.train
unicharset_extractor eng.sysd.exp0.box eng.sysd.exp1.box
shapeclustering -F font_properties -U unicharset eng.sysd.exp0.tr eng.sysd.exp1.tr
mftraining -F eng.font_properties -U unicharset -O eng.unicharset eng.sysd.exp0.tr eng.sysd.exp1.tr

Original comment by zde...@gmail.com on 30 Jul 2012 at 8:32