gnewtothis101 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

mftraining segmentation fault with large 13,000+ character set #743

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1.  unicharset_extractor -F font_properties zh.mingliu.exp0.box
2.  mftraining -F font_properties -U unicharset zh.mingliu.exp0.tr

What is the expected output? What do you see instead?
mftraining is expected to complete, but instead it segfaults, so training 
cannot continue.

What version of the product are you using? On what operating system?
linux
tesseract 3.02
 leptonica-1.69
  libgif 4.1.6 : libjpeg 6b : libpng 1.2.44 : libtiff 3.8.2 : zlib 1.2.5

Please provide any additional information below.
-Trying to train 13,000+ Chinese characters from 22 pages of TIFF images.
-Tried to isolate the problem by reducing the size of the character set.
-After splitting the 22 pages into one set of 12 pages and one set of 10 pages, 
mftraining passes on each set individually.
-I therefore believe something is limiting the size of the character set.
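
One way to test the suspicion that the character count is the trigger is to 
count the distinct symbols in each box file and compare the full set against 
the two splits. The sketch below assumes the usual box-file layout, where the 
symbol is the first whitespace-separated field on each line. 
CountDistinctBoxSymbols is a hypothetical helper, not part of Tesseract, and 
its count only approximates the final unicharset size, since 
unicharset_extractor may add entries of its own.

```cpp
#include <cstddef>
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper (not a Tesseract function): counts the distinct
// symbols in box-file text, assuming the symbol is the first
// whitespace-separated field on each line.
std::size_t CountDistinctBoxSymbols(std::istream& in) {
  std::set<std::string> symbols;
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream fields(line);
    std::string symbol;
    // Lines with no fields (for example blank lines) are skipped.
    if (fields >> symbol) {
      symbols.insert(symbol);
    }
  }
  return symbols.size();
}
```

Opening zh.mingliu.exp0.box with std::ifstream and passing the stream to this 
function would give a rough class count for that file.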

Original issue reported on code.google.com by whoister...@gmail.com on 14 Aug 2012 at 7:51

GoogleCodeExporter commented 9 years ago
The segmentation fault happens in mastertrainer.cpp, inside 
MasterTrainer::WriteInttempAndPFFMTable, at this call:

   INT_TEMPLATES int_templates = classify->CreateIntTemplates(float_classes, shape_set);

Original comment by whoister...@gmail.com on 14 Aug 2012 at 9:18

GoogleCodeExporter commented 9 years ago
Added some printfs in intproto.cpp to pinpoint the crash further:

...
ClassId=12679 NumProtos=63 NumConfigs=1 fs.size=1 ProtoId=63 ConfigId=1
ClassId=12680 NumProtos=60 NumConfigs=1 fs.size=1 ProtoId=60 ConfigId=1
ClassId=12681 NumProtos=53 NumConfigs=1 fs.size=1 ProtoId=53 ConfigId=1
ClassId=12682 NumProtos=47 NumConfigs=1 fs.size=1 ProtoId=47 ConfigId=1
ClassId=12683 NumProtos=49 NumConfigs=1 fs.size=1 ProtoId=49 ConfigId=1
ClassId=12684 NumProtos=49 NumConfigs=1 fs.size=1

It looks like the run almost completed, but it segfaults in the following code:

    for (ProtoId = 0; ProtoId < FClass->NumProtos; ProtoId++) {
      AddIntProto(IClass);
      ConvertProto(ProtoIn(FClass, ProtoId), ProtoId, IClass);
      AddProtoToProtoPruner(ProtoIn(FClass, ProtoId), ProtoId, IClass,
                            classify_learning_debug_level >= 2);
      AddProtoToClassPruner(ProtoIn(FClass, ProtoId), ClassId, IntTemplates);
    }

My guess is that memory allocated earlier runs out at this point. I suspect 
the allocation is
  CLASS_STRUCT* float_classes = SetUpForFloat2Int(*unicharset, mf_classes);
in mftraining.cpp.

Original comment by whoister...@gmail.com on 14 Aug 2012 at 10:48

GoogleCodeExporter commented 9 years ago
It turns out there is a ClassId limit in dict/matchdefs.h:

/* define the maximum number of classes defined for any matcher
  and the maximum class id for any matcher. This must be changed
  if more different classes need to be classified */
#define MAX_NUM_CLASSES   12288

Increasing it lets mftraining pass:
#define MAX_NUM_CLASSES   22288
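
A crash like this could be turned into a readable error by checking the class 
count against the limit before the templates are built. The following is only a 
sketch of that idea, not code from Tesseract: kMaxNumClasses mirrors the 
original value quoted above, and FitsClassLimit and CheckClassLimit are 
hypothetical helpers.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Stand-in for MAX_NUM_CLASSES from matchdefs.h, using the original value
// quoted in this thread.
constexpr std::size_t kMaxNumClasses = 12288;

// Hypothetical helper: true if num_classes fits within the limit.
bool FitsClassLimit(std::size_t num_classes,
                    std::size_t limit = kMaxNumClasses) {
  return num_classes <= limit;
}

// Hypothetical helper: throws a descriptive error when the class count
// exceeds the limit, instead of letting later code fail obscurely.
void CheckClassLimit(std::size_t num_classes,
                     std::size_t limit = kMaxNumClasses) {
  if (!FitsClassLimit(num_classes, limit)) {
    throw std::length_error("unicharset has " + std::to_string(num_classes) +
                            " classes, but the limit is " +
                            std::to_string(limit));
  }
}
```

With the original limit, CheckClassLimit(13000) throws. With the raised limit, 
CheckClassLimit(13000, 22288) returns normally.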

Original comment by whoister...@gmail.com on 15 Aug 2012 at 2:17

GoogleCodeExporter commented 9 years ago
Issue 670 has been merged into this issue.

Original comment by theraysm...@gmail.com on 21 Sep 2012 at 12:24

GoogleCodeExporter commented 9 years ago
This issue is still present: I just hit the exact same problem with the 
current checkout.

Luckily whoister...'s solution still works, so I've created a diff patch to fix 
the issue. (attached)

Is a larger MAX_NUM_CLASSES known to have an impact on accuracy, memory use, 
or computation time?

Original comment by clements...@gmail.com on 8 Dec 2013 at 3:17

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r978.

Original comment by theraysm...@gmail.com on 10 Jan 2014 at 6:25

GoogleCodeExporter commented 9 years ago
After more testing, the answer is yes: a larger MAX_NUM_CLASSES has a 
significant speed impact. A fix for this problem is now in test and will be 
included in 3.03, with MAX_NUM_CLASSES set to MAX_INT16.

Original comment by theraysm...@gmail.com on 24 Jan 2014 at 8:03