raffaeldantas / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Crash in language model (v 3.03) #1248

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I recently built a new traineddata file as described in the article 
"TrainingTesseract3". When I load it, the application crashes.

The reason is that a blob has the font ID = 35.
This is nonsense because I trained only one font.
It is still mysterious to me how this can happen.
There is another bug around!

But the really bad thing is that the entire application crashes because of that.
I know that Tesseract was written in the stone age of computing, when all 
applications were DOS applications. So the behavior was to print an assert 
message on the screen and abort the program.

But today for a GUI application it is completely unacceptable that the 
application crashes because of a corrupt traineddata file.

Please add the following lines to LanguageModel::FillConsistencyInfo() which 
avoid this fatal crash. I know that this is not perfect. The function should 
return false or throw an exception, but it is better than crashing.

  // Check font and spacing consistency.
  if (fontinfo_table_->size() > 0 && parent_b != NULL) 
  {
    // Avoid application crash:
    if (b->fontinfo_id()         >= fontinfo_table_->size() ||
        b->fontinfo_id2()        >= fontinfo_table_->size() ||
        parent_b->fontinfo_id()  >= fontinfo_table_->size() ||
        parent_b->fontinfo_id2() >= fontinfo_table_->size())
    {
        tprintf("FATAL ERROR: The traineddata file is corrupt (Inconsistent font info)\n");
        consistency_info->inconsistent_font = true;
        return;
    }

etc..
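The idea behind the check above can be shown in isolation. The following is a minimal sketch, not Tesseract code: `FontInfo`, `Blob`, `ValidFontId` and `CheckFontIds` are hypothetical stand-ins invented for illustration. It returns false instead of aborting when a font ID does not index into the table. Rejecting negative IDs is an assumption of the sketch, not something established about Tesseract.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT Tesseract's types.
struct FontInfo {
  const char* name;
};

struct Blob {
  int font_id;
  int font_id2;
};

// Returns true when `id` is a usable index into `table`.
// Assumption: negative IDs are treated as invalid here.
static bool ValidFontId(int id, const std::vector<FontInfo>& table) {
  return id >= 0 && static_cast<std::size_t>(id) < table.size();
}

// Reports failure to the caller instead of asserting when any of the four
// font IDs is out of range, so the caller can decide how to recover.
bool CheckFontIds(const Blob& b, const Blob& parent,
                  const std::vector<FontInfo>& table) {
  const int ids[] = {b.font_id, b.font_id2, parent.font_id, parent.font_id2};
  for (int id : ids) {
    if (!ValidFontId(id, table)) {
      std::fprintf(stderr, "Font ID %d is out of range (table size %zu)\n",
                   id, table.size());
      return false;
    }
  }
  return true;
}
```

With a table holding a single font, a blob carrying font ID 35 makes `CheckFontIds` return false, and the caller can mark the result as inconsistent or stop recognition without taking down the whole process.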

Original issue reported on code.google.com by smaragds...@gmail.com on 4 Jul 2014 at 2:55

GoogleCodeExporter commented 9 years ago
This crash happens when you build the traineddata file without shapetable.

This contradicts the manual, which says that the shapetable should 
currently not be used except for the Indic languages. 

Original comment by smaragds...@gmail.com on 4 Jul 2014 at 6:27

GoogleCodeExporter commented 9 years ago
But there is also a note that "If you get error message like this..." (you did 
not provide any error messages, just your thoughts), you should add the 
shapetable ;-) 

Original comment by zde...@gmail.com on 4 Jul 2014 at 7:20

GoogleCodeExporter commented 9 years ago
Yes you are right. There is a note. 
Nevertheless the program should not crash because of that.

Original comment by smaragds...@gmail.com on 21 Jul 2014 at 5:25

GoogleCodeExporter commented 9 years ago
Documentation corrected.
Whether or not you run shapeclustering, the shapetable *must* be included in 
the traineddata file. One is created by mftraining if you don't run 
shapeclustering.
If you don't have one, you are likely to end up with garbled text output, so an 
assert failure is a good detector for it.

Original comment by theraysm...@gmail.com on 8 Oct 2014 at 1:08

GoogleCodeExporter commented 9 years ago
An assert failure is NEVER the correct way to indicate a severe error!

It is the opposite: It is a severe design error in Tesseract to trust so much 
in asserts.

If the application is compiled as Release, all the asserts are removed by the 
compiler!
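To illustrate the point, here is a minimal sketch using the standard `assert` macro from `<cassert>`, which expands to nothing when `NDEBUG` is defined, as is typical for release builds. `LookupWithAssert` and `LookupChecked` are hypothetical names invented for this example. Whether Tesseract's own assertion macros behave the same way is not established in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical lookup guarded only by the standard assert macro.
// When NDEBUG is defined, the assert line compiles to nothing and an
// out-of-range index reaches table[index] unchecked.
int LookupWithAssert(const std::vector<int>& table, int index) {
  assert(index >= 0 && static_cast<std::size_t>(index) < table.size());
  return table[index];
}

// Hypothetical lookup with an explicit check. The check is ordinary code,
// so it is present in every build configuration, and the caller receives
// a failure result instead of an abort.
bool LookupChecked(const std::vector<int>& table, int index, int* out) {
  if (index < 0 || static_cast<std::size_t>(index) >= table.size()) {
    return false;
  }
  *out = table[index];
  return true;
}
```

The second form costs one comparison per lookup, but it keeps working when the build defines `NDEBUG` and it leaves the recovery decision to the caller.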

I see that in Tesseract many wrong decisions are taken and valid reports 
are closed without really solving the problem!

Original comment by smaragds...@gmail.com on 19 Dec 2014 at 1:01