tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

dict: Wrong values for hyphen_unichar_id_ and slash_unichar_id_ in comparison #1955

Open stweil opened 5 years ago

stweil commented 5 years ago

Valgrind reports several uninitialized memory accesses during OCR of an image taken from issue #1205. This can potentially give wrong or random OCR results.

Output from valgrind --leak-check=full --track-origins=yes bin/debug/x86_64-linux-gnu/src/api/tesseract issues/1205/32625771-8c733d9c-c563-11e7-9538-9053c9f7c542.png - with eng.traineddata from http://github.com/tesseract-ocr/tessdata:

==10189== Conditional jump or move depends on uninitialised value(s)
==10189==    at 0x211EE4: tesseract::Dict::compound_marker(int) (dict.h:113)
==10189==    by 0x20E9A4: tesseract::LanguageModel::GenerateDawgInfo(bool, int, int, BLOB_CHOICE const&, tesseract::ViterbiStateEntry const*) (language_model.cpp:812)
==10189==    by 0x20DD30: tesseract::LanguageModel::AddViterbiStateEntry(unsigned char, float, bool, int, int, BLOB_CHOICE*, tesseract::LanguageModelState*, tesseract::ViterbiStateEntry*, tesseract::LMPainPoints*, WERD_RES*, tesseract::BestChoiceBundle*, BlamerBundle*) (language_model.cpp:600)
==10189==    by 0x20CFE2: tesseract::LanguageModel::UpdateState(bool, int, int, BLOB_CHOICE_LIST*, tesseract::LanguageModelState*, tesseract::LMPainPoints*, WERD_RES*, tesseract::BestChoiceBundle*, BlamerBundle*) (language_model.cpp:340)
==10189==    by 0x219E48: tesseract::Wordrec::UpdateSegSearchNodes(float, int, GenericVector<tesseract::SegSearchPending>*, WERD_RES*, tesseract::LMPainPoints*, tesseract::BestChoiceBundle*, BlamerBundle*) (segsearch.cpp:212)
==10189==    by 0x219B09: tesseract::Wordrec::InitialSegSearch(WERD_RES*, tesseract::LMPainPoints*, GenericVector<tesseract::SegSearchPending>*, tesseract::BestChoiceBundle*, BlamerBundle*) (segsearch.cpp:177)
==10189==    by 0x2192C5: tesseract::Wordrec::SegSearch(WERD_RES*, tesseract::BestChoiceBundle*, BlamerBundle*) (segsearch.cpp:54)
==10189==    by 0x204CD5: tesseract::Wordrec::chop_word_main(WERD_RES*) (chopper.cpp:430)
==10189==    by 0x21B6BF: tesseract::Wordrec::cc_recog(WERD_RES*) (tface.cpp:116)
==10189==    by 0x19EA58: tesseract::Tesseract::recog_word_recursive(WERD_RES*) (tfacepp.cpp:109)
==10189==    by 0x19E5B8: tesseract::Tesseract::recog_word(WERD_RES*) (tfacepp.cpp:48)
==10189==    by 0x1950FA: tesseract::Tesseract::tess_segment_pass_n(int, WERD_RES*) (tessbox.cpp:48)
==10189==  Uninitialised value was created by a heap allocation
==10189==    at 0x4835E2F: operator new(unsigned long) (vg_replace_malloc.c:334)
==10189==    by 0x12F831: tesseract::TessBaseAPI::Init(char const*, int, char const*, tesseract::OcrEngineMode, char**, int, GenericVector<STRING> const*, GenericVector<STRING> const*, bool, bool (*)(STRING const&, GenericVector<char>*)) (baseapi.cpp:389)
==10189==    by 0x12F67E: tesseract::TessBaseAPI::Init(char const*, char const*, tesseract::OcrEngineMode, char**, int, GenericVector<STRING> const*, GenericVector<STRING> const*, bool) (baseapi.cpp:352)
==10189==    by 0x12D7F5: main (tesseractmain.cpp:514)
==10189== 
==10189== Conditional jump or move depends on uninitialised value(s)
==10189==    at 0x211EE4: tesseract::Dict::compound_marker(int) (dict.h:113)
==10189==    by 0x20D7AB: tesseract::LanguageModel::SetTopParentLowerUpperDigit(tesseract::LanguageModelState*) const (language_model.cpp:488)
==10189==    by 0x20CD7F: tesseract::LanguageModel::UpdateState(bool, int, int, BLOB_CHOICE_LIST*, tesseract::LanguageModelState*, tesseract::LMPainPoints*, WERD_RES*, tesseract::BestChoiceBundle*, BlamerBundle*) (language_model.cpp:284)
==10189==    by 0x219E48: tesseract::Wordrec::UpdateSegSearchNodes(float, int, GenericVector<tesseract::SegSearchPending>*, WERD_RES*, tesseract::LMPainPoints*, tesseract::BestChoiceBundle*, BlamerBundle*) (segsearch.cpp:212)
==10189==    by 0x219B09: tesseract::Wordrec::InitialSegSearch(WERD_RES*, tesseract::LMPainPoints*, GenericVector<tesseract::SegSearchPending>*, tesseract::BestChoiceBundle*, BlamerBundle*) (segsearch.cpp:177)
==10189==    by 0x2192C5: tesseract::Wordrec::SegSearch(WERD_RES*, tesseract::BestChoiceBundle*, BlamerBundle*) (segsearch.cpp:54)
==10189==    by 0x204CD5: tesseract::Wordrec::chop_word_main(WERD_RES*) (chopper.cpp:430)
==10189==    by 0x21B6BF: tesseract::Wordrec::cc_recog(WERD_RES*) (tface.cpp:116)
==10189==    by 0x19EA58: tesseract::Tesseract::recog_word_recursive(WERD_RES*) (tfacepp.cpp:109)
==10189==    by 0x19E5B8: tesseract::Tesseract::recog_word(WERD_RES*) (tfacepp.cpp:48)
==10189==    by 0x1950FA: tesseract::Tesseract::tess_segment_pass_n(int, WERD_RES*) (tessbox.cpp:48)
==10189==    by 0x14F8D3: tesseract::Tesseract::match_word_pass_n(int, WERD_RES*, ROW*, BLOCK*) (control.cpp:1641)
==10189==  Uninitialised value was created by a heap allocation
==10189==    at 0x4835E2F: operator new(unsigned long) (vg_replace_malloc.c:334)
==10189==    by 0x12F831: tesseract::TessBaseAPI::Init(char const*, int, char const*, tesseract::OcrEngineMode, char**, int, GenericVector<STRING> const*, GenericVector<STRING> const*, bool, bool (*)(STRING const&, GenericVector<char>*)) (baseapi.cpp:389)
==10189==    by 0x12F67E: tesseract::TessBaseAPI::Init(char const*, char const*, tesseract::OcrEngineMode, char**, int, GenericVector<STRING> const*, GenericVector<STRING> const*, bool) (baseapi.cpp:352)
==10189==    by 0x12D7F5: main (tesseractmain.cpp:514)
==10189== 
==10189== Conditional jump or move depends on uninitialised value(s)
==10189==    at 0x211EE4: tesseract::Dict::compound_marker(int) (dict.h:113)
==10189==    by 0x20FA78: tesseract::LanguageModel::FillConsistencyInfo(int, bool, BLOB_CHOICE*, tesseract::ViterbiStateEntry*, WERD_RES*, tesseract::LMConsistencyInfo*) (language_model.cpp:1064)
==10189==    by 0x20E035: tesseract::LanguageModel::AddViterbiStateEntry(unsigned char, float, bool, int, int, BLOB_CHOICE*, tesseract::LanguageModelState*, tesseract::ViterbiStateEntry*, tesseract::LMPainPoints*, WERD_RES*, tesseract::BestChoiceBundle*, BlamerBundle*) (language_model.cpp:649)
==10189==    by 0x20D1B1: tesseract::LanguageModel::UpdateState(bool, int, int, BLOB_CHOICE_LIST*, tesseract::LanguageModelState*, tesseract::LMPainPoints*, WERD_RES*, tesseract::BestChoiceBundle*, BlamerBundle*) (language_model.cpp:371)
==10189==    by 0x219E48: tesseract::Wordrec::UpdateSegSearchNodes(float, int, GenericVector<tesseract::SegSearchPending>*, WERD_RES*, tesseract::LMPainPoints*, tesseract::BestChoiceBundle*, BlamerBundle*) (segsearch.cpp:212)
==10189==    by 0x219B09: tesseract::Wordrec::InitialSegSearch(WERD_RES*, tesseract::LMPainPoints*, GenericVector<tesseract::SegSearchPending>*, tesseract::BestChoiceBundle*, BlamerBundle*) (segsearch.cpp:177)
==10189==    by 0x2192C5: tesseract::Wordrec::SegSearch(WERD_RES*, tesseract::BestChoiceBundle*, BlamerBundle*) (segsearch.cpp:54)
==10189==    by 0x204CD5: tesseract::Wordrec::chop_word_main(WERD_RES*) (chopper.cpp:430)
==10189==    by 0x21B6BF: tesseract::Wordrec::cc_recog(WERD_RES*) (tface.cpp:116)
==10189==    by 0x19EA58: tesseract::Tesseract::recog_word_recursive(WERD_RES*) (tfacepp.cpp:109)
==10189==    by 0x19E5B8: tesseract::Tesseract::recog_word(WERD_RES*) (tfacepp.cpp:48)
==10189==    by 0x1950FA: tesseract::Tesseract::tess_segment_pass_n(int, WERD_RES*) (tessbox.cpp:48)
==10189==  Uninitialised value was created by a heap allocation
==10189==    at 0x4835E2F: operator new(unsigned long) (vg_replace_malloc.c:334)
==10189==    by 0x12F831: tesseract::TessBaseAPI::Init(char const*, int, char const*, tesseract::OcrEngineMode, char**, int, GenericVector<STRING> const*, GenericVector<STRING> const*, bool, bool (*)(STRING const&, GenericVector<char>*)) (baseapi.cpp:389)
==10189==    by 0x12F67E: tesseract::TessBaseAPI::Init(char const*, char const*, tesseract::OcrEngineMode, char**, int, GenericVector<STRING> const*, GenericVector<STRING> const*, bool) (baseapi.cpp:352)
==10189==    by 0x12D7F5: main (tesseractmain.cpp:514)

stweil commented 5 years ago

It looks like slash_unichar_id_ is never initialized.
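To illustrate the kind of problem Valgrind is pointing at, here is a minimal sketch. It is not Tesseract's actual code: DictSketch, hyphen_id_, slash_id_ and the -1 sentinel are hypothetical names chosen for the example. A plain int member of a heap-allocated object stays indeterminate unless something assigns it, so a comparison against it branches on uninitialised memory. Default member initializers are one possible way to make the comparison well defined.

```cpp
#include <cassert>

// Hypothetical stand-in for a dictionary class; not Tesseract's actual Dict.
class DictSketch {
 public:
  // True if id matches one of the stored compound-marker ids.
  bool compound_marker(int id) const {
    return id == hyphen_id_ || id == slash_id_;
  }

 private:
  // Without these default member initializers, `new DictSketch` would leave
  // both members indeterminate, and the comparison above would branch on
  // uninitialised memory (the situation Valgrind reports).
  int hyphen_id_ = -1;  // -1 is an assumed "no such character" sentinel
  int slash_id_ = -1;
};

// Heap-allocates the sketch, as the `operator new` frame in the trace does,
// and checks that the comparison now gives the same answer every time.
inline bool marker_check_is_deterministic() {
  DictSketch* dict = new DictSketch;
  const bool first = dict->compound_marker(42);
  const bool second = dict->compound_marker(42);
  delete dict;
  assert(first == second);
  return !first && !second;
}
```

Note that this only makes the program flow deterministic. Whether the members then hold meaningful ids is a separate question.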

stweil commented 5 years ago

Pull request #1956 fixes the Valgrind warnings, so the program flow is now well defined in dict.h:114. Nevertheless, I think that code still does not do what is expected: hyphen_unichar_id_ and slash_unichar_id_ are now initialized, but they do not hold the values they should have.
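The difference between "initialized" and "holding the right value" can be sketched as follows. This is a hypothetical illustration, not a description of what Tesseract does: CharsetSketch, MarkerDictSketch, LoadMarkers, kInvalidId and the ids 7 and 9 are all made up for the example. The assumption is that the marker ids only become meaningful once they are looked up in the loaded character set. Until then the comparison is well defined but never matches a real hyphen or slash.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical character set: maps a character's UTF-8 string to its id.
using CharsetSketch = std::unordered_map<std::string, int>;

constexpr int kInvalidId = -1;  // assumed "no such character" sentinel

// Hypothetical stand-in for a dictionary class; not Tesseract's actual Dict.
class MarkerDictSketch {
 public:
  // Looks up the real marker ids once the character set is available.
  void LoadMarkers(const CharsetSketch& charset) {
    hyphen_id_ = Lookup(charset, "-");
    slash_id_ = Lookup(charset, "/");
  }

  // True only for a valid id that equals the hyphen or slash id.
  bool compound_marker(int id) const {
    return id != kInvalidId && (id == hyphen_id_ || id == slash_id_);
  }

 private:
  static int Lookup(const CharsetSketch& charset, const std::string& ch) {
    const auto it = charset.find(ch);
    return it == charset.end() ? kInvalidId : it->second;
  }

  // Initialized, but only to a placeholder until LoadMarkers runs.
  int hyphen_id_ = kInvalidId;
  int slash_id_ = kInvalidId;
};

inline void demo_marker_values() {
  const CharsetSketch charset = {{"-", 7}, {"/", 9}};
  MarkerDictSketch dict;
  // Well defined but wrong: the real hyphen id is not recognized yet.
  assert(!dict.compound_marker(7));
  dict.LoadMarkers(charset);
  // After the lookup, both markers are recognized and other ids are not.
  assert(dict.compound_marker(7));
  assert(dict.compound_marker(9));
  assert(!dict.compound_marker(3));
}
```

In this sketch the first assertion is the interesting one: no tool would warn about it, yet the function gives the wrong answer for a real hyphen.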