tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Segmentation fault when set variable "classify_enable_adaptive_matcher" = 0 #256

Open nam-leduc opened 8 years ago

nam-leduc commented 8 years ago

I followed the advice in this FAQ entry: FAQ There are inconsistent r....

But when I use tesseract with these options:

classify_enable_learning 0
classify_enable_adaptive_matcher 0

I received the following message: segmentation fault

I think this is a bug, because setting options in a config file is common for users. I searched the forums but found no topic discussing this issue.

zdenop commented 8 years ago

Please also provide the input files (test1.tif and config.txt).

nam-leduc commented 8 years ago

Hi zdenop, thanks for your quick response. I would like to attach two files: config.txt

I cannot upload the tif file to this comment, so I uploaded it to my repository. You can get it from the following link: https://github.com/nam-leduc/positioning/blob/master/test1.tif

Best regards, Le Duc. Nam

zdenop commented 8 years ago

What OS are you using? Did you try to install tesseract (as I see from the screenshot, you are running a tesseract that is not installed) and then use the installed tesseract? Do you have more than one version of tesseract installed?

amitdo commented 8 years ago

I can reproduce this issue.

config.txt

tesseract phototest.tif phototest config.txt
Tesseract Open Source OCR Engine v3.05.00dev-266-gb1c1382 with Leptonica
Page 1
Segmentation fault (core dumped)

But this one works...

tesseract phototest.tif phototest -c classify_enable_learning=0 -c classify_enable_adaptive_matcher=0
Tesseract Open Source OCR Engine v3.05.00dev-266-gb1c1382 with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

phototest.txt

amitdo commented 8 years ago

gdb tesseract

(gdb) run phototest.tif phototest config.txt
Starting program: /usr/local/bin/tesseract phototest.tif phototest config.txt
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Tesseract Open Source OCR Engine v3.05.00dev-266-gb1c1382 with Leptonica
Page 1

Program received signal SIGSEGV, Segmentation fault.
tesseract::Tesseract::recog_all_words (this=0x808c00, page_res=0x81cb90, 
    monitor=monitor@entry=0x0, target_word_box=target_word_box@entry=0x0, 
    word_config=word_config@entry=0x0, dopasses=dopasses@entry=0)
    at control.cpp:320
320     } else if (!AdaptiveClassifierIsEmpty()) {

(gdb) backtrace
#0  tesseract::Tesseract::recog_all_words (this=0x808c00, 
    page_res=0x81cb90, monitor=monitor@entry=0x0, 
    target_word_box=target_word_box@entry=0x0, 
    word_config=word_config@entry=0x0, dopasses=dopasses@entry=0)
    at control.cpp:320
#1  0x00007ffff769929d in tesseract::TessBaseAPI::Recognize (
    this=this@entry=0x7fffffffdce0, monitor=0x0) at baseapi.cpp:902
#2  0x00007ffff76994e4 in tesseract::TessBaseAPI::ProcessPage (
    this=this@entry=0x7fffffffdce0, pix=0x83f110, 
    page_index=page_index@entry=0, 
    filename=filename@entry=0x7fffffffe257 "phototest.tif", 
    retry_config=retry_config@entry=0x0, 
    timeout_millisec=timeout_millisec@entry=0, renderer=renderer@entry=
    0x81cb50) at baseapi.cpp:1231
#3  0x00007ffff7699a5c in tesseract::TessBaseAPI::ProcessPagesMultipageTiff (this=this@entry=0x7fffffffdce0, data=data@entry=0xdde558 "II*", 
    size=38668, filename=filename@entry=0x7fffffffe257 "phototest.tif", 
    retry_config=retry_config@entry=0x0, 
    timeout_millisec=timeout_millisec@entry=0, 
    renderer=renderer@entry=0x81cb50, tessedit_page_number=-1)
    at baseapi.cpp:1064
#4  0x00007ffff769a0c3 in tesseract::TessBaseAPI::ProcessPagesInternal (
    this=this@entry=0x7fffffffdce0, filename=<optimized out>, 
---Type <return> to continue, or q <return> to quit---
    retry_config=retry_config@entry=0x0, 
    timeout_millisec=timeout_millisec@entry=0, renderer=0x81cb50)
    at baseapi.cpp:1183
#5  0x00007ffff769a2f0 in tesseract::TessBaseAPI::ProcessPages (
    this=this@entry=0x7fffffffdce0, filename=<optimized out>, 
    retry_config=retry_config@entry=0x0, 
    timeout_millisec=timeout_millisec@entry=0, renderer=<optimized out>)
    at baseapi.cpp:1081
#6  0x0000000000401f2a in main (argc=<optimized out>, argv=0x7fffffffde78)
    at tesseractmain.cpp:448

zdenop commented 8 years ago

But I can not ;-):

tesseract test1.tif test1.tif config.txt 
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
Page 2

neither on Linux (3.05.00dev-266-gb1c1382) nor on Windows 7 (tesseract 3.04.01)

amitdo commented 8 years ago

Use my config.txt ...

He attached a config.txt with only one line, but wrote:

But when I use tesseract with these options:

classify_enable_learning 0
classify_enable_adaptive_matcher 0

My system is Ubuntu 14.04.

zdenop commented 8 years ago

Thanks! Now I am able to reproduce it (crash with config file and no crash with "-c").

amitdo commented 8 years ago

classify_enable_adaptive_matcher 0 in the config file is causing the crash, not classify_enable_learning 0.

Updated file: config.txt

nam-leduc commented 8 years ago

Hi @amitdo and @zdenop,

I'm sorry. I tried other config options to check which one causes the crash, but I forgot to restore the original config file.

classify_enable_learning 0
classify_enable_adaptive_matcher 0

tfmorris commented 8 years ago

It doesn't look to me like classify_enable_adaptive_matcher=0 is really supported any more. A bunch of the new code that's been added isn't conditionalized to check it.

The reason that it doesn't crash when the config variable is set on the command line is because that's done after the recognizer is initialized, so the necessary data structure has been created.

amitdo commented 8 years ago

Even 3.03 crashes with this config file.

amitdo commented 8 years ago

The reason that it doesn't crash when the config variable is set on the command line is because that's done after the recognizer is initialized, so the necessary data structure has been created.

Can you elaborate on this?

tfmorris commented 8 years ago

The config file is processed in the Init call here:

https://github.com/tesseract-ocr/tesseract/blob/master/api/tesseractmain.cpp#L372

while the command line config variables are processed in the call to SetVariablesFromCLArgs here:

https://github.com/tesseract-ocr/tesseract/blob/master/api/tesseractmain.cpp#L379

after the adaptive matcher has already been set up.

Even though the command line case doesn't crash, it is still using the adaptive matcher because the code that references it isn't guarded by the necessary config variable.
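The ordering described above can be illustrated with a minimal sketch. This is a toy model, not Tesseract code: `ToyRecognizer`, `AdaptiveTemplates`, and every member name are hypothetical, and the sketch only assumes what the comment states, namely that the flag is consulted during initialization and that later code uses the matcher without checking the flag.

```cpp
#include <memory>

// Hypothetical stand-in for the adaptive matcher's data structure.
struct AdaptiveTemplates {
  int num_classes = 0;
};

// Toy recognizer. None of these names are Tesseract's real API.
class ToyRecognizer {
 public:
  // Models the classify_enable_adaptive_matcher config variable.
  bool enable_adaptive_matcher = true;

  // Models Init(): the flag is read once, here, to decide what to allocate.
  void Init() {
    if (enable_adaptive_matcher) {
      templates_ = std::make_unique<AdaptiveTemplates>();
    }
  }

  // True if the data structure that unguarded code would dereference exists.
  bool HasTemplates() const { return templates_ != nullptr; }

  // Guarded access: checks the pointer before touching the matcher,
  // so it is safe whether or not Init() allocated anything.
  int GuardedClassCount() const {
    return templates_ ? templates_->num_classes : 0;
  }

 private:
  std::unique_ptr<AdaptiveTemplates> templates_;
};

// Config-file case: the flag is cleared before Init(), so nothing is
// allocated and any unguarded dereference later would crash.
inline bool AllocatedWhenFlagSetBeforeInit() {
  ToyRecognizer r;
  r.enable_adaptive_matcher = false;
  r.Init();
  return r.HasTemplates();
}

// Command-line case: the flag is cleared after Init(), so the structure
// already exists and unguarded code keeps using it despite the flag.
inline bool AllocatedWhenFlagSetAfterInit() {
  ToyRecognizer r;
  r.Init();
  r.enable_adaptive_matcher = false;
  return r.HasTemplates();
}
```

In this model the first helper returns false and the second returns true, which mirrors the observed behaviour: a crash with the config file, and no crash (but the matcher still in use) with "-c".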

zdenop commented 8 years ago

@tfmorris @amitdo: besides this issue, this behaviour should be documented: option "-c" cannot be used for init-only parameters. Or should we change the parsing of "-c" params?

amitdo commented 8 years ago

I think we should print a warning if someone tries to set an init parameter using '-c var=val' on the command line. The relevant function is SetParam in ccutil/params.cpp.
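A rough sketch of what such a warning could look like follows. It is a toy parameter table, not the real SetParam from ccutil/params.cpp: the class `ToyParams`, its methods, and the `init_only` field are all assumptions made for illustration.

```cpp
#include <iostream>
#include <map>
#include <string>

// Hypothetical parameter record; not Tesseract's real parameter layout.
struct ToyParam {
  std::string value;
  bool init_only;  // true if the value is only read during initialization
};

class ToyParams {
 public:
  void Register(const std::string& name, const std::string& default_value,
                bool init_only) {
    params_[name] = ToyParam{default_value, init_only};
  }

  // Call once initialization has finished.
  void MarkInitialized() { initialized_ = true; }

  // Stores the value and returns true, or warns and returns false when the
  // name is unknown or an init-only parameter is set after initialization.
  bool SetParam(const std::string& name, const std::string& value) {
    auto it = params_.find(name);
    if (it == params_.end()) {
      std::cerr << "Warning: unknown parameter '" << name << "'\n";
      return false;
    }
    if (initialized_ && it->second.init_only) {
      std::cerr << "Warning: '" << name
                << "' is init-only and cannot be changed after init\n";
      return false;
    }
    it->second.value = value;
    return true;
  }

  // Throws std::out_of_range for names that were never registered.
  std::string Get(const std::string& name) const {
    return params_.at(name).value;
  }

 private:
  std::map<std::string, ToyParam> params_;
  bool initialized_ = false;
};
```

Rejecting the late assignment, rather than silently accepting it, keeps the stored value consistent with what initialization actually used.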

tfmorris commented 8 years ago

Good suggestions, but neither is relevant here because classify_enable_adaptive_matcher isn't an init only parameter.

The issue is that the code has evolved so that classify_enable_adaptive_matcher=0 is no longer supported. There are sections of code which don't check this config variable and which assume that the adaptive matcher is correctly initialized. We can either drop the config variable or fix the code so that the variable protects everything that needs to be protected. I don't know how much work that'll be, but it's more than just this one place, because I fixed it and it just died somewhere else. No idea how many places there are to fix or whether it makes sense from @theraysmith's point of view to continue supporting this case.

In my opinion, the current order of evaluation (config files, then command line) is correct because it allows the config file to be overridden by the command line.
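The override order argued for here can be sketched as two passes over one settings map, with the command line applied last. The helper names (`ApplyConfigLines`, `ApplyCommandLine`, `Resolve`) are hypothetical; only the "name value" config layout and the "name=value" form after -c are taken from this thread.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

using Settings = std::map<std::string, std::string>;

// Applies "name value" lines, the layout of the config.txt in this thread.
void ApplyConfigLines(const std::vector<std::string>& lines, Settings& s) {
  for (const auto& line : lines) {
    std::istringstream in(line);
    std::string name, value;
    if (in >> name >> value) s[name] = value;  // skip blank or short lines
  }
}

// Applies "name=value" strings, the form passed after -c.
void ApplyCommandLine(const std::vector<std::string>& args, Settings& s) {
  for (const auto& arg : args) {
    const auto eq = arg.find('=');
    if (eq == std::string::npos || eq == 0) continue;  // skip malformed args
    s[arg.substr(0, eq)] = arg.substr(eq + 1);
  }
}

// Config file first, command line second, so the command line wins.
Settings Resolve(const std::vector<std::string>& config_lines,
                 const std::vector<std::string>& cl_args) {
  Settings s;
  ApplyConfigLines(config_lines, s);
  ApplyCommandLine(cl_args, s);
  return s;
}
```

With this ordering, a value given with -c replaces the one from the config file, while settings that appear only in the config file are left untouched.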

amitdo commented 8 years ago

About classify_enable_adaptive_matcher - Ray should handle it.

amitdo commented 8 years ago

But the FAQ should be fixed.

stweil commented 5 years ago

Is this still an issue with latest Tesseract? We'd like to know that before releasing Tesseract 4.0.0.

nam-leduc commented 4 years ago

Dear @stweil ,

I tested with the latest version of tesseract and this issue still happens.

Environment

Tesseract input/output

Discussion

I don't know whether tesseract still maintains the functionality for the config "classify_enable_adaptive_matcher 0". However, I see that the recommendation for the above setting is no longer in the tesseract FAQ https://tesseract-ocr.github.io/tessdoc/FAQ. Do you think that is enough to close this bug?

Best regards, Le Duc. Nam

stweil commented 4 years ago

Thank you for testing. A segmentation fault is always something which has to be fixed, so this issue should be kept open.

amitdo commented 3 years ago

@tfmorris commented:

The issue is that the code has evolved so that classify_enable_adaptive_matcher=0 is no longer supported. There are sections of code which don't check this config variable and which assume that the adaptive matcher is correctly initialized. We can either drop the config variable or fix the code so that the variable protects everything that needs to be protected. I don't know how much work that'll be, but it's more than just this one place, because I fixed it and it just died somewhere else. No idea how many places there are to fix or whether it makes sense from @theraysmith's point of view to continue supporting this case.

This issue has been open since March 2016. I suggest removing the classify_enable_adaptive_matcher variable from classify.cpp and classify.h and fixing two conditions in adaptmatch.cpp.

amitdo commented 2 years ago

@stweil, what about this issue?

stweil commented 2 years ago

I had a look at it, but saw no fast solution up to now. So I am afraid it will have to wait until after 5.0.0-rc1.

amitdo commented 2 years ago

I had a look at it, but saw no fast solution up to now.

The fast solution is to disable the variable classify_enable_adaptive_matcher (with #if 0) or remove it to prevent a crash.

In the future, if you find a way to prevent the crash, you can undo this removal.

stweil commented 2 years ago

That's right. It now still exists, but has no effect.