tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

~ObjectCache(): WARNING! LEAK! object #529

Closed Shreeshrii closed 7 years ago

Shreeshrii commented 7 years ago

While trying to process a GIF file with a Leptonica that was not built with giflib, I get the following messages:

/mnt/c/Users/User/shree$ tesseract san001.gif san001-gif --psm 6 --oem 4 -l san
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Error in pixReadStreamGif: function not present
Error in pixReadStream: gif: no pix returned
Error in pixRead: pix not read
Error during processing.
ObjectCache(0x7f6a7eb9e5c0)::~ObjectCache(): WARNING! LEAK! object 0x1fd72f0 still has count 1 (id /mnt/c/Users/User/tesseract-ocr/tessdata/san.traineddatalstm-punc-dawg)
ObjectCache(0x7f6a7eb9e5c0)::~ObjectCache(): WARNING! LEAK! object 0x1fd8290 still has count 1 (id /mnt/c/Users/User/tesseract-ocr/tessdata/san.traineddatalstm-word-dawg)
ObjectCache(0x7f6a7eb9e5c0)::~ObjectCache(): WARNING! LEAK! object 0x1fd7110 still has count 1 (id /mnt/c/Users/User/tesseract-ocr/tessdata/san.traineddatalstm-number-dawg)
ObjectCache(0x7f6a7eb9e5c0)::~ObjectCache(): WARNING! LEAK! object 0x426afd0 still has count 1 (id /mnt/c/Users/User/tesseract-ocr/tessdata/san.traineddatapunc-dawg)
ObjectCache(0x7f6a7eb9e5c0)::~ObjectCache(): WARNING! LEAK! object 0x1fd6f20 still has count 1 (id /mnt/c/Users/User/tesseract-ocr/tessdata/san.traineddataword-dawg)
ObjectCache(0x7f6a7eb9e5c0)::~ObjectCache(): WARNING! LEAK! object 0x426ae60 still has count 1 (id /mnt/c/Users/User/tesseract-ocr/tessdata/san.traineddatanumber-dawg)
ObjectCache(0x7f6a7eb9e5c0)::~ObjectCache(): WARNING! LEAK! object 0x472cef0 still has count 1 (id /mnt/c/Users/User/tesseract-ocr/tessdata/san.traineddatabigram-dawg)
ObjectCache(0x7f6a7eb9e5c0)::~ObjectCache(): WARNING! LEAK! object 0x4784e50 still has count 1 (id /mnt/c/Users/User/tesseract-ocr/tessdata/san.traineddatafreq-dawg)

/mnt/c/Users/User/shree$ tesseract san001.gif san001-gif --psm 6 --oem 3 -l san
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Error in pixReadStreamGif: function not present
Error in pixReadStream: gif: no pix returned
Error in pixRead: pix not read
Error during processing.
ObjectCache(0x7fc1e7d9e5c0)::~ObjectCache(): WARNING! LEAK! object 0x58291f0 still has count 1 (id /mnt/c/Users/User/tesseract-ocr/tessdata/san.traineddatapunc-dawg)
ObjectCache(0x7fc1e7d9e5c0)::~ObjectCache(): WARNING! LEAK! object 0x5829010 still has count 1 (id /mnt/c/Users/User/tesseract-ocr/tessdata/san.traineddataword-dawg)
ObjectCache(0x7fc1e7d9e5c0)::~ObjectCache(): WARNING! LEAK! object 0x58290d0 still has count 1 (id /mnt/c/Users/User/tesseract-ocr/tessdata/san.traineddatanumber-dawg)
ObjectCache(0x7fc1e7d9e5c0)::~ObjectCache(): WARNING! LEAK! object 0x5829190 still has count 1 (id /mnt/c/Users/User/tesseract-ocr/tessdata/san.traineddatabigram-dawg)
ObjectCache(0x7fc1e7d9e5c0)::~ObjectCache(): WARNING! LEAK! object 0x5828fb0 still has count 1 (id /mnt/c/Users/User/tesseract-ocr/tessdata/san.traineddatafreq-dawg)

amitdo commented 7 years ago

Error in pixReadStreamGif: function not present
Error in pixReadStream: gif: no pix returned
Error in pixRead: pix not read

These error messages are from Leptonica.

Error during processing.

This one and the ObjectCache scary messages are from Tesseract.

https://github.com/tesseract-ocr/tesseract/blob/a75ab450a/ccutil/object_cache.h#L42

Looks like a bug in Tesseract.

Shreeshrii commented 7 years ago

The error is related to the input file not being found.

C:\Users\User>tesseract abc.jpg abc
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Error in fopenReadStream: file not found
Error in findFileFormat: image file not found
Error during processing.
ObjectCache(5A7A9AC8)::~ObjectCache(): WARNING! LEAK! object 032C6178 still has count 1 (id C:\Program Files (x86)\Tesseract-OCR/tessdata/eng.traineddatapunc-dawg)
ObjectCache(5A7A9AC8)::~ObjectCache(): WARNING! LEAK! object 032C51C8 still has count 1 (id C:\Program Files (x86)\Tesseract-OCR/tessdata/eng.traineddataword-dawg)
ObjectCache(5A7A9AC8)::~ObjectCache(): WARNING! LEAK! object 032C5278 still has count 1 (id C:\Program Files (x86)\Tesseract-OCR/tessdata/eng.traineddatanumber-dawg)
ObjectCache(5A7A9AC8)::~ObjectCache(): WARNING! LEAK! object 032C9C28 still has count 1 (id C:\Program Files (x86)\Tesseract-OCR/tessdata/eng.traineddatabigram-dawg)
ObjectCache(5A7A9AC8)::~ObjectCache(): WARNING! LEAK! object 032C50C8 still has count 1 (id C:\Program Files (x86)\Tesseract-OCR/tessdata/eng.traineddatafreq-dawg)

C:\Users\User>

prodanovic commented 7 years ago

Did you succeed in making it run on any image? I get the same error message with both .png and .jpg files.

Shreeshrii commented 7 years ago

Check the version of Leptonica and the image libs with

tesseract -v

See if the PNG and JPG libs are listed.

In my case, Leptonica was built without giflib, so it does not process GIFs. PNG and JPG files are processed, though Leptonica prints some info and warning messages.

The latest GitHub versions of Leptonica and Tesseract emit fewer of these messages.
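
A minimal sketch of automating that check, assuming the text printed by `tesseract -v` has already been captured into a string. The helper name `ListsLibrary` is hypothetical, and the matching is only a heuristic based on the `name version : name version` layout seen in this thread, which may differ between builds.

```cpp
#include <cstddef>
#include <string>

// Returns true if `version_output` (the text printed by `tesseract -v`)
// contains an entry that starts with the library name `lib`, e.g. "libgif".
// Hypothetical helper: it only looks for `lib` preceded by a space or a
// newline (or at the very start) and followed by a space.
bool ListsLibrary(const std::string& version_output, const std::string& lib) {
  if (lib.empty()) return false;
  std::size_t pos = 0;
  while ((pos = version_output.find(lib, pos)) != std::string::npos) {
    const std::size_t end = pos + lib.size();
    const bool starts_entry = pos == 0 || version_output[pos - 1] == ' ' ||
                              version_output[pos - 1] == '\n';
    const bool ends_entry =
        end < version_output.size() && version_output[end] == ' ';
    if (starts_entry && ends_entry) return true;
    pos = end;  // keep searching after this occurrence
  }
  return false;
}
```

If `ListsLibrary(output, "libgif")` is false, GIF input can be expected to fail with the `pixReadStreamGif: function not present` message shown at the top of this issue.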

prodanovic commented 7 years ago

You are completely right. When building Leptonica from source I relied on the instructions from the Tesseract Wiki, which are incomplete.

From http://www.leptonica.org/source/README.html#DEPENDENCIES:

Leptonica is configured to handle image I/O using these external libraries: libjpeg, libtiff, libpng, libz, libgif, libwebp, libopenjp2. These libraries are easy to obtain. For example, using the debian package manager: sudo apt-get install <package> where <package> = {libpng12-dev, libjpeg62-dev, libtiff4-dev}.

After I rebuilt everything, the PNG and JPEG libs were integrated:

ubuntu@XXX$ tesseract -v
tesseract 4.00.00alpha
 leptonica-1.73
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.51 : libtiff 4.0.3 : zlib 1.2.8

Thanks!

Shreeshrii commented 7 years ago

The error also appears when tesseract is not able to write an output file, e.g. because of a missing output directory or other such I/O issues.

Shreeshrii commented 7 years ago

I am getting the LEAK! warning when the input file is not found because the wrong name was given.

shree@ALL-IN-1-TOUCH:/mnt/c/Users/User/shree/kannada$ tesseract scan001.tif scan001 --oem 1 -l kan makebox
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Error in fopenReadStream: file not found
Error in findFileFormat: image file not found
Error during processing.
ObjectCache(0x7fd27279eac0)::~ObjectCache(): WARNING! LEAK! object 0x2b114e0 still has count 1 (id /mnt/c/Users/User/shree/tessdata/kan.traineddatalstm-punc-dawg)
ObjectCache(0x7fd27279eac0)::~ObjectCache(): WARNING! LEAK! object 0x2b127c0 still has count 1 (id /mnt/c/Users/User/shree/tessdata/kan.traineddatalstm-word-dawg)
ObjectCache(0x7fd27279eac0)::~ObjectCache(): WARNING! LEAK! object 0x2b11360 still has count 1 (id /mnt/c/Users/User/shree/tessdata/kan.traineddatalstm-number-dawg)

taylankoca commented 7 years ago

Getting the same error.

$ tesseract test.jpeg file
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Error in fopenReadStream: file not found
Error in findFileFormat: image file not found
Error during processing.
ObjectCache(0x7fd8a283bac0)::~ObjectCache(): WARNING! LEAK! object 0x1b46fc0 still has count 1 (id /usr/local/share/tessdata/eng.traineddatalstm-punc-dawg)
ObjectCache(0x7fd8a283bac0)::~ObjectCache(): WARNING! LEAK! object 0x1b46db0 still has count 1 (id /usr/local/share/tessdata/eng.traineddatalstm-word-dawg)
ObjectCache(0x7fd8a283bac0)::~ObjectCache(): WARNING! LEAK! object 0x1b46e60 still has count 1 (id /usr/local/share/tessdata/eng.traineddatalstm-number-dawg)
ObjectCache(0x7fd8a283bac0)::~ObjectCache(): WARNING! LEAK! object 0x27604a0 still has count 1 (id /usr/local/share/tessdata/eng.traineddatapunc-dawg)
ObjectCache(0x7fd8a283bac0)::~ObjectCache(): WARNING! LEAK! object 0x2761910 still has count 1 (id /usr/local/share/tessdata/eng.traineddataword-dawg)
ObjectCache(0x7fd8a283bac0)::~ObjectCache(): WARNING! LEAK! object 0x27601f0 still has count 1 (id /usr/local/share/tessdata/eng.traineddatanumber-dawg)
ObjectCache(0x7fd8a283bac0)::~ObjectCache(): WARNING! LEAK! object 0x1b46bc0 still has count 1 (id /usr/local/share/tessdata/eng.traineddatabigram-dawg)
ObjectCache(0x7fd8a283bac0)::~ObjectCache(): WARNING! LEAK! object 0x286b8a0 still has count 1 (id /usr/local/share/tessdata/eng.traineddatafreq-dawg)

My version is:

$ tesseract --version
tesseract 4.00.00alpha
 leptonica-1.74.1
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

 Found AVX
 Found SSE

Shreeshrii commented 7 years ago

The "file not found" error means that the input file could not be opened. Make sure that test.jpeg exists at the path you give. Try:

ls test.jpeg
tesseract test.jpeg file

Margorp commented 7 years ago

I encountered the same error: tesseract was not able to read GIF files. The error looks like this:

...
Error in pixReadStreamGif: function not present
...

When I checked the version of tesseract, it did not list a GIF library:

tesseract 4.00.00alpha
 leptonica-1.74.1
  libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.25 : libtiff 4.0.6 : zlib 1.2.8

So I installed libgif-dev from Synaptic and recompiled Leptonica. This time the version check gave me:

tesseract 4.00.00alpha
 leptonica-1.74.1
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.25 : libtiff 4.0.6 : zlib 1.2.8

When I tried a GIF with tesseract again, it no longer gave an error.

trey-pindrop commented 7 years ago

Making the PWD be where the file is located works, but I consider that broken. If I supply a relative or absolute path to an input file, tesseract should read that without issue. Instead it says the file can't be found, which is bizarre. I'm going to change my script to use cat and have tesseract read from standard input, because this is so broken.

tesseract --version
tesseract 3.05.00 leptonica-1.74.1 libjpeg 8d : libpng 1.6.29 : libtiff 4.0.7 : zlib 1.2.8

I'm using it on MacOS through homebrew, and my source image is a PNG.

rnmanhon commented 7 years ago

In the image_to_string function of tesseract.py, the input image is converted to BMP and stored in the /tmp directory on my Linux box. When this path (with '/') is passed to subprocess.Popen, tesseract cannot find the file. You can simulate this by running the following at the command prompt:

tesseract /tmp/tess__cwb36mk.bmp output.txt

Shreeshrii commented 7 years ago

@rfschtkt What about these 'LEAK' related warnings?

rfschtkt commented 7 years ago

Can you break on the warning and get some more information? From a quick look, Dict::Load() doesn't add load_bigram_dawg to dawgs_, which either is a bug or should be documented in a comment, but I don't know whether fixing that would solve this problem. Of course, all this stuff should really be RAII-ified... :-)

Shreeshrii commented 7 years ago

Here is some backtrace info from gdb.

It is easy to reproduce the problem: give a nonexistent filename as input to tesseract.

gdb --args tesseract lorem1.png lorem

(gdb)
Tesseract Open Source OCR Engine v4.00.00alpha-496-g2b373d1 with Leptonica
506         bool succeed = api.ProcessPages(image, NULL, 0, renderers[0]);
(gdb)
Error in fopenReadStream: file not found
Error in findFileFormat: image file not found
507         if (!succeed) {
(gdb) backtrace
#0  main (argc=<optimized out>, argv=0x7ffff5d1c0f8) at tesseractmain.cpp:507

(gdb) step
508           fprintf(stderr, "Error during processing.\n");
(gdb)
fprintf (__fmt=0x4037bc "Error during processing.\n", __stream=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/stdio2.h:98
98                              __va_arg_pack ());
------------------------------------------------------------
(gdb) backtrace
#0  pthread_mutex_lock (mutex=0x7f5e2efaec60 <tesseract::tprintfMutex>) at forward.c:192
#1  0x00007f5e2ea8dc0f in tprintf_internal (
    format=format@entry=0x7f5e2eab8510 "ObjectCache(%p)::~ObjectCache(): WARNING! LEAK! object %p still has count %d (id %s)\n") at tprintf.cpp:42
#2  0x00007f5e2e9f19b9 in ~ObjectCache (this=0x7f5e2ef9cda0 <tesseract::Dict::GlobalDawgCache()::cache>, __in_chrg=<optimized out>)
    at ../ccutil/object_cache.h:42
#3  tesseract::DawgCache::~DawgCache (this=0x7f5e2ef9cda0 <tesseract::Dict::GlobalDawgCache()::cache>, __in_chrg=<optimized out>) at dawg_cache.h:30
#4  0x00007f5e2daac1a9 in __run_exit_handlers (status=1, listp=0x7f5e2de2e6c8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true)
    at exit.c:82
#5  0x00007f5e2daac1f5 in __GI_exit (status=<optimized out>) at exit.c:104
#6  0x00000000004022fc in main (argc=<optimized out>, argv=0x7ffff5d1c0f8) at tesseractmain.cpp:435

------------------------
(gdb) backtrace
#0  _IO_no_init (fp=fp@entry=0x7ffff5d1bc00, flags=flags@entry=32768, orientation=orientation@entry=-1, wd=wd@entry=0x0, jmp=jmp@entry=0x0)
    at genops.c:644
#1  0x00007f5e2db79129 in ___vsnprintf_chk (
    s=s@entry=0x7f5e2efeef80 <tprintf_internal(char const*, ...)::msg> "Tesseract Open Source OCR Engine v4.00.00alpha-496-g2b373d1 with Leptonica\n",
 maxlen=<optimized out>, maxlen@entry=65536, flags=flags@entry=1, slen=slen@entry=65537,
    format=format@entry=0x7f5e2eab8510 "ObjectCache(%p)::~ObjectCache(): WARNING! LEAK! object %p still has count %d (id %s)\n",
    args=args@entry=0x7ffff5d1bd78) at vsnprintf_chk.c:53
#2  0x00007f5e2ea8dc59 in vsnprintf (__ap=0x7ffff5d1bd78,
    __fmt=0x7f5e2eab8510 "ObjectCache(%p)::~ObjectCache(): WARNING! LEAK! object %p still has count %d (id %s)\n", __n=65536,
    __s=0x7f5e2efeef80 <tprintf_internal(char const*, ...)::msg> "Tesseract Open Source OCR Engine v4.00.00alpha-496-g2b373d1 with Leptonica\n")
    at /usr/include/x86_64-linux-gnu/bits/stdio2.h:78
#3  tprintf_internal (format=format@entry=0x7f5e2eab8510 "ObjectCache(%p)::~ObjectCache(): WARNING! LEAK! object %p still has count %d (id %s)\n")
    at tprintf.cpp:56
#4  0x00007f5e2e9f19b9 in ~ObjectCache (this=0x7f5e2ef9cda0 <tesseract::Dict::GlobalDawgCache()::cache>, __in_chrg=<optimized out>)
    at ../ccutil/object_cache.h:42
#5  tesseract::DawgCache::~DawgCache (this=0x7f5e2ef9cda0 <tesseract::Dict::GlobalDawgCache()::cache>, __in_chrg=<optimized out>) at dawg_cache.h:30
#6  0x00007f5e2daac1a9 in __run_exit_handlers (status=1, listp=0x7f5e2de2e6c8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true)
    at exit.c:82
#7  0x00007f5e2daac1f5 in __GI_exit (status=<optimized out>) at exit.c:104
#8  0x00000000004022fc in main (argc=<optimized out>, argv=0x7ffff5d1c0f8) at tesseractmain.cpp:435
(gdb) next

rfschtkt commented 7 years ago

Well, I was thinking more along the lines of inspecting the offending object. Apparently fixing what I saw in Dict::Load() didn't solve the problem, so I'll try gdb myself. Unfortunately ./configure --enable-debug doesn't seem to work, because there are -O2 arguments after the -O0 ones, and "If you use multiple -O options, with or without level numbers, the last such option is the one that is effective.", so my workaround for that is to edit configure and configure.ac.
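
The "last -O option wins" rule quoted above can be sketched as a small function that scans a compiler argument list. This is only an illustration of the rule, not how GCC or the configure script is implemented; the name `EffectiveOptFlag` is hypothetical.

```cpp
#include <string>
#include <vector>

// Returns the -O option that takes effect for a GCC-style argument list,
// i.e. the last argument beginning with "-O", or an empty string if the
// list contains no such option. Hypothetical helper for illustration.
std::string EffectiveOptFlag(const std::vector<std::string>& args) {
  std::string effective;
  for (const std::string& arg : args) {
    if (arg.rfind("-O", 0) == 0) {  // arg starts with "-O"
      effective = arg;              // later options override earlier ones
    }
  }
  return effective;
}
```

For an argument list such as `-g -O0 ... -O2`, the function returns `-O2`, which is why the debug build described here still ends up optimized.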

Shreeshrii commented 7 years ago

Thanks. I had also noticed the -O2 and -O0 combination while building with --enable-debug, and was going to ask about it, since I don't know how it behaves.

rfschtkt commented 7 years ago

Intermediate result of analysis: the program seems to be using Dict::GlobalDawgCache(), which has static duration, but the Dict using the cache is not deleted before the program exit()s. To be continued later...

stweil commented 7 years ago

My debug builds avoid the -O0 / -O2 problem like this:

mkdir -p bin/debug
cd bin/debug
# Disable parts which are not needed for debugging (shorter build time)
# and don't use a shared library for Tesseract (easier debugging).
# Avoid -O2 compiler option.
../../configure  --enable-debug --disable-shared --disable-static CXXFLAGS="-g"
make
cd ../..
gdb --args bin/debug/api/tesseract [...]

I have not yet found a correct fix for configure.ac.

rfschtkt commented 7 years ago

I have some evidence (in a new branch issue529 on my own fork, unless there's a better way to present that?), currently modifying the messages coming from the global cache. It mentions "workaround" here and there, but perhaps a cleaner solution can be found.

stweil commented 7 years ago

Valgrind output for the test case (after PR #912 was applied):

 HEAP SUMMARY:
     in use at exit: 16,109,940 bytes in 4 blocks
   total heap usage: 666,366 allocs, 666,362 frees, 179,459,012 bytes allocated

 Searching for pointers to 4 not-freed blocks
 Checked 19,243,304 bytes

 8 bytes in 1 blocks are still reachable in loss record 1 of 4
    at 0x4C2BBAF: malloc (vg_replace_malloc.c:299)
    by 0x5D8C688: gomp_malloc (alloc.c:37)
    by 0x5D9B867: gomp_init_num_threads (proc.c:91)
    by 0x5D8ACC5: initialize_env (env.c:1208)
    by 0x400F649: call_init.part.0 (dl-init.c:72)
    by 0x400F75A: call_init (dl-init.c:30)
    by 0x400F75A: _dl_init (dl-init.c:120)
    by 0x4000CD9: ??? (in /lib/x86_64-linux-gnu/ld-2.24.so)
    by 0x2: ???
    by 0xFFF000412: ???
    by 0xFFF00043B: ???
    by 0xFFF00043D: ???

 12 bytes in 1 blocks are indirectly lost in loss record 2 of 4
    at 0x4C2BBAF: malloc (vg_replace_malloc.c:299)
    by 0x2A8339: alloc_string(int) (memry.cpp:32)
    by 0x2AAEF0: STRING::AllocData(int, int) (strngs.cpp:55)
    by 0x2AB120: STRING::STRING(STRING const&) (strngs.cpp:114)
    by 0x234CBA: tesseract::Dawg::Dawg(tesseract::DawgType, STRING const&, PermuterType, int) (dawg.h:209)
    by 0x31149F: tesseract::SquishedDawg::SquishedDawg(tesseract::DawgType, STRING const&, PermuterType, int) (dawg.h:416)
    by 0x311260: tesseract::DawgLoader::Load() (dawg_cache.cpp:91)
    by 0x311C6F: _TessMemberResultCallback_0_0<true, tesseract::Dawg*, tesseract::DawgLoader>::Run() (tesscallback.h:145)
    by 0x311787: tesseract::ObjectCache<tesseract::Dawg>::Get(STRING, TessResultCallback<tesseract::Dawg*>*) (object_cache.h:78)
    by 0x3110DC: tesseract::DawgCache::GetSquishedDawg(STRING const&, tesseract::TessdataType, int, tesseract::TessdataManager*) (dawg_cache.cpp:51)
    by 0x231A97: tesseract::Dict::Load(STRING const&, tesseract::TessdataManager*) (dict.cpp:242)
    by 0x20B765: tesseract::Wordrec::program_editup(char const*, tesseract::TessdataManager*, tesseract::TessdataManager*) (tface.cpp:54)

 100 (88 direct, 12 indirect) bytes in 1 blocks are definitely lost in loss record 3 of 4
    at 0x4C2C21F: operator new(unsigned long) (vg_replace_malloc.c:334)
    by 0x31123F: tesseract::DawgLoader::Load() (dawg_cache.cpp:91)
    by 0x311C6F: _TessMemberResultCallback_0_0<true, tesseract::Dawg*, tesseract::DawgLoader>::Run() (tesscallback.h:145)
    by 0x311787: tesseract::ObjectCache<tesseract::Dawg>::Get(STRING, TessResultCallback<tesseract::Dawg*>*) (object_cache.h:78)
    by 0x3110DC: tesseract::DawgCache::GetSquishedDawg(STRING const&, tesseract::TessdataType, int, tesseract::TessdataManager*) (dawg_cache.cpp:51)
    by 0x231A97: tesseract::Dict::Load(STRING const&, tesseract::TessdataManager*) (dict.cpp:242)
    by 0x20B765: tesseract::Wordrec::program_editup(char const*, tesseract::TessdataManager*, tesseract::TessdataManager*) (tface.cpp:54)
    by 0x18F69F: tesseract::Tesseract::init_tesseract_internal(char const*, char const*, char const*, tesseract::OcrEngineMode, char**, int, GenericVector<STRING> const*, GenericVector<STRING> const*, bool, tesseract::TessdataManager*) (tessedit.cpp:412)
    by 0x18F2D2: tesseract::Tesseract::init_tesseract(char const*, char const*, char const*, tesseract::OcrEngineMode, char**, int, GenericVector<STRING> const*, GenericVector<STRING> const*, bool, tesseract::TessdataManager*) (tessedit.cpp:324)
    by 0x12D377: tesseract::TessBaseAPI::Init(char const*, int, char const*, tesseract::OcrEngineMode, char**, int, GenericVector<STRING> const*, GenericVector<STRING> const*, bool, bool (*)(STRING const&, GenericVector<char>*)) (baseapi.cpp:326)
    by 0x12D0C0: tesseract::TessBaseAPI::Init(char const*, char const*, tesseract::OcrEngineMode, char**, int, GenericVector<STRING> const*, GenericVector<STRING> const*, bool) (baseapi.cpp:284)
    by 0x12BB90: main (tesseractmain.cpp:434)

 16,109,832 bytes in 1 blocks are possibly lost in loss record 4 of 4
    at 0x4C2C93F: operator new[](unsigned long) (vg_replace_malloc.c:423)
    by 0x3102E0: tesseract::SquishedDawg::read_squished_dawg(tesseract::TFile*) (dawg.cpp:330)
    by 0x3114D4: tesseract::SquishedDawg::Load(tesseract::TFile*) (dawg.h:439)
    by 0x311277: tesseract::DawgLoader::Load() (dawg_cache.cpp:92)
    by 0x311C6F: _TessMemberResultCallback_0_0<true, tesseract::Dawg*, tesseract::DawgLoader>::Run() (tesscallback.h:145)
    by 0x311787: tesseract::ObjectCache<tesseract::Dawg>::Get(STRING, TessResultCallback<tesseract::Dawg*>*) (object_cache.h:78)
    by 0x3110DC: tesseract::DawgCache::GetSquishedDawg(STRING const&, tesseract::TessdataType, int, tesseract::TessdataManager*) (dawg_cache.cpp:51)
    by 0x231A97: tesseract::Dict::Load(STRING const&, tesseract::TessdataManager*) (dict.cpp:242)
    by 0x20B765: tesseract::Wordrec::program_editup(char const*, tesseract::TessdataManager*, tesseract::TessdataManager*) (tface.cpp:54)
    by 0x18F69F: tesseract::Tesseract::init_tesseract_internal(char const*, char const*, char const*, tesseract::OcrEngineMode, char**, int, GenericVector<STRING> const*, GenericVector<STRING> const*, bool, tesseract::TessdataManager*) (tessedit.cpp:412)
    by 0x18F2D2: tesseract::Tesseract::init_tesseract(char const*, char const*, char const*, tesseract::OcrEngineMode, char**, int, GenericVector<STRING> const*, GenericVector<STRING> const*, bool, tesseract::TessdataManager*) (tessedit.cpp:324)
    by 0x12D377: tesseract::TessBaseAPI::Init(char const*, int, char const*, tesseract::OcrEngineMode, char**, int, GenericVector<STRING> const*, GenericVector<STRING> const*, bool, bool (*)(STRING const&, GenericVector<char>*)) (baseapi.cpp:326)

 LEAK SUMMARY:
    definitely lost: 88 bytes in 1 blocks
    indirectly lost: 12 bytes in 1 blocks
      possibly lost: 16,109,832 bytes in 1 blocks
    still reachable: 8 bytes in 1 blocks
         suppressed: 0 bytes in 0 blocks

 ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
 ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)

The output was generated using valgrind --verbose --track-origins=yes --leak-check=full --show-leak-kinds=all bin/debug/x86_64-linux-gnu/api/tesseract a b.
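
When comparing such reports before and after a change, it can help to extract the byte counts from the LEAK SUMMARY. A minimal sketch, assuming the summary text has been captured into a string and uses the `kind: N bytes in M blocks` layout shown above; the function name `LeakBytes` is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns the byte count reported for `kind` (e.g. "definitely lost") in a
// Valgrind LEAK SUMMARY, or -1 if no matching line is found. Thousands
// separators such as "16,109,832" are skipped. Hypothetical helper.
long long LeakBytes(const std::string& summary, const std::string& kind) {
  const std::string key = kind + ":";
  std::size_t pos = summary.find(key);
  if (pos == std::string::npos) return -1;
  pos += key.size();
  while (pos < summary.size() && summary[pos] == ' ') ++pos;  // skip padding
  long long bytes = 0;
  bool any_digit = false;
  for (; pos < summary.size(); ++pos) {
    const char c = summary[pos];
    if (std::isdigit(static_cast<unsigned char>(c))) {
      bytes = bytes * 10 + (c - '0');
      any_digit = true;
    } else if (c != ',') {
      break;  // end of the number
    }
  }
  return any_digit ? bytes : -1;
}
```

Applied to the summary above, `LeakBytes(text, "definitely lost")` yields 88 and `LeakBytes(text, "possibly lost")` yields 16109832.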

rfschtkt commented 7 years ago

The thing is that exit() was inherited from C, and all kinds of things are hooked up to atexit() (even variables with static storage duration), but not automatic variables like tesseract::TessBaseAPI api;. Inside main() you could replace it with plain return, but elsewhere I think you have to avoid exit() and instead, until the program becomes exception-safe (RAII!) with a global try/catch in main(), perhaps use quick_exit(). Or adapt the code I wrote for evidence as a workaround.

Incidentally, main() has the following comment (although I have no idea why anybody would care about leaked STRING objects, as compared to, e.g., open handles to files, or the subject of this issue):

  /* main() calls functions like ParseArgs which call exit().
   * This results in memory leaks if vars_vec and vars_values are
   * declared as auto variables (destructor is not called then). */
  static GenericVector<STRING> vars_vec;
  static GenericVector<STRING> vars_values;

Well, I guess it might matter to diagnostic tools like Valgrind, but I suppose they're empty now.
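
The mechanism described in this comment can be sketched with stand-in classes. `std::exit()` destroys objects with static storage duration but does not unwind the stack, so an automatic object that holds a reference into a static cache never releases it, and the cache destructor then sees a nonzero count. Returning a status code lets the automatic object be destroyed first. All names below (`Cache`, `GlobalCache`, `CacheUser`, `Run`) are hypothetical and only loosely modelled on the roles of the Tesseract classes discussed here.

```cpp
#include <cstdio>

// Stand-in for a process-wide reference-counted cache.
class Cache {
 public:
  ~Cache() {
    // Mirrors the kind of check that produced the warnings in this issue.
    if (count_ != 0) {
      std::fprintf(stderr, "WARNING! LEAK! count is still %d\n", count_);
    }
  }
  void Acquire() { ++count_; }
  void Release() { --count_; }
  int outstanding() const { return count_; }

 private:
  int count_ = 0;
};

Cache& GlobalCache() {
  static Cache cache;  // static storage duration: destroyed at program exit
  return cache;
}

// Stand-in for an object that holds a cache reference for its lifetime.
class CacheUser {
 public:
  CacheUser() { GlobalCache().Acquire(); }
  ~CacheUser() { GlobalCache().Release(); }
  CacheUser(const CacheUser&) = delete;
  CacheUser& operator=(const CacheUser&) = delete;
};

// Reports failure by returning a status instead of calling exit(), so
// `user` is destroyed and releases its reference on every path.
int Run(bool input_ok) {
  CacheUser user;  // automatic storage duration
  if (!input_ok) {
    return 1;  // an exit(1) here would skip ~CacheUser()
  }
  return 0;
}
```

With this shape, `main()` can simply `return Run(...)`, and the cache count is back to zero before static destructors run.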

stweil commented 7 years ago

When tools like Valgrind are used to search for critical memory leaks, any memory leak is bad because it creates a warning which has to be analyzed. Example: Before PR #912 there were 40,102 allocated blocks at program termination, after that PR there remain 3 blocks (see above).

LEAK SUMMARY:
   definitely lost: 0 bytes in 0 blocks
   indirectly lost: 0 bytes in 0 blocks
     possibly lost: 0 bytes in 0 blocks
   still reachable: 41,740,480 bytes in 40,102 blocks
                      of which reachable via heuristic:
                        newarray           : 4,147,800 bytes in 4,350 blocks
        suppressed: 0 bytes in 0 blocks

rfschtkt commented 7 years ago

I fully agree, and I should have guessed that it had taken some effort to get it down to 4 blocks.

Anyway, after my latest commit I don't get the messages anymore.

(Added) What's the 2 all about? Is anybody using it, is it documented? Otherwise, could it be just EXIT_FAILURE instead? /pedantic

stweil commented 7 years ago

An additional commit in PR #912 fixes this issue.

Shreeshrii commented 7 years ago

Thank you!

This fixes the problem in the test case I had mentioned: give a nonexistent filename as input to tesseract.

shree@ALL-IN-1-TOUCH:/mnt/c/Users/User/shree/tesseract-head$ tesseract lorem1.png lorem
Tesseract Open Source OCR Engine v4.00.00alpha-512-g6bebe71 with Leptonica
Error in fopenReadStream: file not found
Error in findFileFormat: image file not found
Error during processing.

However, there are a few other cases where the error was occurring, e.g. a missing output directory; please see https://github.com/tesseract-ocr/tesseract/issues/529#issuecomment-269325872

In that case now I get a different error:

 gdb --args tesseract p002-crop.bmp missing/bmp --oem 1 --psm 6 -l hin

(gdb) run
Starting program: /usr/local/bin/tesseract p002-crop.bmp missing/bmp --oem 1 --psm 6 -l hin
warning: Error disabling address space randomization: Success
warning: linux_ptrace_test_ret_to_nx: PTRACE_KILL waitpid returned -1: Interrupted system call
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Tesseract Open Source OCR Engine v4.00.00alpha-512-g6bebe71 with Leptonica
Error during processing.

Program received signal SIGSEGV, Segmentation fault.
_IO_new_fclose (fp=0x0) at iofclose.c:54
54      iofclose.c: No such file or directory.

(gdb) backtrace
#0  _IO_new_fclose (fp=0x0) at iofclose.c:54
#1  0x000000000041a617 in tesseract::TessResultRenderer::~TessResultRenderer (this=0x1bb0160, __in_chrg=<optimized out>) at renderer.cpp:51
#2  0x000000000041b28f in tesseract::TessTextRenderer::~TessTextRenderer (this=0x1bb0160, __in_chrg=<optimized out>) at renderer.h:141
#3  0x000000000041b2be in tesseract::TessTextRenderer::~TessTextRenderer (this=0x1bb0160, __in_chrg=<optimized out>) at renderer.h:141
#4  0x00000000004083d6 in GenericVector<tesseract::TessResultRenderer*>::delete_data_pointers (this=0x9095c0 <main::renderers>)
    at ../ccutil/genericvector.h:874
#5  0x0000000000407f48 in tesseract::PointerVector<tesseract::TessResultRenderer>::clear (this=0x9095c0 <main::renderers>)
    at ../ccutil/genericvector.h:522
#6  0x0000000000407c7c in tesseract::PointerVector<tesseract::TessResultRenderer>::~PointerVector (this=0x9095c0 <main::renderers>,
    __in_chrg=<optimized out>) at ../ccutil/genericvector.h:456
#7  0x00007f7a7a58c1a9 in __run_exit_handlers (status=1, listp=0x7f7a7a90e6c8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true)
    at exit.c:82
#8  0x00007f7a7a58c1f5 in __GI_exit (status=<optimized out>) at exit.c:104
#9  0x0000000000407863 in main (argc=9, argv=0x7ffff50bc518) at tesseractmain.cpp:518
(gdb)

Shreeshrii commented 7 years ago

@rnmanhon Please check your testcase also. https://github.com/tesseract-ocr/tesseract/issues/529#issuecomment-299127136

I am not getting the error with the latest code. I had not tested it earlier.


$ cp  p002-crop.bmp /tmp/tess__cwb36mk.bmp
$ tesseract /tmp/tess__cwb36mk.bmp output.txt
Tesseract Open Source OCR Engine v4.00.00alpha-512-g6bebe71 with Leptonica
$

Shreeshrii commented 7 years ago

These changes may also need to be backported for 3.05.

Testcase https://github.com/tesseract-ocr/tesseract/issues/529#issuecomment-294188223

stweil commented 7 years ago

The crash problem is handled in PR #917.

rfschtkt commented 7 years ago

I think that using static to pander to ghost-of-the-past exit() is an abomination. Exiting other than in main is most probably the result of an error condition, where all you're concerned with is returning the error value. The only place where you should be concerned with a clean Valgrind report is inside main(), where you should use return rather than exit() to be compatible with the C++ paradigm of proper stack unwinding.

(Added) Unfortunately there's also ScrollView::Exit(), not sure whether these messages were a problem there?
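
The "global try/catch in main()" idea mentioned earlier in the thread can be sketched as follows: code outside main() throws instead of calling exit(), the stack unwinds through every automatic object, and one handler at the top converts the exception into an exit status. This is an illustration under assumed names (`RunGuarded`, `Resource`, `ProcessOrThrow` are hypothetical), not the change that was made in Tesseract.

```cpp
#include <cstdlib>
#include <exception>
#include <functional>
#include <stdexcept>

// Runs `body` and converts any std::exception into EXIT_FAILURE. Because
// the exception is caught, the stack is unwound and destructors of
// automatic objects between the throw and this handler are run.
int RunGuarded(const std::function<void()>& body) {
  try {
    body();
    return EXIT_SUCCESS;
  } catch (const std::exception&) {
    return EXIT_FAILURE;
  }
}

// Hypothetical resource whose destructor records that cleanup happened.
struct Resource {
  explicit Resource(bool* released) : released_(released) {}
  ~Resource() { *released_ = true; }
  bool* released_;
};

// Stands in for code deep inside a library that would otherwise call exit().
void ProcessOrThrow(bool input_ok, bool* released) {
  Resource resource(released);
  if (!input_ok) {
    throw std::runtime_error("Error during processing.");
  }
}
```

In this arrangement `main()` would contain little more than `return RunGuarded(...)`, so the only place that decides the process exit status is main itself.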

Shreeshrii commented 7 years ago

@stweil @rfschtkt The ~ObjectCache(): WARNING! LEAK! problem is fixed, so I am closing this issue.

Thanks!

mgrint2 commented 7 years ago

C:\Program Files\Tesseract-OCR>tesseract aws.tif aa.pdf
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Error in fopenReadStream: file not found
Error in findFileFormat: image file not found
Error during processing.
ObjectCache(649E5A88)::~ObjectCache(): WARNING! LEAK! object 014DB180 still has count 1 (id \Program Files\Tesseract-OCR\tessdata/eng.traineddatalstm-punc-dawg)
ObjectCache(649E5A88)::~ObjectCache(): WARNING! LEAK! object 014DB228 still has count 1 (id \Program Files\Tesseract-OCR\tessdata/eng.traineddatalstm-word-dawg)
ObjectCache(649E5A88)::~ObjectCache(): WARNING! LEAK! object 01732ED8 still has count 1 (id \Program Files\Tesseract-OCR\tessdata/eng.traineddatalstm-number-dawg)
ObjectCache(649E5A88)::~ObjectCache(): WARNING! LEAK! object 048A4618 still has count 1 (id \Program Files\Tesseract-OCR\tessdata/eng.traineddatapunc-dawg)
ObjectCache(649E5A88)::~ObjectCache(): WARNING! LEAK! object 014DB278 still has count 1 (id \Program Files\Tesseract-OCR\tessdata/eng.traineddataword-dawg)
ObjectCache(649E5A88)::~ObjectCache(): WARNING! LEAK! object 014DB118 still has count 1 (id \Program Files\Tesseract-OCR\tessdata/eng.traineddatanumber-dawg)
ObjectCache(649E5A88)::~ObjectCache(): WARNING! LEAK! object 048A89E0 still has count 1 (id \Program Files\Tesseract-OCR\tessdata/eng.traineddatabigram-dawg)
ObjectCache(649E5A88)::~ObjectCache(): WARNING! LEAK! object 048A8A80 still has count 1 (id \Program Files\Tesseract-OCR\tessdata/eng.traineddatafreq-dawg)

C:\Program Files\Tesseract-OCR>tesseract -v
tesseract 4.00.00alpha
 leptonica-1.74.1
  libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0

Please help with this. Tesseract is installed on Windows 7.

mgrint2 commented 7 years ago

Installed package tesseract-ocr-setup-4.00.00dev

Shreeshrii commented 7 years ago

Is it from https://github.com/UB-Mannheim/tesseract/wiki

http://digi.bib.uni-mannheim.de/tesseract/

http://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-4.00.00dev.exe


Shreeshrii commented 7 years ago

Error in fopenReadStream: file not found
Error in findFileFormat: image file not found

You need to give the correct location of the image file.


developer239 commented 2 years ago

@Shreeshrii So what is the solution here? I can see a link to an exe file, but what about people on Mac? 🤔 Is there something wrong with the current brew installation?

This is all it takes to cause the memory leak:

    tesseract::TessBaseAPI* tesseractBaseApi;

    ReaderSystem() {
      tesseractBaseApi = new tesseract::TessBaseAPI();

      if (tesseractBaseApi->Init(nullptr, "eng")) {
        fprintf(stderr, "Could not initialize Tesseract.");
        exit(1);
      }
    }

    ~ReaderSystem() {
      tesseractBaseApi->Clear();
      tesseractBaseApi->End();
      delete tesseractBaseApi;
    }
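
One thing visible in the snippet above is that the exit(1) path leaves the constructor without calling End() or deleting the engine. The following is a minimal sketch of an alternative, not a confirmed fix: `FakeEngine` is a hypothetical stand-in for tesseract::TessBaseAPI so that the example compiles without Tesseract, and the thread does not confirm whether this removes the warning on any particular build. It owns the engine through std::unique_ptr and throws instead of calling exit(), so the engine is released on the failure path as well.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for tesseract::TessBaseAPI. Like the real Init(),
// this Init() returns 0 on success and non-zero on failure.
struct FakeEngine {
  static int live;  // number of engines currently alive
  FakeEngine() { ++live; }
  ~FakeEngine() { --live; }
  int Init(const char* /*datapath*/, const char* language) {
    return std::string(language) == "eng" ? 0 : -1;
  }
  void End() {}
};
int FakeEngine::live = 0;

class ReaderSystem {
 public:
  explicit ReaderSystem(const char* language) : engine_(new FakeEngine()) {
    if (engine_->Init(nullptr, language) != 0) {
      engine_->End();
      // engine_ is already fully constructed, so throwing here destroys it
      // and frees the engine; exit() would skip that step.
      throw std::runtime_error("Could not initialize engine.");
    }
  }
  ~ReaderSystem() { engine_->End(); }

 private:
  std::unique_ptr<FakeEngine> engine_;
};
```

The caller can catch std::runtime_error in main() and return a non-zero status from there.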
