tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Assertion failure with new image and eng+chi_tra fast #4362

Open marcreichman-pfi opened 1 week ago

marcreichman-pfi commented 1 week ago

Current Behavior

This is on recent main (9f17a3fd). I receive a SIGABRT in Release (SIGILL in Debug) with the eng and chi_tra languages. Both are the official fast models.

(gdb) set args ~/dev/testimages/ACCDEE72E33B2C425E597A4411009466.jpg - --tessdata-dir <snip>/tessdata/ -l eng+chi_tra
(gdb) r
Starting program: /root/dev/tesseract/build-debug/bin/tesseract ~/dev/testimages/ACCDEE72E33B2C425E597A4411009466.jpg - --tessdata-dir <snip>/tessdata/ -l eng+chi_tra
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Estimating resolution as 261
Detected 12 diacritics
[New Thread 0x7ffff73c6640 (LWP 5374)]
[New Thread 0x7ffff6bc5640 (LWP 5375)]
[New Thread 0x7ffff63c4640 (LWP 5376)]
!w_it.cycled_list():Error:Assert failed:in file /root/dev/tesseract/src/ccstruct/pageres.cpp, line 1502

Thread 1 "tesseract" received signal SIGILL, Illegal instruction.
tesseract::ERRCODE::error (this=this@entry=0x5555558a1340 <tesseract::ASSERT_FAILED>, caller=caller@entry=0x5555557f9123 "!w_it.cycled_list()", action=action@entry=tesseract::ABORT, format=format@entry=0x5555557f8900 "in file %s, line %d") at /root/dev/tesseract/src/ccutil/errcode.cpp:78
78            __builtin_trap();
(gdb) bt
#0  tesseract::ERRCODE::error (this=this@entry=0x5555558a1340 <tesseract::ASSERT_FAILED>, caller=caller@entry=0x5555557f9123 "!w_it.cycled_list()", action=action@entry=tesseract::ABORT,
    format=format@entry=0x5555557f8900 "in file %s, line %d") at /root/dev/tesseract/src/ccutil/errcode.cpp:78
#1  0x000055555558485c in tesseract::PAGE_RES_IT::DeleteCurrentWord (this=this@entry=0x7fffffffdc00) at /root/dev/tesseract/src/ccstruct/pageres.cpp:1502
#2  0x000055555561a972 in tesseract::Tesseract::recog_all_words (this=0x7ffff73c7010, page_res=0x5555558e18e0, monitor=monitor@entry=0x0, target_word_box=target_word_box@entry=0x0,
    word_config=word_config@entry=0x0, dopasses=dopasses@entry=0) at /root/dev/tesseract/src/ccmain/control.cpp:446
#3  0x00005555555d5553 in tesseract::TessBaseAPI::Recognize (this=this@entry=0x7fffffffe2d0, monitor=monitor@entry=0x0) at /root/dev/tesseract/src/api/baseapi.cpp:833
#4  0x00005555555d57e3 in tesseract::TessBaseAPI::ProcessPage (this=this@entry=0x7fffffffe2d0, pix=0x5555558e2230, page_index=page_index@entry=0,
    filename=filename@entry=0x7fffffffe774 "/root/dev/testimages/ACCDEE72E33B2C425E597A4411009466.jpg", retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0,
    renderer=0x5555558d2740) at /root/dev/tesseract/src/api/baseapi.cpp:1218
#5  0x00005555555d68e4 in tesseract::TessBaseAPI::ProcessPagesInternal (this=this@entry=0x7fffffffe2d0,
    filename=0x7fffffffe774 "/root/dev/testimages/ACCDEE72E33B2C425E597A4411009466.jpg", retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0,
    renderer=0x5555558d2740) at /root/dev/tesseract/src/api/baseapi.cpp:1181
#6  0x00005555555d69ea in tesseract::TessBaseAPI::ProcessPages (this=this@entry=0x7fffffffe2d0, filename=<optimized out>, retry_config=retry_config@entry=0x0,
    timeout_millisec=timeout_millisec@entry=0, renderer=<optimized out>) at /root/dev/tesseract/src/api/baseapi.cpp:998
#7  0x000055555556d6c3 in main (argc=<optimized out>, argv=<optimized out>) at /usr/include/c++/11/bits/unique_ptr.h:173

Expected Behavior

No sig abort

Suggested Fix

No response

tesseract -v

tesseract 5.5.0-26-g9f17a
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found AVX
 Found SSE4.1
 Found OpenMP 201511

Operating System

Ubuntu 22.04 Jammy

Other Operating System

WSL

uname -a

Linux hostname 5.10.16.3-microsoft-standard-WSL2 #1 SMP Fri Apr 2 22:23:49 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Compiler

GCC 11.4

CPU

Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz

Virtualization / Containers

No response

Other Information

I'm sure this is related to the series of random-generator issues (#4361, #4146, #4148, #4270). This one is also reproducible in 5.5.0, unlike #4361, which worked in 5.5.0.

marcreichman-pfi commented 1 week ago

ACCDEE72E33B2C425E597A4411009466

Here is the image for this one, sorry.

stweil commented 1 week ago

There is a heap-use-after-free before the assertion:


Estimating resolution as 261
Detected 12 diacritics
=================================================================
==31201==ERROR: AddressSanitizer: heap-use-after-free on address 0x6080000034b8 at pc 0x55a73474bd12 bp 0x7fffbe0cdab0 sp 0x7fffbe0cdaa8
READ of size 8 at 0x6080000034b8 thread T0
    #0 0x55a73474bd11 in std::__cxx1998::_Base_bitset<1ul>::_M_getword(unsigned long) const /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bitset:415:16
    #1 0x55a73474bc82 in std::__cxx1998::bitset<16ul>::_Unchecked_test(unsigned long) const /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bitset:1066:24
    #2 0x55a73474bc00 in std::__cxx1998::bitset<16ul>::operator[](unsigned long) const /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bitset:1168:16
    #3 0x55a73474bba2 in std::__debug::bitset<16ul>::operator[](unsigned long) const /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/debug/bitset:282:16
    #4 0x55a73474b2df in tesseract::WERD::flag(tesseract::WERD_FLAGS) const /tesseract/build/../src/ccstruct/werd.h:129:12
    #5 0x55a7349c0280 in tesseract::Tesseract::recog_all_words(tesseract::PAGE_RES*, tesseract::ETEXT_DESC*, tesseract::TBOX const*, char const*, int) /tesseract/build/../src/ccmain/control.cpp:350:37
    #6 0x55a7346a24af in tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*) /tesseract/build/../src/api/baseapi.cpp:833:21
    #7 0x55a7346a4b99 in tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) /tesseract/build/../src/api/baseapi.cpp:1218:14
    #8 0x55a7346a92b8 in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) /tesseract/build/../src/api/baseapi.cpp:1181:16
    #9 0x55a7346a61f1 in tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) /tesseract/build/../src/api/baseapi.cpp:998:17
    #10 0x55a7346262f2 in main /tesseract/build/../src/tesseract.cpp:867:24
    #11 0x7f8a62f23249 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #12 0x7f8a62f23304 in __libc_start_main csu/../csu/libc-start.c:360:3
    #13 0x55a734563450 in _start (/tesseract/build/tesseract+0x17d9450) (BuildId: 76aacbbd0f98892a9872e3f978f3ed72519cf4ee)

0x6080000034b8 is located 24 bytes inside of 96-byte region [0x6080000034a0,0x608000003500)
freed by thread T0 here:
    #0 0x55a7346218cd in operator delete(void*) (/tesseract/build/tesseract+0x18978cd) (BuildId: 76aacbbd0f98892a9872e3f978f3ed72519cf4ee)
    #1 0x55a734fb773e in tesseract::WERD_RES::Clear() /tesseract/build/../src/ccstruct/pageres.cpp:1130:5
    #2 0x55a734fcb438 in tesseract::WERD_RES::~WERD_RES() /tesseract/build/../src/ccstruct/pageres.cpp:1125:3
    #3 0x55a734fd0bee in tesseract::PAGE_RES_IT::ReplaceCurrentWord(tesseract::PointerVector<tesseract::WERD_RES>*) /tesseract/build/../src/ccstruct/pageres.cpp:1483:3
    #4 0x55a7349b840b in tesseract::Tesseract::classify_word_and_language(int, tesseract::PAGE_RES_IT*, tesseract::WordData*) /tesseract/build/../src/ccmain/control.cpp:1367:14
    #5 0x55a7349bbe84 in tesseract::Tesseract::RecogAllWordsPassN(int, tesseract::ETEXT_DESC*, tesseract::PAGE_RES_IT*, std::__debug::vector<tesseract::WordData, std::allocator<tesseract::WordData> >*) /tesseract/build/../src/ccmain/control.cpp:255:5
    #6 0x55a7349c0125 in tesseract::Tesseract::recog_all_words(tesseract::PAGE_RES*, tesseract::ETEXT_DESC*, tesseract::TBOX const*, char const*, int) /tesseract/build/../src/ccmain/control.cpp:345:10
    #7 0x55a7346a24af in tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*) /tesseract/build/../src/api/baseapi.cpp:833:21
    #8 0x55a7346a4b99 in tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) /tesseract/build/../src/api/baseapi.cpp:1218:14
    #9 0x55a7346a92b8 in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) /tesseract/build/../src/api/baseapi.cpp:1181:16
    #10 0x55a7346a61f1 in tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) /tesseract/build/../src/api/baseapi.cpp:998:17
    #11 0x55a7346262f2 in main /tesseract/build/../src/tesseract.cpp:867:24
    #12 0x7f8a62f23249 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16

previously allocated by thread T0 here:
    #0 0x55a73462106d in operator new(unsigned long) (/tesseract/build/tesseract+0x189706d) (BuildId: 76aacbbd0f98892a9872e3f978f3ed72519cf4ee)
    #1 0x55a734fb5302 in tesseract::ROW_RES::ROW_RES(bool, tesseract::ROW*) /tesseract/build/../src/ccstruct/pageres.cpp:171:21
    #2 0x55a734fb3c97 in tesseract::BLOCK_RES::BLOCK_RES(bool, tesseract::BLOCK*) /tesseract/build/../src/ccstruct/pageres.cpp:109:31
    #3 0x55a734fb32aa in tesseract::PAGE_RES::PAGE_RES(bool, tesseract::BLOCK_LIST*, tesseract::WERD_CHOICE**) /tesseract/build/../src/ccstruct/pageres.cpp:84:13
    #4 0x55a73469f93e in tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*) /tesseract/build/../src/api/baseapi.cpp:783:13
    #5 0x55a7346a4b99 in tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) /tesseract/build/../src/api/baseapi.cpp:1218:14
    #6 0x55a7346a92b8 in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) /tesseract/build/../src/api/baseapi.cpp:1181:16
    #7 0x55a7346a61f1 in tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) /tesseract/build/../src/api/baseapi.cpp:998:17
    #8 0x55a7346262f2 in main /tesseract/build/../src/tesseract.cpp:867:24
    #9 0x7f8a62f23249 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16

SUMMARY: AddressSanitizer: heap-use-after-free /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bitset:415:16 in std::__cxx1998::_Base_bitset<1ul>::_M_getword(unsigned long) const
Shadow bytes around the buggy address:
  0x0c107fff8640: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 fa
  0x0c107fff8650: fa fa fa fa fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c107fff8660: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 05
  0x0c107fff8670: fa fa fa fa fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c107fff8680: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 06
=>0x0c107fff8690: fa fa fa fa fd fd fd[fd]fd fd fd fd fd fd fd fd
  0x0c107fff86a0: fa fa fa fa fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c107fff86b0: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c107fff86c0: fa fa fa fa fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c107fff86d0: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 02
  0x0c107fff86e0: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==31201==ABORTING
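
Read together, the three stacks above suggest that an object freed through PAGE_RES_IT::ReplaceCurrentWord during RecogAllWordsPassN is read again shortly afterwards in recog_all_words. The following is a minimal sketch of that general pattern and one way to avoid it. It is not Tesseract's code: `Word`, `WordList`, `ReplaceCurrent` and `FlagAfterReplace` are hypothetical names, and std::list with std::unique_ptr stands in for Tesseract's own list types.

```cpp
#include <cassert>
#include <list>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a word result; not Tesseract's WERD_RES.
struct Word {
  std::string text;
  bool fuzzy = false;
};

using WordList = std::list<std::unique_ptr<Word>>;

// Replaces the element at `it` with a new word. The old Word is destroyed,
// so any raw pointer to it that was taken earlier is dangling afterwards.
void ReplaceCurrent(WordList::iterator it, std::string text) {
  auto replacement = std::make_unique<Word>();
  replacement->text = std::move(text);
  replacement->fuzzy = true;
  *it = std::move(replacement);  // the previous Word is freed here
}

// Safe pattern: do not keep a raw pointer across a call that may replace
// the element; re-read it through the iterator after the call.
bool FlagAfterReplace(WordList::iterator it) {
  ReplaceCurrent(it, "replacement");
  const Word *current = it->get();  // fetched after the replace
  assert(current != nullptr);
  return current->fuzzy;
}
```

The unsafe variant would take `const Word *w = it->get();` before calling `ReplaceCurrent` and then read `w->fuzzy`, which is the kind of access AddressSanitizer reports as heap-use-after-free.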

stweil commented 1 week ago

The current code uses random values to add noise outside of the image. Using a constant instead of random values might work better (I have not yet tried this with the other cases):

diff --git a/src/lstm/networkio.cpp b/src/lstm/networkio.cpp
index 3cb068c6..83347260 100644
--- a/src/lstm/networkio.cpp
+++ b/src/lstm/networkio.cpp
@@ -417,7 +417,7 @@ void NetworkIO::Randomize(int t, int offset, int num_features, TRand *randomizer
   if (int_mode_) {
     int8_t *line = i_[t] + offset;
     for (int i = 0; i < num_features; ++i) {
-      line[i] = IntCastRounded(randomizer->SignedRand(INT8_MAX));
+      line[i] = 0;
     }
   } else {
     // float mode.
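
To illustrate the trade-off in the patch above, here is a minimal sketch comparing random and constant padding. It does not use Tesseract's NetworkIO or TRand: `PadMode` and `MakePadding` are hypothetical names, std::mt19937 stands in for the randomizer, and the value range is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical padding modes; not part of Tesseract.
enum class PadMode { kRandom, kZero };

// Builds one line of padding features, either seeded noise or constant zero.
std::vector<std::int8_t> MakePadding(int num_features, PadMode mode,
                                     unsigned seed) {
  assert(num_features >= 0);
  // Value-initialized, so every element starts at 0.
  std::vector<std::int8_t> line(static_cast<std::size_t>(num_features));
  if (mode == PadMode::kZero) {
    return line;  // constant padding: independent of the seed
  }
  std::mt19937 rng(seed);
  // Assumed range; symmetric around zero like signed 8-bit noise.
  std::uniform_int_distribution<int> dist(-INT8_MAX, INT8_MAX);
  for (auto &value : line) {
    value = static_cast<std::int8_t>(dist(rng));
  }
  return line;
}
```

With `PadMode::kZero` the result is the same for every seed, so a run no longer depends on the state of the generator. That makes failures easier to reproduce, but it can also hide bugs that only some padding values trigger.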

egorpugin commented 1 week ago

Still, it would be better to understand what is wrong with the list usage. I guess the lists are used incorrectly somewhere.
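
As a general illustration of the kind of list misuse meant here, the sketch below removes elements from a list while iterating over it without leaving the iterator pointing at a deleted node. It uses std::list as a stand-in, since Tesseract has its own list and iterator classes, and `RemoveEmpty` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>

// Removes every empty string from `words` and returns how many were removed.
std::size_t RemoveEmpty(std::list<std::string> &words) {
  const std::size_t original_size = words.size();
  std::size_t removed = 0;
  for (auto it = words.begin(); it != words.end();) {
    if (it->empty()) {
      it = words.erase(it);  // erase returns the next valid iterator
      ++removed;
    } else {
      ++it;
    }
  }
  // Postcondition: nothing was lost or counted twice.
  assert(words.size() + removed == original_size);
  return removed;
}
```

The point is that the iterator is advanced only through the value returned by `erase`, never through an iterator to a node that has already been deleted.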

egorpugin commented 1 week ago

Or, more generally, fix all the other issues around random values and the crashes they expose.