tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Memory access violation when using SetRectangle for 8Bit images #4127

Open AdmiralPellaeon opened 1 year ago

AdmiralPellaeon commented 1 year ago

Current Behavior

Hi

I get a memory access violation when using SetRectangle with an 8-bit image. I only tested the simple example from the Tesseract documentation (SetRectangle_example from https://tesseract-ocr.github.io/tessdoc/Examples_C++.html). My image is a PNG file. When I save it as an 8-bit image, I get the crash in debug mode. When I save it as a 32-bit image, it works properly.

The VS debugger reported the violation in the function HistogramRect in the file otsuthr.cpp, line 157: int pixel = GET_DATA_BYTE(linedata, (x + left) * num_channels + channel);

I tried to check whether there is an error in the formula for calculating the address. But I am not familiar with the Tesseract internals (I installed it last week for the first time), so I have no idea what causes the error.

But I think the function looks a little odd. The input parameter src_pix is a pointer to the small rectangle part of the image, not to the overall (big) image: its width and height are those of the rectangle. Accordingly, the function uses the WPL (words per line) of the small rectangle image (src_wpl) and not that of the whole original picture, which has a much bigger resolution. But when calculating the address of the pixels, the algorithm uses the coordinates of the rectangle within the whole big picture (left, top, width, height), which are "global" pixel coordinates. So it seems that the image is only a shallow structure pointing to the memory of the loaded image. But if this is true, why does the line const l_uint32 *linedata = srcdata + y * src_wpl; use the WPL of the small rectangle and not that of the whole image? If it is the memory of the loaded image, then the stride should be based on the width of the whole image and not on the width of the small rectangle. On the other hand, it seems to work properly with 32-bit images. If the address calculation were wrong, it should also cause problems with all other channel counts.
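The indexing concern above can be checked with plain arithmetic, without Tesseract or Leptonica. The following is a minimal sketch, not the library's code: it assumes rows are padded to 32-bit words (as in Leptonica's Pix) and models only the highest word index that a loop of the form srcdata + y * src_wpl with byte offset (x + left) * num_channels + channel would touch. The names Rect, WordsPerLine, MaxWordIndex and InBounds are hypothetical helpers introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Rectangle in "global" coordinates, as passed to SetRectangle.
struct Rect {
  int left, top, width, height;
};

// Words per raster line, assuming rows are padded to 32-bit words.
inline std::int64_t WordsPerLine(std::int64_t width, int depth) {
  return (width * depth + 31) / 32;
}

// Highest 32-bit word index touched when rows are addressed as
// srcdata + y * wpl for y in [top, top + height) and bytes as
// (x + left) * num_channels + channel for x in [0, width).
inline std::int64_t MaxWordIndex(const Rect& r, std::int64_t wpl, int depth) {
  assert(depth % 8 == 0 && depth >= 8);  // only whole-byte depths are modelled
  const int num_channels = depth / 8;
  const std::int64_t last_row = static_cast<std::int64_t>(r.top) + r.height - 1;
  const std::int64_t last_byte =
      (static_cast<std::int64_t>(r.left) + r.width - 1) * num_channels +
      (num_channels - 1);
  return last_row * wpl + last_byte / 4;
}

// True if every access stays inside a buffer of pix_w x pix_h pixels.
inline bool InBounds(const Rect& r, int pix_w, int pix_h, int depth) {
  const std::int64_t wpl = WordsPerLine(pix_w, depth);
  return MaxWordIndex(r, wpl, depth) < wpl * pix_h;
}
```

For the rectangle reported later in this thread (613, 2311, 86, 28), InBounds(r, 86, 28, 8) is false: an 86 x 28 buffer at 8 bits per pixel has 22 words per line and 616 words in total, while the loop would reach far beyond that. The same call with depth 32 is also false, so under this model global coordinates on a cropped buffer would fail for any depth. With a full-size buffer that contains the rectangle (for example a hypothetical 2480 x 3508 page; the thread does not state the real dimensions) the accesses stay in bounds. This would be consistent with the 32-bit case receiving the whole image, though that is an inference and not confirmed here.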

These are my thoughts. Hope it helps a little bit.

Best regards

Expected Behavior

No response

Suggested Fix

No response

tesseract -v

Tesseract 5.3.2

Operating System

Windows 10

Other Operating System

No response

uname -a

No response

Compiler

MSVC 2022 17.7.4, x64

CPU

No response

Virtualization / Containers

No response

Other Information

No response

zdenop commented 1 year ago

Please provide also an example image.

AdmiralPellaeon commented 1 year ago

I added two images to the initial post. The parameters for SetRectangle are 613, 2311, 86, 28 to extract the number.

zdenop commented 1 year ago

Where did you add images?

AdmiralPellaeon commented 1 year ago

I added them to the initial post as Test-Images.zip. Don't know why they vanished.

Anyway, I attach the zip again to this post.

Test_images.zip

AdmiralPellaeon commented 1 year ago

A small addition to my initial post and the HistogramRect function mentioned there. Below you can see an image of the variable values shortly before the access violation. As mentioned in my post, the calculation of the pixel index seems strange to me. It is a mix of local and global information. I am not familiar with the data structures in the background (the Image data type), but I would normally expect only local pixel coordinates, because the image seems to be a stand-alone image copied out of the global image. I don't know if this is correct, though. Perhaps contributors with more knowledge of the internals could check this.

[Image: Debugger]
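The distinction between local and global coordinates can be illustrated with a simplified model. This is a sketch under stated assumptions, not Tesseract's implementation: Gray8 is a hypothetical 8-bit grayscale image with one byte per pixel and no row padding (Leptonica pads rows to 32-bit words), and Crop, HistogramOfRect and MakeGradient are illustrative helpers. The point is that the histogram of a rectangle is the same whether it is taken from the full image with global coordinates or from a stand-alone crop with local coordinates starting at (0, 0). Mixing the two is what goes out of bounds.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified 8-bit grayscale image: row-major, one byte per pixel, no padding.
struct Gray8 {
  int width;
  int height;
  std::vector<std::uint8_t> data;  // size == width * height
  std::uint8_t at(int x, int y) const {
    return data[static_cast<std::size_t>(y) * width + x];
  }
};

using Histogram = std::array<int, 256>;

// Stand-alone copy of the rectangle (left, top, w, h) of src.
Gray8 Crop(const Gray8& src, int left, int top, int w, int h) {
  assert(left >= 0 && top >= 0 && left + w <= src.width && top + h <= src.height);
  Gray8 out{w, h, std::vector<std::uint8_t>(static_cast<std::size_t>(w) * h)};
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      out.data[static_cast<std::size_t>(y) * w + x] = src.at(left + x, top + y);
    }
  }
  return out;
}

// Histogram of a rectangle given in the coordinate system of img itself.
Histogram HistogramOfRect(const Gray8& img, int left, int top, int w, int h) {
  Histogram hist{};
  for (int y = top; y < top + h; ++y) {
    for (int x = left; x < left + w; ++x) {
      ++hist[img.at(x, y)];
    }
  }
  return hist;
}

// Deterministic test pattern so results are reproducible.
Gray8 MakeGradient(int w, int h) {
  Gray8 img{w, h, std::vector<std::uint8_t>(static_cast<std::size_t>(w) * h)};
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      img.data[static_cast<std::size_t>(y) * w + x] =
          static_cast<std::uint8_t>((x * 7 + y * 13) % 256);
    }
  }
  return img;
}
```

With full = MakeGradient(64, 48) and crop = Crop(full, 10, 20, 16, 8), the calls HistogramOfRect(full, 10, 20, 16, 8) and HistogramOfRect(crop, 0, 0, 16, 8) return equal histograms. Calling HistogramOfRect(crop, 10, 20, 16, 8) would read past the end of the crop's buffer, which is the kind of access the debugger flagged.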

zdenop commented 12 months ago

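Seems like a bug in the Tesseract implementation of OtsuThreshold for 8-bit images:

  • if you use another thresholding method (e.g. SetVariable("thresholding_method", "1")), there is no crash
  • if an image is 4-bit or 16-bit, it is converted to 32-bit, which did not cause a crash (a 1-bit image is not subject to thresholding)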

Felix00643298 commented 11 months ago

Seems like a bug in Tesseract implementation of OtsuThreshold and 8Bit images:

  • if you use another thresholding method (e.g. SetVariable("thresholding_method", "1")), there is no crash
  • if an image is 4-bit or 16-bit, it is converted to 32-bit, which did not cause a crash (a 1-bit image is not subject to thresholding)

Seems like I got a similar issue. I am currently using version 5.3.0. This is the stack info I printed out:

(gdb) bt
#0  0x00007f0f2670e505 in tesseract::HistogramRect(tesseract::Image, int, int, int, int, int, int*) ()
   from /usr/lib/x86_64-linux-gnu/libtesseract.so.5
#1  0x00007f0f2670e795 in tesseract::OtsuThreshold(tesseract::Image, int, int, int, int, std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&) () from /usr/lib/x86_64-linux-gnu/libtesseract.so.5
#2  0x00007f0f266d1c5b in tesseract::ImageThresholder::OtsuThresholdRectToPix(tesseract::Image, tesseract::Image*) const () from /usr/lib/x86_64-linux-gnu/libtesseract.so.5
#3  0x00007f0f266d1de2 in tesseract::ImageThresholder::ThresholdToPix(tesseract::Image*) ()
   from /usr/lib/x86_64-linux-gnu/libtesseract.so.5
#4  0x00007f0f2666d4a4 in tesseract::TessBaseAPI::Threshold(Pix**) ()
   from /usr/lib/x86_64-linux-gnu/libtesseract.so.5
#5  0x00007f0f2666edf7 in tesseract::TessBaseAPI::FindLines() ()
   from /usr/lib/x86_64-linux-gnu/libtesseract.so.5
#6  0x00007f0f26671fe4 in tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*) ()
   from /usr/lib/x86_64-linux-gnu/libtesseract.so.5
#7  0x00007f0f26673e5f in tesseract::TessBaseAPI::GetUTF8Text() ()
   from /usr/lib/x86_64-linux-gnu/libtesseract.so.5
#8  0x000055db997da672 in ocr_image (image_path="image/test.bmp") at demo.cpp:30
#9  0x000055db997da834 in main () at demo.cpp:57