tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Error in boxClipToRectangle: box outside rectangle #427

Open PedroBarcha opened 8 years ago

PedroBarcha commented 8 years ago

Hi there, I have some specific images that produce the following output on Linux:

Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box

The pictures are OCRed successfully by Tesseract (though without great results). The bigger problem for me, however, is that OCRopus does not OCR them at all.

Attached images: example5, ghoby30c

Any ideas?

amitdo commented 7 years ago
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box

Add a white/black frame to the image and no error messages will appear.

convert  427-1.jpg  -bordercolor White -border 10x10 427-1b.jpg

Strange behaviour...
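
For batch use, the same workaround can be scripted. A minimal sketch in Python, assuming ImageMagick's `convert` is on the PATH; the helper names `border_command` and `add_border` are made up for illustration and are not part of Tesseract or ImageMagick:

```python
import subprocess


def border_command(src, dst, color="White", width=10):
    """Build the ImageMagick command line that pads `src` with a solid frame."""
    return [
        "convert", src,
        "-bordercolor", color,
        "-border", f"{width}x{width}",
        dst,
    ]


def add_border(src, dst, color="White", width=10):
    """Run the command; raises CalledProcessError if convert exits non-zero."""
    subprocess.run(border_command(src, dst, color, width), check=True)
```

Calling `add_border("427-1.jpg", "427-1b.jpg")` should be equivalent to the command above.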

amitdo commented 7 years ago

The biggest problem for me, however, is that in OCRopus they don't even get OCRed.

This place is for bug reports about Tesseract, not OCRopus.

erikdubbelboer commented 7 years ago

@amitdo I'm getting the same issue just with Tesseract. I'm guessing OCRopus is using Tesseract and that's why he made the issue here.

amitdo commented 7 years ago

I'm guessing OCRopus is using Tesseract

Ocropy (and clstm) does not use Tesseract. A VERY OLD version of Ocropus (0.4) did use Tesseract.

amitdo commented 4 years ago

Similar issues: #468, #1601

These error messages are produced by Leptonica.

They are triggered by a call to pixClipBoxToForeground()

https://github.com/DanBloomberg/leptonica/blob/bbe289cf3f0fe368d5b9eac64df2ccd6e9b05c56/src/pix5.c#L1956

https://github.com/tesseract-ocr/tesseract/search?q=pixClipBoxToForeground

amitdo commented 4 years ago

@stweil, this seems like a bug in Tesseract, maybe you can explore it and find its cause.

amitdo commented 4 years ago

https://github.com/tesseract-ocr/tesseract/search?q=pixClipBoxToForeground

I noticed that Tesseract does not check the return value from Leptonica's functions (l_ok).

stweil commented 4 years ago

@stweil, this seems like a bug in Tesseract, maybe you can explore it and find its cause.

It's caused by a box with width / height 0, but as always in Tesseract it is difficult to find the right fix.
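
To illustrate how an empty box can end up reported as "box outside rectangle", here is a simplified Python model of the clipping check, written from my reading of Leptonica's `boxClipToRectangle`. The real function is C and works on `BOX` structs; the name `clip_box_to_rectangle` and the tuple layout are only for illustration. The assumption is that a box whose right or bottom edge does not reach past the image origin counts as outside, so a zero-size box at (0, 0) is rejected.

```python
def clip_box_to_rectangle(box, img_w, img_h):
    """Clip box (x, y, w, h) to the rectangle (0, 0, img_w, img_h).

    Returns the clipped box, or None where the C function would
    report "box outside rectangle".
    """
    x, y, w, h = box
    # No overlap with the image, which includes an empty box at the origin.
    if x >= img_w or y >= img_h or x + w <= 0 or y + h <= 0:
        return None
    # Trim the part that sticks out on the left or top.
    if x < 0:
        w += x
        x = 0
    if y < 0:
        h += y
        y = 0
    # Trim the part that sticks out on the right or bottom.
    w = min(w, img_w - x)
    h = min(h, img_h - y)
    return (x, y, w, h)
```

In this model, `clip_box_to_rectangle((0, 0, 0, 0), 250, 50)` returns `None`, and a caller that does not check for that would pass an invalid box on to the next step.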

Nemesis77swe commented 2 years ago

This error is still present. I tried to read a 250x50 image and got the error.
After a few trials, I found that 250x51 works, so apparently there is a lower limit on the image size.

csidirop commented 2 years ago

I have the same issue. I have software that fetches images via wget and then runs OCR on them with Tesseract. I noticed that with some images (or rather, as I found out, some resolutions) the following errors occur:

Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box

I found out that this only occurs at some resolutions, so I wrote a script to check this on an example image. The script successively decreases the resolution of the image and then tries to OCR it with Tesseract. The image has a resolution of 2090x1504 pixels.

There are no errors until the height reaches 1578 pixels. Then errors appear irregularly, and from 1502 pixels on they appear for nearly every image. Some images generate several of these errors, e.g.:

h: 1094
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box

Unlike @Nemesis77swe,

there's a limit for the smallest size of image

I don't think that there is a limit. I think it may be an arithmetic issue somewhere in the code that produces a box with a width / height of 0, as @stweil stated.

I attached the script and the output and this is the image.
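
The attached script is not reproduced here. A minimal Python sketch of this kind of scan, assuming ImageMagick's `convert` and the `tesseract` CLI are on the PATH and that Leptonica's messages arrive on stderr; the function names and the `resized_<height>.png` file names are hypothetical:

```python
import subprocess

ERROR_MARKER = "Error in boxClipToRectangle"


def count_clip_errors(stderr_text):
    """Count how often the Leptonica clipping error appears in the output."""
    return sum(1 for line in stderr_text.splitlines() if ERROR_MARKER in line)


def scan_heights(image, start, stop, step=1):
    """Resize `image` to each height from `start` down to `stop` and OCR it.

    Returns a list of (height, error_count) pairs.
    """
    results = []
    for height in range(start, stop - 1, -step):
        resized = f"resized_{height}.png"
        # The geometry "x<height>" lets ImageMagick keep the aspect ratio.
        subprocess.run(
            ["convert", image, "-resize", f"x{height}", resized], check=True
        )
        # "stdout" as output base prints the text instead of writing a file.
        proc = subprocess.run(
            ["tesseract", resized, "stdout"], capture_output=True, text=True
        )
        results.append((height, count_clip_errors(proc.stderr)))
    return results
```

Heights with a non-zero count in the returned list are the ones that trigger the error.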


Platform:

Linux notebook63 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Tesseract Version:

tesseract 5.2.0-13-g74e22
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
 Found libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3

csidirop commented 2 years ago

I tried this on another Windows machine in WSL, with the same results:

Ubuntu 20.04 (on both Windows machines) and Debian buster show exactly the same output.

amitdo commented 2 years ago

@csidirop,

Does adding a white or black border to the image help?

https://github.com/tesseract-ocr/tesseract/issues/427#issuecomment-248153491

If not, post an image that demonstrates the issue.

csidirop commented 2 years ago

Indeed, there are no errors after adding a white border.