tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Maximum supported image size #3184

Open MerlijnWajer opened 3 years ago

MerlijnWajer commented 3 years ago

Environment

Current Behavior:

On large images, Tesseract fails like this:

Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Image too large: (2559, 37192)
Error during processing.

This is the image in question (large image!): https://archive.org/download/manualzz-id-765154/765154_jp2.zip/765154_jp2/765154_0017.jp2

Expected Behavior:

Tesseract would process the image without erroring out.

Comments

I don't know exactly where this limitation comes from. I see that the specific error comes from the Otsu thresholding code, but I am not sure whether the limit of 2^15 - 1 (INT16_MAX) is also a Leptonica maximum size limit.

Perhaps this is not considered a problem and the bug can be closed, but as it stands I am not sure what the best practice would be to OCR the image linked above. Perhaps the limit can be raised somewhat, to 2^16?
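Until the limit is raised, one possible workaround (not proposed anywhere in this thread, so treat it as an assumption) is to cut a tall page into overlapping horizontal strips that each stay within INT16_MAX rows and OCR them separately. The sketch below only computes the strip boundaries. `Strip` and `SplitIntoStrips` are hypothetical names, not Tesseract or Leptonica APIs, and the actual cropping is left to whatever image library is in use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper type, not part of Tesseract or Leptonica.
struct Strip {
  int32_t top;     // first row of the strip (inclusive)
  int32_t height;  // number of rows in the strip
};

// Split `image_height` rows into strips of at most `max_height` rows.
// Consecutive strips share `overlap` rows, so a text line no taller than
// `overlap` that is cut by one strip boundary appears whole in the next strip.
std::vector<Strip> SplitIntoStrips(int32_t image_height, int32_t max_height,
                                   int32_t overlap) {
  assert(image_height > 0 && max_height > 0);
  assert(overlap >= 0 && overlap < max_height);
  std::vector<Strip> strips;
  const int32_t step = max_height - overlap;  // rows advanced per strip
  for (int32_t top = 0;; top += step) {
    const int32_t height = std::min(max_height, image_height - top);
    strips.push_back({top, height});
    if (top + height >= image_height) break;  // last strip reaches the bottom
  }
  return strips;
}
```

For the 2559 x 37192 image from the report, `SplitIntoStrips(37192, 32767, 200)` yields two strips: rows 0 to 32766 and rows 32567 to 37191. Text recognized twice in the overlap region would still have to be de-duplicated when the results are merged.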

stweil commented 3 years ago

@MerlijnWajer, did you replace the image? I just loaded the image from the URL above, and it works because the size is 1706 x 24795 which is supported by the current tesseract.

Leptonica uses l_int32 for image dimensions.

MerlijnWajer commented 3 years ago

Yes, unfortunately someone changed the image to make it work for the existing item I linked to. I'll try to provide another sample file.

MerlijnWajer commented 3 years ago

Here's another example:

https://wizzup.org/সহজ_কুরআন_শিক্ষা_0005.jp2

amitdo commented 3 years ago

https://github.com/tesseract-ocr/tesseract/blob/8b0c5405e2fa8cbecb3693c0074c5c8d1ae321b8/src/ccmain/thresholder.cpp#L188-L192

MerlijnWajer commented 3 years ago

It looks like with 57b79742920cdda6d72e4fd7d0cab218db22f08b this limit is removed, and now the output is:

$ tesseract -c thresholding_method=2 /tmp/সহজ_কুরআন_শিক্ষা_0005.jp2 -
terminate called after throwing an instance of 'std::length_error'
  what():  cannot create std::vector larger than max_size()

stweil commented 3 years ago

Several Tesseract classes are currently limited to images with a maximum width and height of 32767 (INT16_MAX) because they use int16_t coordinates. Here is a list of classes identified so far:

TPOINT, BLOCK, PDBLK, ICOORD, ICOORDELT, TBOX, OL_BUCKETS

For some of those it might be possible to replace int16_t with uint16_t, which would raise the limit for image dimensions to 65535, but some also use negative values. So maybe we have to use int32_t (which is compatible with Leptonica).
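To make the trade-off concrete, here is a small sketch of which values each candidate coordinate type can hold. `FitsIn` is a hypothetical helper written for this illustration and is not taken from the Tesseract sources. The value 37192 is the image height from the original report.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: true if `value` can be stored in coordinate type T
// without loss.
template <typename T>
constexpr bool FitsIn(int32_t value) {
  return value >= std::numeric_limits<T>::min() &&
         value <= std::numeric_limits<T>::max();
}

// int16_t tops out at 32767, so the reported height does not fit.
static_assert(!FitsIn<int16_t>(37192), "exceeds INT16_MAX");
// uint16_t reaches 65535, which covers this image ...
static_assert(FitsIn<uint16_t>(37192), "within UINT16_MAX");
// ... but it cannot represent the negative values some classes rely on.
static_assert(!FitsIn<uint16_t>(-1), "no negative values");
// int32_t handles both cases.
static_assert(FitsIn<int32_t>(37192) && FitsIn<int32_t>(-1), "int32_t is enough");
```

All checks are evaluated at compile time, so the snippet compiles only if the stated ranges hold.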

stweil commented 3 years ago

@MerlijnWajer, if you want, you can test pull request #3435 with your large images. I still don't consider it stable code, so I would not use it in production for normal-sized images.

MerlijnWajer commented 3 years ago

Great! I'll try to do this next week.