Open MerlijnWajer opened 3 years ago
@MerlijnWajer, did you replace the image? I just loaded the image from the URL above, and it works because the size is 1706 x 24795 which is supported by the current tesseract.
Leptonica uses l_int32 for image dimensions.
Yes, unfortunately someone changed the image to make it work for the existing item I linked to. I'll try to provide another sample file.
Here's another example:
It looks like with commit 57b79742920cdda6d72e4fd7d0cab218db22f08b this limit is removed, and now the output is:
$ tesseract -c thresholding_method=2 /tmp/সহজ_কুরআন_শিক্ষা_0005.jp2 -
terminate called after throwing an instance of 'std::length_error'
what(): cannot create std::vector larger than max_size()
Several Tesseract classes are currently limited to images with a maximum width and height of 32767 (INT16_MAX) because they use int16_t coordinates. Here is a list of the classes identified so far:
TPOINT, BLOCK, PDBLK, ICOORD, ICOORDELT, TBOX, OL_BUCKETS
For some of those it might be possible to replace int16_t by uint16_t, which would raise the limit for image dimensions to 65535, but some also use negative values. So maybe we have to use int32_t (which is compatible with Leptonica).
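To make the three limits concrete, here is a minimal sketch in plain C++. The helpers fits_coordinate and narrow_checked are hypothetical names used only for illustration; they are not part of Tesseract or Leptonica. The sketch only shows which dimensions each candidate coordinate type can represent.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not a Tesseract API): reports whether an image
// dimension, held as a 32-bit signed integer, can be stored in
// coordinate type T without loss.
template <typename T>
constexpr bool fits_coordinate(std::int32_t dim) {
  return dim >= 0 &&
         static_cast<std::int64_t>(dim) <=
             static_cast<std::int64_t>(std::numeric_limits<T>::max());
}

// The limits discussed above.
static_assert(std::numeric_limits<std::int16_t>::max() == 32767,
              "int16_t coordinates stop at 32767");
static_assert(std::numeric_limits<std::uint16_t>::max() == 65535,
              "uint16_t coordinates stop at 65535");

// Hypothetical helper: narrows a dimension to int16_t the way an
// int16_t-based coordinate class would, but asserts first so that an
// oversized value is caught instead of silently wrapping.
inline std::int16_t narrow_checked(std::int32_t dim) {
  assert(fits_coordinate<std::int16_t>(dim));
  return static_cast<std::int16_t>(dim);
}
```

With these definitions, a height of 24795 fits in int16_t, 40000 fits only in uint16_t or int32_t, and 70000 needs int32_t. A negative coordinate is rejected by every check here, which is why uint16_t is not a drop-in replacement for the classes that store negative values.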
@MerlijnWajer, if you want, you can test pull request #3435 with your large images. I still don't consider that stable code, so I would not use it in production for normal-sized images.
Great! I'll try to do this next week.
Environment
Linux gentoo-x230 5.6.18-grsec #2 SMP Tue Jul 7 18:17:17 CEST 2020 x86_64 Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz GenuineIntel GNU/Linux
Current Behavior:
On large images, Tesseract fails like this:
This is the image in question (large image!): https://archive.org/download/manualzz-id-765154/765154_jp2.zip/765154_jp2/765154_0017.jp2
Expected Behavior:
Tesseract would process the image without erroring out.
Comments
I don't know where exactly this limitation comes from. I see that the specific error comes from the Otsu thresholding code, but I am not sure whether the 2^15 (INT16_MAX) limit is also a Leptonica maximum size limit. Perhaps this is not considered a problem and the bug can be closed, but as it stands I am not sure what the best practice would be to OCR the image linked above. Perhaps the limit can be raised somewhat, to 2^16?
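One possible workaround while the limit exists, offered as an assumption and not something Tesseract provides, is to cut a tall image into horizontal strips that each stay within 32767 rows and OCR them separately. The sketch below only computes the strip boundaries. The name split_rows, the Strip struct, and the default overlap of 200 rows are hypothetical choices for illustration; cropping the image and merging text duplicated in the overlap regions are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One horizontal band of the image: first row and number of rows.
struct Strip {
  std::int32_t top;
  std::int32_t height;
};

// Hypothetical helper (not a Tesseract or Leptonica API): splits
// image_height rows into strips of at most max_height rows each.
// Consecutive strips share `overlap` rows so that a text line cut by
// one boundary appears whole in the neighbouring strip.
inline std::vector<Strip> split_rows(std::int32_t image_height,
                                     std::int32_t max_height = 32767,
                                     std::int32_t overlap = 200) {
  assert(image_height > 0 && overlap >= 0 && max_height > overlap);
  std::vector<Strip> strips;
  std::int32_t top = 0;
  while (true) {
    const std::int32_t height = std::min(max_height, image_height - top);
    strips.push_back({top, height});
    if (top + height >= image_height) break;  // last strip reached the bottom
    top += max_height - overlap;              // step down, keeping the overlap
  }
  return strips;
}
```

An image that already fits, such as one 24795 rows high, yields a single strip covering the whole height. The same idea would apply to the width if it exceeded the limit, though the sketch handles the height only.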