tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Integer overflow in overlap computations #1821

Open stweil opened 6 years ago

stweil commented 6 years ago

The functions VCoreOverlap and VSignificantCoreOverlap calculate integer differences which can overflow when one of the operands is ±INT32_MAX (for example when median_bottom_ == INT32_MAX). The GNU compiler can build code which detects signed integer overflow at runtime (compiler option `-ftrapv`). Tesseract then gets an unhandled trap and terminates.

Overflows are triggered with this image by running tesseract -l script/Fraktur 0604.jp2 0604.
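
To illustrate the kind of overflow described above, here is a minimal sketch of an overlap computation that cannot overflow. It is not Tesseract's actual code: the function name SafeCoreOverlap and its four parameters are hypothetical, and it assumes the overlap is the smaller top minus the larger bottom. The difference is computed in 64 bits and then clamped to the int32_t range.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper (an assumption, not Tesseract's implementation):
// vertical overlap of two [bottom, top] intervals.
// The subtraction is done in int64_t, so it cannot overflow even when an
// operand is INT32_MAX or -INT32_MAX; the result is clamped to int32_t.
int32_t SafeCoreOverlap(int32_t top_a, int32_t bottom_a,
                        int32_t top_b, int32_t bottom_b) {
  const int64_t top = std::min(top_a, top_b);
  const int64_t bottom = std::max(bottom_a, bottom_b);
  const int64_t diff = top - bottom;  // 64-bit difference, always representable
  const int64_t clamped =
      std::max<int64_t>(INT32_MIN, std::min<int64_t>(diff, INT32_MAX));
  return static_cast<int32_t>(clamped);
}
```

With a 32-bit subtraction, a top of -INT32_MAX and a bottom of INT32_MAX would overflow; in this sketch the result saturates at INT32_MIN instead. Whether saturation is the right behaviour for layout analysis is a separate question.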

stweil commented 6 years ago

I am still not sure whether it is an error when median_bottom_ still has its initial value, or whether that is normal and should be handled. In any case, I expect that the integer overflow will result in wrong layout recognition, so it will be visible in the OCR result.

Maybe somebody can find a simpler test image which also triggers the overflow. I think it must have a multi-column layout. My test image is rather large and takes a long time to OCR.

stweil commented 6 years ago

It's a pity that -ftrapv costs performance (about 35 % longer execution time according to my tests) – otherwise we could enable it always.
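
Since enabling -ftrapv globally is expensive, one possible alternative is to check only the suspicious subtractions. The following sketch uses the GCC/Clang builtin __builtin_sub_overflow for that. The helper name CheckedSub is hypothetical and not part of Tesseract, and the builtin is not available in MSVC.

```cpp
#include <cstdint>

// Hypothetical helper (an assumption, not Tesseract code):
// stores a - b in *result and returns true if the difference fits in int32_t;
// returns false on overflow. Uses the GCC/Clang builtin, which reports
// overflow without invoking undefined behaviour.
bool CheckedSub(int32_t a, int32_t b, int32_t *result) {
  return !__builtin_sub_overflow(a, b, result);
}
```

A caller could treat a false return as "no meaningful overlap" or log it while testing. That choice is left open here.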

amitdo commented 6 years ago

Some related links:

https://stackoverflow.com/questions/38960763/ftrapv-and-fwrapv-which-is-better-for-efficiency/38960868

http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html

https://bugzilla.mozilla.org/show_bug.cgi?id=1031653

amitdo commented 2 years ago

Did commit 7f911ac5e027ac8a fix this issue?

See also #320.

stweil commented 2 years ago

I am afraid that the current code still does not handle all cases which can result in an integer overflow, so more tests are needed with -ftrapv enabled.