justaddcoffee / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Access Violation - reading outside image buffer during line detection #1496

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1.  Tesseract 3.02+ command line
2.  "tesseract -l eng Image_crop.png Image pdf"

What is the expected output? What do you see instead?
>  I expect tesseract to run and produce output

> Instead, Tesseract crashes with "ACCESS VIOLATION (0xC0000005)"-type error.

What version of the product are you using? On what operating system?
Seen in Tesseract 3.02.02 and code from SVN around March 2015.
Windows 7
32-bit Windows builds of Tesseract.

Please provide any additional information below.
- Doesn't happen in 64-bit Windows build (lucky?)

- Attached image has non-white pixels at image edges - this seems to trigger 
this crash bug.

- Access violation occurs in TextlineProjection::MeanPixelsInLineSegment() when 
it calls GET_DATA_BYTE() (~line 550).  This can break when the start_pt/end_pt Y 
values are 0 and the offset is negative.  It can also break when the 
start_pt/end_pt Y value is at the bottom of the image and the offset is 
positive.  These conditions lead to attempted reads of data either before or 
after the image buffer.

- Other problems would occur horizontally (i.e. X value = 0 or right edge of 
image).  In these cases there is less chance of stepping outside the image 
buffer (unless at a corner), but a good chance that the algorithm will not read 
the intended data because the read wraps to the other side of the image.
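
The two failure modes above can be illustrated with plain index arithmetic. This is a minimal sketch, assuming a simplified 8-bit, row-major image with no row padding (Leptonica's Pix pads rows to 32-bit words, so the real offsets differ). GrayImage, UncheckedOffset and OffsetInBuffer are hypothetical names, not part of Tesseract or Leptonica.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical simplified image: 8-bit pixels, row-major, no row padding.
// This is NOT Leptonica's Pix layout; it only illustrates the arithmetic.
struct GrayImage {
  int width;
  int height;
  std::vector<std::uint8_t> data;  // width * height bytes
};

// Byte offset that an unchecked read of pixel (x, y) would use.
long long UncheckedOffset(const GrayImage& img, int x, int y) {
  assert(img.width > 0 && img.height > 0);
  return static_cast<long long>(y) * img.width + x;
}

// True if the offset lands inside the buffer. Note that this does not
// mean the offset refers to the pixel the caller intended.
bool OffsetInBuffer(const GrayImage& img, long long offset) {
  return offset >= 0 && offset < static_cast<long long>(img.data.size());
}
```

For a 4x3 image, (x=0, y=-2) gives offset -8 (before the buffer) and (x=0, y=4) gives offset 16 (past the end). By contrast, (x=-1, y=1) gives offset 3, which is inside the buffer but is really the last pixel of row 0: the horizontal "wrapping" case.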

Original issue reported on code.google.com by rtaylor...@gmail.com on 8 Jul 2015 at 10:53

GoogleCodeExporter commented 9 years ago
Your report is strange:
1. you wrote you use tesseract 3.02 - but that version does not provide pdf 
output
2. you wrote you use code from SVN around March 2015 - but this project 
switched from svn to git in 2014/08 (see [1]) and a few weeks ago we moved to 
github.com (see the announcement on the main page [2])

So it looks like you need to use the correct code from the correct place...

[1] https://code.google.com/p/tesseract-ocr/source/list?r=736d32747333a5ff68162975c04054bc30792572&r=298e31465a445e54defedd076217ff24b1af3fc2
[2] https://code.google.com/p/tesseract-ocr/

Original comment by zde...@gmail.com on 22 Jul 2015 at 8:26

GoogleCodeExporter commented 9 years ago
It looks like I copied a command line from a call to a recent (3.03-ish) build 
that could generate PDF.  That, however, is just a distraction.  

The attached image causes an access violation during the internal segmentation 
code - i.e. before recognition, and long before any actual output (hocr, pdf, 
text, etc.)

As far as I could tell the Google Code source for Tesseract started getting 
mirrored to GitHub sometime around July 2014.  However, I see no announcements 
that it was clearly moved to GitHub until your "Tesseract moved to github" 
posting on June 14, 2015.  Based on comparisons, the source from Google Code 
(SVN) and from GitHub (GIT) was exactly the same until recently.

I will also test with the posted 3.02.02 binaries - that may remove any issue 
of which source was used.

Finally - this bug is apparent through code inspection.  The logic in 
TextlineProjection works on a line of pixels.  The calling code selects a line 
segment to analyze and calls MeanPixelsInLineSegment() with the current line 
and then with other adjacent lines chosen by offsetting x or y by +2/-2 +1/-3.  
When analyzing a horizontal line at y=0, the adjacent line where y=-2 will be 
trying to read pixel data outside the image buffer - which causes an access 
violation unless the memory happens (lucky?) to be readable.
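
The role of the clamping step can be sketched as follows. This is a simplified illustration, not Tesseract's code: it assumes an unpadded 8-bit row-major image and a horizontal segment, and the names GrayImage, Point, ClipToImage and MeanOnShiftedRow are hypothetical. ClipToImage plays the part that TruncateToImageBounds() plays in TextlineProjection.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical simplified image: 8-bit pixels, row-major, no row padding.
struct GrayImage {
  int width;
  int height;
  std::vector<std::uint8_t> data;  // width * height bytes
};

struct Point {
  int x;
  int y;
};

// Clamp a point into the image, in the spirit of TruncateToImageBounds().
Point ClipToImage(const GrayImage& img, Point pt) {
  assert(img.width > 0 && img.height > 0);
  pt.x = std::min(std::max(pt.x, 0), img.width - 1);
  pt.y = std::min(std::max(pt.y, 0), img.height - 1);
  return pt;
}

// Mean pixel value along a horizontal segment shifted by y_offset rows.
// Both endpoints are clamped first, so a shift to y = -2 reads row 0
// instead of memory before the buffer.
double MeanOnShiftedRow(const GrayImage& img, Point start, Point end,
                        int y_offset) {
  start.y += y_offset;
  end.y += y_offset;
  start = ClipToImage(img, start);
  end = ClipToImage(img, end);
  if (start.x > end.x) std::swap(start, end);
  long sum = 0;
  for (int x = start.x; x <= end.x; ++x) {
    sum += img.data[static_cast<std::size_t>(start.y) * img.width + x];
  }
  return static_cast<double>(sum) / (end.x - start.x + 1);
}
```

If the clamp is skipped (as the crash suggests was happening), the shifted row index goes negative and the read lands before the buffer.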

Original comment by rtaylor...@gmail.com on 22 Jul 2015 at 6:38

GoogleCodeExporter commented 9 years ago
Ok, after some work I traced this to a Microsoft Visual Studio 2013 Community 
compiler optimization bug that skipped the calls to TruncateToImageBounds() 
inside TextlineProjection.cpp.  This allowed reads of memory outside the 
image's buffer and, usually, an access violation.

This was hard to find because the problem didn't happen in debug mode (no 
optimization) and would cease happening if I added any kind of program logic 
before or after the calls to the Truncate...() method.  It also only occurred 
in 32-bit code - no problem in 64-bit build.  Also, code would work for some 
images alone but not when those images were being processed among multiple 
threads running Tesseract.  An annoying problem to isolate (as are many 
compiler-optimization problems).

For our build (still using VS2013) we used pragmas to disable optimization for 
the TruncateToImageBounds() method - that seems to work based on our testing.

I think that the VS2015 Community Edition compiler fixes this - they claim to 
have fixed "500 compiler bugs" (but no specifics that I can find yet).  Tests, 
so far, aren't showing this problem.

This issue can be closed.

Original comment by rtaylor...@gmail.com on 6 Aug 2015 at 3:43

GoogleCodeExporter commented 9 years ago
Thanks for the info. Do you plan to test VS2015? Could you also post the 
pragmas that you used to solve this problem?

Original comment by zde...@gmail.com on 6 Aug 2015 at 8:00

GoogleCodeExporter commented 9 years ago
You're welcome.

I did some minimal testing with VS2015, but it was using current code instead 
of 3.02.02 code where I isolated the original problem.  For the moment we are 
still building with VS2013, so thorough VS2015 testing may not happen soon.

One of our customers reported access violation crashes in an earlier version we 
built with VS2008, but we haven't gotten enough feedback from them to certify 
that they were encountering the same problems.

In .../textord/textlineprojection.cpp I added VS pragma statements like this 
(starting @ line 752)

#pragma optimize("g", off)
// Helper truncates the TPOINT to be within the pix_.
void TextlineProjection::TruncateToImageBounds(TPOINT* pt) const {
  pt->x = ClipToRange<int>(pt->x, 0, pixGetWidth(pix_) - 1);
  pt->y = ClipToRange<int>(pt->y, 0, pixGetHeight(pix_) - 1);
}
#pragma optimize("", on)

This turns global (g) optimization off for the TruncateToImageBounds() method 
only.  I tried disabling optimization at a lower level (i.e. for ClipToRange() 
function), but that didn't eliminate the problem.

To make these changes cross-platform you'd want to add some #ifdef brackets 
around each pragma so that it is used only when building in the Visual Studio 
tool chain.
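
A guarded version might look roughly like the sketch below. It uses _MSC_VER, the macro the Microsoft compiler predefines, and stand-in names (ClipToRangeSketch, PointSketch, TruncateToBoundsSketch) so that it compiles on its own. These are hypothetical and are not the identifiers in textlineprojection.cpp.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for Tesseract's ClipToRange helper.
template <typename T>
T ClipToRangeSketch(T value, T lower, T upper) {
  assert(lower <= upper);
  return std::min(std::max(value, lower), upper);
}

// Hypothetical stand-in for TPOINT.
struct PointSketch {
  int x;
  int y;
};

// Disable global optimization for this one function, MSVC builds only.
#ifdef _MSC_VER
#pragma optimize("g", off)
#endif
void TruncateToBoundsSketch(PointSketch* pt, int width, int height) {
  pt->x = ClipToRangeSketch<int>(pt->x, 0, width - 1);
  pt->y = ClipToRangeSketch<int>(pt->y, 0, height - 1);
}
// Restore the optimization settings given on the command line.
#ifdef _MSC_VER
#pragma optimize("", on)
#endif
```

Other compilers never see the pragmas, so they emit no unknown-pragma warnings and optimize the function as usual.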

-= Rich

Original comment by rtaylor...@gmail.com on 11 Aug 2015 at 6:10

GoogleCodeExporter commented 9 years ago
Thanks, I committed this to github.com:
https://github.com/tesseract-ocr/tesseract/commit/9d359cf58a920ad068a3a4b159e6c3e3b0511f8b

Original comment by zde...@gmail.com on 16 Aug 2015 at 7:43