Open GoogleCodeExporter opened 9 years ago
Your report is strange:
1. You wrote that you use tesseract 3.02, but that version does not provide pdf
output.
2. You wrote that you use code from SVN around March 2015, but this project
switched from svn to git in 2014/08 (see [1]), and a few weeks ago we moved to
github.com (see the announcement on the main page [2]).
So it looks like you need to use the correct code from the correct place...
[1] https://code.google.com/p/tesseract-ocr/source/list?r=736d32747333a5ff68162975c04054bc30792572&r=298e31465a445e54defedd076217ff24b1af3fc2
[2] https://code.google.com/p/tesseract-ocr/
Original comment by zde...@gmail.com
on 22 Jul 2015 at 8:26
It looks like I copied a command line from calling a recent (3.03-ish) build
that could generate PDF. That, however, is just a distraction.
The attached image causes an access violation during the internal segmentation
code - i.e. before recognition, and long before any actual output (hocr, pdf,
text, etc.)
As far as I could tell the Google Code source for Tesseract started getting
mirrored to GitHub sometime around July 2014. However, I see no announcements
that it was clearly moved to GitHub until your "Tesseract moved to github"
posting on June 14, 2015. Based on comparisons, the source from Google Code
(SVN) and from GitHub (GIT) was exactly the same until recently.
I will also test with the posted 3.02.02 binaries - that may remove any issue
of which source was used.
Finally - this bug is apparent through code inspection. The logic in
TextlineProjection works on a line of pixels. The calling code selects a line
segment to analyze and calls MeanPixelsInLineSegment() with the current line
and then with other adjacent lines chosen by offsetting x or y by +2/-2 +1/-3.
When analyzing a horizontal line at y=0, the adjacent line at y=-2 will
try to read pixel data outside the image buffer - which causes an access
violation unless the memory happens (by luck) to be readable.
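To illustrate the hazard described above, here is a minimal sketch, not Tesseract code. GrayImage, ClampCoord and MeanPixelsInRow are hypothetical names invented for this example, and the image is assumed to be a plain row-major 8-bit buffer rather than a Leptonica Pix. It shows how clamping the coordinates before indexing keeps a request for row y=-2 inside the buffer.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for an 8-bit grayscale image; not a Leptonica Pix.
struct GrayImage {
  int width;
  int height;
  std::vector<std::uint8_t> data;  // row-major, width * height bytes
};

// Clamp a coordinate into [0, limit - 1] so it always indexes a valid
// row or column. Assumes limit >= 1.
inline int ClampCoord(int value, int limit) {
  return std::max(0, std::min(value, limit - 1));
}

// Mean pixel value along the horizontal segment [x0, x1] on row y.
// All coordinates are clamped first, so y = -2 is read as row 0 instead
// of as memory in front of the buffer.
inline int MeanPixelsInRow(const GrayImage& img, int x0, int x1, int y) {
  x0 = ClampCoord(x0, img.width);
  x1 = ClampCoord(x1, img.width);
  y = ClampCoord(y, img.height);
  if (x1 < x0) std::swap(x0, x1);
  long sum = 0;
  for (int x = x0; x <= x1; ++x) {
    sum += img.data[static_cast<std::size_t>(y) * img.width + x];
  }
  return static_cast<int>(sum / (x1 - x0 + 1));
}
```

Without the three ClampCoord() calls, the index computed for y = -2 would point before the start of data, which is the kind of read that can trigger an access violation.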
Original comment by rtaylor...@gmail.com
on 22 Jul 2015 at 6:38
Ok, after some work I traced this to a Microsoft Visual Studio 2013 Community
compiler optimization bug that was skipping the calls to
TruncateToImageBounds() inside TextlineProjection.cpp. This allowed reads of
memory outside of the image's buffer and, usually, caused an access
violation.
This was hard to find because the problem didn't happen in debug mode (no
optimization) and would cease happening if I added any kind of program logic
before or after the calls to the Truncate...() method. It also only occurred
in 32-bit code - no problem in 64-bit build. Also, code would work for some
images alone but not when those images were being processed among multiple
threads running Tesseract. An annoying problem to isolate (as are many
compiler-optimization problems).
For our build (still using VS2013) we used pragmas to disable optimization for
the TruncateToImageBounds() method - that seems to work based on our testing.
I think that the VS2015 Community Edition compiler fixes this - they claim to
have fixed "500 compiler bugs" (but no specifics that I can find yet). Tests,
so far, aren't showing this problem.
This issue can be closed.
Original comment by rtaylor...@gmail.com
on 6 Aug 2015 at 3:43
Thanks for the info. Do you plan to test VS2015? Will you also post the pragmas
that you used to solve this problem?
Original comment by zde...@gmail.com
on 6 Aug 2015 at 8:00
YW
I did some minimal testing with VS2015, but it was using current code instead
of 3.02.02 code where I isolated the original problem. For the moment we are
still building with VS2013, so thorough VS2015 testing may not happen soon.
One of our customers reported access violation crashes in an earlier version we
built with VS2008, but we haven't gotten enough feedback from them to certify
that they were encountering the same problems.
In .../textord/textlineprojection.cpp I added VS pragma statements like this
(starting @ line 752)
#pragma optimize("g", off)
// Helper truncates the TPOINT to be within the pix_.
void TextlineProjection::TruncateToImageBounds(TPOINT* pt) const {
  pt->x = ClipToRange<int>(pt->x, 0, pixGetWidth(pix_) - 1);
  pt->y = ClipToRange<int>(pt->y, 0, pixGetHeight(pix_) - 1);
}
#pragma optimize("", on)
This turns global (g) optimization off for the TruncateToImageBounds() method
only. I tried disabling optimization at a lower level (i.e. for ClipToRange()
function), but that didn't eliminate the problem.
To make these changes cross-platform you'd want to add some #ifdef brackets
around each pragma so that it is used only when building in the Visual Studio
tool chain.
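A minimal sketch of what those #ifdef brackets could look like, assuming the predefined _MSC_VER macro is used to detect the Visual Studio compiler. Point and TruncateToBounds are hypothetical stand-ins for TPOINT and TruncateToImageBounds() so that the example compiles on its own; this is not necessarily the form that ended up in the Tesseract source.

```cpp
#include <algorithm>

// Hypothetical stand-in for Tesseract's TPOINT.
struct Point {
  int x;
  int y;
};

// Disable global optimization for the next function, MSVC only.
// Other compilers never see the pragma, so they emit no
// unknown-pragma warnings.
#ifdef _MSC_VER
#pragma optimize("g", off)
#endif
// Hypothetical free-function version of the clamp helper: forces the
// point into [0, width - 1] x [0, height - 1].
void TruncateToBounds(Point* pt, int width, int height) {
  pt->x = std::max(0, std::min(pt->x, width - 1));
  pt->y = std::max(0, std::min(pt->y, height - 1));
}
// Restore the optimization settings given on the command line.
#ifdef _MSC_VER
#pragma optimize("", on)
#endif
```

Note that _MSC_VER is also defined by compilers that emulate MSVC, so a stricter check may be needed if the workaround should apply to one specific compiler version only.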
-= Rich
Original comment by rtaylor...@gmail.com
on 11 Aug 2015 at 6:10
Thanks, I committed it to github.com:
https://github.com/tesseract-ocr/tesseract/commit/9d359cf58a920ad068a3a4b159e6c3e3b0511f8b
Original comment by zde...@gmail.com
on 16 Aug 2015 at 7:43
Original issue reported on code.google.com by
rtaylor...@gmail.com
on 8 Jul 2015 at 10:53