Open GoogleCodeExporter opened 9 years ago
Your report is strange:
1. you wrote you use tesseract 3.02 - but that version does not provide pdf
output
2. you wrote you use code from SVN around March 2015 - but this project
switched from svn to git in 2014/08 (see [1]) and a few weeks ago we moved to
github.com (see the announcement on the main page [2]).
So it looks like you need to use the correct code from the correct place...
[1]
https://code.google.com/p/tesseract-ocr/source/list?r=736d32747333a5ff68162975c04054bc30792572&r=298e31465a445e54defedd076217ff24b1af3fc2
[2] https://code.google.com/p/tesseract-ocr/
Original comment by zde...@gmail.com
on 22 Jul 2015 at 8:26
It looks like I copied a command line from a call to a recent (3.03-ish) build
that could generate PDF output. That, however, is just a distraction.
The attached image causes an access violation during the internal segmentation
code - i.e. before recognition, and long before any actual output (hocr, pdf,
text, etc.)
As far as I could tell the Google Code source for Tesseract started getting
mirrored to GitHub sometime around July 2014. However, I see no announcements
that it was clearly moved to GitHub until your "Tesseract moved to github"
posting on June 14, 2015. Based on comparisons, the source from Google Code
(SVN) and from GitHub (GIT) was exactly the same until recently.
I will also test with the posted 3.02.02 binaries - that may remove any issue
of which source was used.
Finally - this bug is apparent through code inspection. The logic in
TextlineProjection works on a line of pixels. The calling code selects a line
segment to analyze and calls MeanPixelsInLineSegment() with the current line
and then with other adjacent lines chosen by offsetting x or y by +2/-2 +1/-3.
When analyzing a horizontal line at y=0, the adjacent line where y=-2 will be
trying to read pixel data outside the image buffer - which causes an access
violation unless the memory happens (lucky?) to be readable.
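To illustrate the failure mode described above, here is a minimal sketch of clamping a line segment to the image before reading it. This is not Tesseract's code: GrayImage, ClampToRange() and MeanPixelsInRow() are hypothetical names made up for this example, and the image is assumed to be a plain 8-bit row-major buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for an image: 8-bit grayscale, row-major.
struct GrayImage {
  int width;
  int height;
  std::vector<std::uint8_t> pixels;  // width * height bytes
};

// Clamp a coordinate into [0, limit - 1].
inline int ClampToRange(int v, int limit) {
  return std::max(0, std::min(v, limit - 1));
}

// Mean pixel value along a horizontal segment on row y, from x_start to
// x_end inclusive. All coordinates are clamped first, so a caller asking
// for y = -2 reads row 0 instead of memory in front of the buffer.
int MeanPixelsInRow(const GrayImage& img, int x_start, int x_end, int y) {
  assert(img.width > 0 && img.height > 0);
  x_start = ClampToRange(x_start, img.width);
  x_end = ClampToRange(x_end, img.width);
  y = ClampToRange(y, img.height);
  if (x_start > x_end) std::swap(x_start, x_end);

  long sum = 0;
  for (int x = x_start; x <= x_end; ++x) {
    sum += img.pixels[static_cast<std::size_t>(y) * img.width + x];
  }
  return static_cast<int>(sum / (x_end - x_start + 1));
}
```

Without the three clamping calls, the index computed for y = -2 would be negative and the read would land outside the pixel buffer, which is the access described above.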
Original comment by rtaylor...@gmail.com
on 22 Jul 2015 at 6:38
Ok, after some work I traced this to a Microsoft Visual Studio 2013 Community
compiler optimization bug that was skipping the calls to
TruncateToImageBounds() inside TextlineProjection.cpp. This allowed data
access to memory outside of the image's buffer and, usually, an access
violation.
This was hard to find because the problem didn't happen in debug mode (no
optimization) and would cease happening if I added any kind of program logic
before or after the calls to the Truncate...() method. It also only occurred
in 32-bit code - no problem in 64-bit build. Also, code would work for some
images alone but not when those images were being processed among multiple
threads running Tesseract. An annoying problem to isolate (as are many
compiler-optimization problems).
For our build (still using VS2013) we used pragmas to disable optimization for
the TruncateToImageBounds() method - that seems to work based on our testing.
I think that the VS2015 Community Edition compiler fixes this - they claim to
have fixed "500 compiler bugs" (but no specifics that I can find yet). Tests,
so far, aren't showing this problem.
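For reference, a workaround of this kind can be sketched with MSVC's `#pragma optimize`. This is only an illustration and not the exact change made in our build: Point and TruncatePointToBounds() are hypothetical stand-ins, not the real TruncateToImageBounds() from TextlineProjection.cpp.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical point type for this sketch.
struct Point {
  int x;
  int y;
};

// MSVC only: turn off all optimizations for the functions defined between
// the two pragmas. The pragma must sit outside of any function body.
#if defined(_MSC_VER)
#pragma optimize("", off)
#endif

// Hypothetical stand-in for a bounds-truncation helper: clamps pt into
// the rectangle [0, width - 1] x [0, height - 1].
void TruncatePointToBounds(int width, int height, Point* pt) {
  assert(pt != nullptr && width > 0 && height > 0);
  pt->x = std::max(0, std::min(pt->x, width - 1));
  pt->y = std::max(0, std::min(pt->y, height - 1));
}

// Restore whatever optimization settings were given on the command line.
#if defined(_MSC_VER)
#pragma optimize("", on)
#endif
```

The `_MSC_VER` guard keeps other compilers from warning about an unknown pragma. Only the functions between the two pragmas are affected, so the rest of the file is still built with the normal release settings.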
This issue can be closed.
Original comment by rtaylor...@gmail.com
on 6 Aug 2015 at 3:43
Thanks for the info. Do you plan to test VS2015? Will you also post the pragmas
that you used to solve this problem?
Original comment by zde...@gmail.com
on 6 Aug 2015 at 8:00
Original issue reported on code.google.com by
rtaylor...@gmail.com
on 8 Jul 2015 at 10:53