Correction: the input file's name should be g11071.tif (see attachment).
Sorry
Best regards
Original comment by maximums...@googlemail.com
on 30 Mar 2012 at 1:40
OK, it seems I was right that this is an endianness issue (as you already know,
the PowerPC processor is big-endian).
I quickly changed the related code (../classify/adaptmatch.cpp) to switch off
byte swapping as follows:
Original (../classify/adaptmatch.cpp, lines 563-565):
if (!shape_table_->DeSerialize(tessdata_manager.swap(),
tessdata_manager.GetDataFilePtr())) {
tprintf("GRRR Error loading shape table!\n");
Patch:
if (!shape_table_->DeSerialize(false, tessdata_manager.GetDataFilePtr())) {
tprintf("GRRR Error loading shape table!\n");
After recompiling and running I'm getting the following output in the terminal:
Tesseract Open Source OCR Engine v3.02 with Leptonica
Segmentation fault
Arrrgghhh...
Original comment by maximums...@googlemail.com
on 30 Mar 2012 at 2:09
I have the same problem with the same parameters (leptonica-1.68, etc.) on a
Windows 7 32-bit system. Tesseract was built following the instructions (Visual
Studio 2008 Developer Notes) with Microsoft Visual C++ 2008 Express Edition with
SP1 - ENU.
Original comment by povver...@gmail.com
on 2 Apr 2012 at 3:01
Thank you for your reply.
If you're experiencing the same problem on Windows, then it can hardly be an
endianness-related issue.
There seems to be something wrong with data serialization.
Can someone from the core development team comment on this issue?
Any suggestions?
Original comment by maximums...@googlemail.com
on 2 Apr 2012 at 10:58
Tested on Windows 7 64bit, using VS2008 SP1 (32-bit compiler), DLL_Debug &
DLL_Release configurations, r718 on supplied g11071.tif image. I get the
expected output with no assert errors.
Original comment by tomp2...@gmail.com
on 4 Apr 2012 at 10:27
Also tested on Ubuntu 11.10 Intel 32bit, r718 and I get the expected output. So
it seems like this is indeed a Mac OSX specific problem.
Original comment by tomp2...@gmail.com
on 4 Apr 2012 at 11:05
First of all, thank you for looking into that!
I tested the whole thing on Ubuntu 10.10, Intel x86 32bit, r718 as well and got
the expected output.
But not on Mac OS X PowerPC! Here is what I did:
- svn update to r718
- tried to build with a newer compiler, GCC 4.4 from MacPorts (my first
builds were done with Apple's ancient GCC 4.0.1)
I still get:
Error loading shape table!
Tesseract Open Source OCR Engine v3.02 with Leptonica
index >= 0 && index < size_used_:Error:Assert failed:in file
../ccutil/genericvector.h, line 512
Bus error
As already mentioned I'd like to look closer into the issue.
Can someone point me in the right direction of how to tweak the configure
script in order to make a debug build (with debug symbols suitable for
debugging with GDB)?
Tesseract's configure script does not support --enable-debug, and setting
CFLAGS and CXXFLAGS does not produce the desired output either...
Thanks in advance
Original comment by maximums...@googlemail.com
on 5 Apr 2012 at 12:02
The automake build of Tesseract does not produce a separate "release" version,
so there is no reason to add "--enable-debug" (e.g. "Error:Assert" is only
available in "debug" mode). Debug symbols are present by default ("-g -O2")
as long as you do not run "make install-strip". See:
http://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html#Debugging-Options
CFLAGS and CXXFLAGS do not produce anything by themselves. They are environment
variables that affect the compiler. Try the following command to see how they work:
CFLAGS="-g3 -O0" CXXFLAGS='-g3 -O0' ./configure
Original comment by zde...@gmail.com
on 5 Apr 2012 at 7:09
Recompiled with CFLAGS="-g3 -O0" CXXFLAGS='-g3 -O0' ./configure --disable-shared
The key was the option "--disable-shared". Now I can use GDB on it.
Results will be reported soon.
Thanks
Original comment by maximums...@googlemail.com
on 5 Apr 2012 at 10:57
Yeppp, I found out where the problem resides.
The following line of code reads the shape data incorrectly:
../classify/shapetable.cpp, line 67, Shape::DeSerialize()
if (fread(&unichars_sorted_, sizeof(unichars_sorted_), 1, fp) != 1)
return false;
The problem is caused by the "sizeof(unichars_sorted_)" expression, which
assumes that the "bool" data type is ONE BYTE wide (int8_t). But on PowerPC
the "bool" data type is FOUR BYTES wide (int32_t). Here is the proof (GDB
command):
On Ubuntu, x86, 32bit: print sizeof(bool) = 1
On MacOSX, PPC, 64bit: print sizeof(bool) = 4
Data serialized with a 1-byte bool (for example on x86) cannot be de-serialized
correctly on PowerPC, where bool is 4 bytes wide.
This unfortunately breaks data serialization/de-serialization completely on
systems with bool > 1 byte and should be fixed quickly IMHO!
I'm not sure what the proper fix should look like.
One can set up the compiler to generate one-byte bools or hardcode the number of
bytes to read like this:
if (fread(&unichars_sorted_, 1, 1, fp) != 1)
return false;
I suspect there are more such places in the code...
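A minimal sketch (not Tesseract code) of why the element size matters: a record with a one-byte flag followed by a payload is read back with two different assumed flag widths. The helper name OffsetAfterFlag and the 5-byte record are made up for illustration. With a width of 4, the read swallows three payload bytes and every later field is misaligned.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: writes a record as a 1-byte-bool platform would
// (one flag byte plus a 4-byte payload), reads the flag back assuming it
// is `flag_width` bytes wide, and returns the resulting file offset.
// Returns -1 on any I/O failure.
long OffsetAfterFlag(std::size_t flag_width) {
  std::FILE* fp = std::tmpfile();
  if (fp == nullptr) return -1;

  // Writer side: flag = true, followed by four payload bytes.
  const unsigned char record[5] = {0x01, 0x2A, 0x00, 0x00, 0x00};
  if (std::fwrite(record, 1, sizeof(record), fp) != sizeof(record)) {
    std::fclose(fp);
    return -1;
  }
  std::rewind(fp);

  // Reader side: consume the flag using the assumed width.
  unsigned char buffer[4] = {0};
  long offset = -1;
  if (flag_width <= sizeof(buffer) &&
      std::fread(buffer, flag_width, 1, fp) == 1) {
    offset = std::ftell(fp);
  }
  std::fclose(fp);
  return offset;
}
```

OffsetAfterFlag(1) leaves the stream at offset 1, where the payload starts. OffsetAfterFlag(4) leaves it at offset 4, in the middle of the payload.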
Comments from core development team would be highly appreciated!
Best regards
Maxim
Original comment by maximums...@googlemail.com
on 5 Apr 2012 at 3:00
Ping?
@previous_post
Thank you
Best regards
Original comment by maximums...@googlemail.com
on 9 Apr 2012 at 9:37
A patch fixing this issue has been submitted to the developer mailing list.
Waiting for response...
Original comment by maximums...@googlemail.com
on 13 Apr 2012 at 11:48
Well, the mailing list is moderated, so your post is not visible. If you have a
patch, you need to attach it to the issue anyway.
Original comment by zde...@gmail.com
on 13 Apr 2012 at 3:38
Ok, @attached_patch.
Here is a short description of the issue:
The size of the bool type can vary between platforms/architectures.
Using sizeof(bool) therefore produces platform-dependent behavior, which is
undesirable when data files are exchanged between different
platforms/architectures.
In the case of issue 669, a file produced on x86 cannot be read
properly on PowerPC due to the problem described above.
The attached patch removes this dependency by ensuring that the sorted flag
written to/read from the file is always ONE byte wide.
The patch is a unified diff as produced by "svn diff"...
Please review and commit.
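The attached patch is not reproduced in this thread. As a rough sketch of the idea only, assuming hypothetical helper names (WriteFlag, ReadFlag, RoundTrips) that are not part of Tesseract, the flag can be converted to a fixed-width byte before writing and converted back after reading, so that sizeof(bool) never reaches the file format:

```cpp
#include <cstdint>
#include <cstdio>

// Writes the flag as exactly one byte, independent of sizeof(bool).
bool WriteFlag(bool flag, std::FILE* fp) {
  const std::uint8_t byte = flag ? 1 : 0;
  return std::fwrite(&byte, sizeof(byte), 1, fp) == 1;
}

// Reads exactly one byte and converts it back to bool.
bool ReadFlag(bool* flag, std::FILE* fp) {
  std::uint8_t byte = 0;
  if (std::fread(&byte, sizeof(byte), 1, fp) != 1) return false;
  *flag = (byte != 0);
  return true;
}

// Round trip through a temporary file. Returns true only if exactly one
// byte was written and the value read back equals the value written.
bool RoundTrips(bool flag) {
  std::FILE* fp = std::tmpfile();
  if (fp == nullptr) return false;
  bool ok = WriteFlag(flag, fp);
  const long written = std::ftell(fp);
  std::rewind(fp);
  bool restored = !flag;  // start from the opposite value
  ok = ok && ReadFlag(&restored, fp);
  std::fclose(fp);
  return ok && written == 1 && restored == flag;
}
```

Because the conversion goes through the value rather than through the memory of the bool object, the result does not depend on the width or byte order of bool.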
Best regards
Maxim
Original comment by maximums...@googlemail.com
on 13 Apr 2012 at 3:53
please test r722
Original comment by zde...@gmail.com
on 17 Apr 2012 at 5:35
Thank you for the fix.
While Shape::DeSerialize works as expected, Shape::Serialize remains broken on
PowerPC. Let me explain why.
As stated above, the bool type is stored as a 32-bit integer on PowerPC, where
the least significant byte contains the actual value (true/false) and the three
others are set to zero.
The chart below shows the data layout and transfer between memory and register
for Intel x86 (little-endian) and PowerPC (big-endian):
        Memory (low to high address)    Register
x86:    0x01 ZZ YY XX                   rEAX = 0xXX YY ZZ 01
ppc:    0xXX YY ZZ 01                   r31  = 0xXX YY ZZ 01
X, Y and Z are padding bytes with no meaning (set to zero).
The serialization code
if (fwrite(&unichars_sorted_, 1, 1, fp) != 1) return false;
will ALWAYS write zero into the resulting data stream because the first byte in
memory is one of the padding bytes (X). For PowerPC, you have to modify the
code in order to write the 4th byte instead!
Fortunately, the de-serialization code still works. When the flag is true,
fread places 0x01 into the 1st memory byte. The PowerPC processor
doesn't reverse memory bytes and reads the value as 0x01000000, which is
still true because it's != 0.
There are two possible solutions:
- change the variable unichars_sorted_ from bool to inT8 OR
- reformat the bool into inT8 as I did in my previous patch, that is:
inT8 temp;
temp = !!unichars_sorted_; // normalize the value to 0 or 1 in a machine-independent way
if (fwrite(&temp, 1, 1, fp) != 1) return false;
It may look a bit ugly, but it doesn't impact the underlying code at all. The 1st
solution (change bool to inT8) may run slower on PowerPC due to alignment
constraints.
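The layout argument above can be checked without PowerPC hardware. The sketch below models a 4-byte flag (as reported for this platform in the thread) as a 32-bit word with an explicit byte order. All function names are hypothetical and only illustrate the reasoning.

```cpp
#include <cstdint>

// Lays out a 32-bit value in memory order for the given byte order.
void StoreWord(std::uint32_t value, bool big_endian, unsigned char out[4]) {
  for (int i = 0; i < 4; ++i) {
    const int shift = big_endian ? 8 * (3 - i) : 8 * i;
    out[i] = static_cast<unsigned char>((value >> shift) & 0xFF);
  }
}

// Interprets four bytes in memory order as a 32-bit value.
std::uint32_t LoadWord(const unsigned char in[4], bool big_endian) {
  std::uint32_t value = 0;
  for (int i = 0; i < 4; ++i) {
    const int shift = big_endian ? 8 * (3 - i) : 8 * i;
    value |= static_cast<std::uint32_t>(in[i]) << shift;
  }
  return value;
}

// The byte at the lowest address, i.e. what fwrite(&flag, 1, 1, fp)
// would emit if the flag occupied four bytes.
unsigned char FirstByteOfFlag(bool flag, bool big_endian) {
  unsigned char memory[4];
  StoreWord(flag ? 1u : 0u, big_endian, memory);
  return memory[0];
}

// What a 4-byte flag holds after fread(&flag, 1, 1, fp) stores 0x01 into
// the lowest address of zero-initialized storage.
std::uint32_t ValueAfterOneByteRead(bool big_endian) {
  const unsigned char memory[4] = {0x01, 0x00, 0x00, 0x00};
  return LoadWord(memory, big_endian);
}
```

In this model a true flag serializes as 0x01 in little-endian order but as 0x00 in big-endian order, while a one-byte read yields 0x01000000 in big-endian order, which is still non-zero.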
Best regards
Maxim
Original comment by maximums...@googlemail.com
on 18 Apr 2012 at 9:38
I upgraded Leptonica to 1.68 with Macports and then built r722 from the svn
using a modified port file. I tested Tesseract on a straight, 1 column scan of
a text file and it worked. The text file generated is very good.
Original comment by mdbec...@gmail.com
on 19 Apr 2012 at 6:24
Thank you for testing. r722 works on my G5 Mac too, as long as I only run
recognition.
However, the serialization code is still broken on PowerPC as described in my
previous post. You cannot test it until you try to create your own trained data.
I patched the code in order to test Shape::Serialize. It doesn't crash but does
produce wrong results.
My intention is just to get this issue properly and completely fixed.
Best regards
maxim
Original comment by maximums...@googlemail.com
on 19 Apr 2012 at 7:35
The fix will be checked in soon and will be in the 3.02 release.
Sorry about the mess-up. No more bools will be serialized directly!
I also revised the types of other serialized data in ShapeTable for when int is
no longer 32 bits.
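The actual changes in r743 are not shown in this thread. As an illustration of the general technique only, with hypothetical names (WriteCount, BytesWrittenForCount), an integer field can be given a fixed on-disk width by converting it to a fixed-width type before writing. Byte order is still the host's here, so a swap step like the one in the existing loader would remain necessary.

```cpp
#include <cstdint>
#include <cstdio>
#include <limits>

// Writes a count as exactly four bytes, whatever the width of int or long.
// Values that do not fit into 32 bits are rejected instead of truncated.
bool WriteCount(long long count, std::FILE* fp) {
  if (count < std::numeric_limits<std::int32_t>::min() ||
      count > std::numeric_limits<std::int32_t>::max()) {
    return false;
  }
  const std::int32_t fixed = static_cast<std::int32_t>(count);
  return std::fwrite(&fixed, sizeof(fixed), 1, fp) == 1;
}

// Returns the number of bytes WriteCount put into a fresh temporary file,
// or -1 if the value was rejected or the file could not be created.
long BytesWrittenForCount(long long count) {
  std::FILE* fp = std::tmpfile();
  if (fp == nullptr) return -1;
  const long bytes = WriteCount(count, fp) ? std::ftell(fp) : -1;
  std::fclose(fp);
  return bytes;
}
```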
Original comment by theraysm...@gmail.com
on 20 Sep 2012 at 1:21
This issue was closed by revision r743.
Original comment by theraysm...@gmail.com
on 21 Sep 2012 at 3:20
Original issue reported on code.google.com by
maximums...@googlemail.com
on 30 Mar 2012 at 1:19