Tesseract r714 is broken on Mac OS X 10.5.8 (PPC)

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Install Leptonica and Tesseract

2. Command: tesseract g11071.gif text.txt -l deu

What is the expected output?
Terminal:

Tesseract Open Source OCR Engine v3.02 with Leptonica

Output file:
jagt, sie_ kön nen hier nicht bei uns woh   

What do you see instead?
Terminal:

Error loading shape table!
Tesseract Open Source OCR Engine v3.02 with Leptonica
index >= 0 && index < size_used_:Error:Assert failed:in file 
../ccutil/genericvector.h, line 512
Bus error

Output file: none

What version of the product are you using?
Tesseract 3.02 (r714),
leptonica-1.68 (libgif 4.1.6 : libjpeg 8d : libpng 1.4.9 : libtiff 3.9.5 : zlib 
1.2.6)

On what operating system?
Mac OS X 10.5.8 (PowerPC processor)

Please provide any additional information below.

The first message indicates that loading the shape table failed on Mac/PowerPC.
It might be an endianness issue. The corresponding place in the source is:

../classify/adaptmatch.cpp lines 563-565

The reason for the 2nd one (ASSERT triggering) is completely opaque to me.

I could try to find out more (I have developer skills) if someone could point
me to where to start.

BTW, version 3.01 from MacPorts doesn't work either, for a related
reason (ASSERT triggered)...

Original issue reported on code.google.com by maximums...@googlemail.com on 30 Mar 2012 at 1:19

GoogleCodeExporter commented 9 years ago

Correction: the input file's name should be g11071.tif (see attachment).
Sorry
Best regards

Original comment by maximums...@googlemail.com on 30 Mar 2012 at 1:40

Attachments:

g11071.tif

GoogleCodeExporter commented 9 years ago

OK, it seems I was right that this is an endianness-related issue (as you
already know, the PowerPC processor is big-endian).
I quickly changed the related code (../classify/adaptmatch.cpp) to switch off
byte swapping as follows:

Original (../classify/adaptmatch.cpp, lines 563-565):

if (!shape_table_->DeSerialize(tessdata_manager.swap(),
                                     tessdata_manager.GetDataFilePtr())) {
        tprintf("GRRR Error loading shape table!\n");

Patch:
if (!shape_table_->DeSerialize(false, tessdata_manager.GetDataFilePtr())) {
        tprintf("GRRR Error loading shape table!\n");

After recompiling and running I'm getting the following output in the terminal:

Tesseract Open Source OCR Engine v3.02 with Leptonica
Segmentation fault

Arrrgghhh...

Original comment by maximums...@googlemail.com on 30 Mar 2012 at 2:09

GoogleCodeExporter commented 9 years ago

I have the same problem with the same setup (leptonica-1.68, etc.) on a 32-bit
Windows 7 system. Tesseract was built following the instructions (Visual Studio
2008 Developer Notes) with Microsoft Visual C++ 2008 Express Edition with SP1 - ENU.

Original comment by povver...@gmail.com on 2 Apr 2012 at 3:01

GoogleCodeExporter commented 9 years ago

Thank you for your reply.

If you're experiencing the same problem on Windows then it can hardly be an
endianness-related issue.
There seems to be something wrong with data serialization.

Can someone from the core development team comment on this issue?
Any suggestions?

Original comment by maximums...@googlemail.com on 2 Apr 2012 at 10:58

GoogleCodeExporter commented 9 years ago

Tested on Windows 7 64bit, using VS2008 SP1 (32-bit compiler), DLL_Debug & 
DLL_Release configurations, r718 on supplied g11071.tif image. I get the 
expected output with no assert errors.

Original comment by tomp2...@gmail.com on 4 Apr 2012 at 10:27

GoogleCodeExporter commented 9 years ago

Also tested on Ubuntu 11.10 Intel 32bit, r718, and I get the expected output. So
it seems like this is indeed a Mac OS X-specific problem.

Original comment by tomp2...@gmail.com on 4 Apr 2012 at 11:05

GoogleCodeExporter commented 9 years ago

First of all, thank you for looking into that!

I tested the whole thing on Ubuntu 10.10, Intel x86 32bit, r718 as well and got 
the expected output.

But not on Mac OS X PowerPC! Here is what I did:
- svn update to r718
- tried to build with a newer compiler, GCC 4.4 from MacPorts (my first
builds used Apple's ancient GCC 4.0.1)

I still get:

Error loading shape table!
Tesseract Open Source OCR Engine v3.02 with Leptonica
index >= 0 && index < size_used_:Error:Assert failed:in file 
../ccutil/genericvector.h, line 512
Bus error

As already mentioned I'd like to look closer into the issue.

Can someone point me in the right direction on how to tweak the configure
script in order to make a debug build (with debug symbols suitable for
debugging with GDB)?

Tesseract's configure does not support --enable-debug, and setting CFLAGS &
CXXFLAGS does not produce the desired output either...

Thanks in advance

Original comment by maximums...@googlemail.com on 5 Apr 2012 at 12:02

GoogleCodeExporter commented 9 years ago

The automake build of Tesseract does not produce a separate "release" version,
so there is no need for "--enable-debug" (e.g. "Error:Assert" is something you
only get in "debug" mode). Debug symbols are present (unless you run
"make install-strip"); the default flags are "-g -O2", see:
http://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html#Debugging-Options

CFLAGS & CXXFLAGS do not produce anything by themselves. They are environment
variables that affect the compiler. Try the following command to see how they
work:
CFLAGS="-g3 -O0" CXXFLAGS='-g3 -O0' ./configure

Original comment by zde...@gmail.com on 5 Apr 2012 at 7:09

GoogleCodeExporter commented 9 years ago

Recompiled with CFLAGS="-g3 -O0" CXXFLAGS='-g3 -O0' ./configure --disable-shared

The key was the option "--disable-shared". Now I can use GDB on it.

Results will be reported soon.
Thanks

Original comment by maximums...@googlemail.com on 5 Apr 2012 at 10:57

GoogleCodeExporter commented 9 years ago

Yeppp, I found out where the problem resides.

The following code line causes invalid shape data reading:

../classify/shapetable.cpp, line 67, Shape::DeSerialize()

if (fread(&unichars_sorted_, sizeof(unichars_sorted_), 1, fp) != 1)
    return false;

The problem is caused by the "sizeof(unichars_sorted_)" expression, which relies
on the assumption that the "bool" data type is ONE BYTE wide (int8_t). But on
PowerPC the "bool" data type is FOUR BYTES wide (int32_t). Below is the proof
(GDB command):

On Ubuntu, x86, 32bit: print sizeof(bool) = 1
On MacOSX, PPC, 64bit: print sizeof(bool) = 4

Data serialized with a one-byte bool (for example on x86) cannot be
de-serialized correctly with a four-byte bool on PowerPC.
This unfortunately makes the whole data serialization/de-serialization
completely broken on systems where bool is wider than 1 byte, and it should be
fixed quickly IMHO!

I'm not sure what the proper fix should look like.
One could set up the compiler to generate one-byte bools, or hardcode the
number of bytes to read, like this:

if (fread(&unichars_sorted_, 1, 1, fp) != 1)
    return false;

I suspect there are more such places in the code...
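
A variant of the second option that does not depend on sizeof(bool) at all is to read into a fixed-width temporary and convert afterwards. This is only a minimal sketch: ReadBoolAsByte is a hypothetical helper, not something that exists in Tesseract.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical helper (not part of Tesseract): reads a flag that was stored
// as exactly one byte, whatever sizeof(bool) is on the current platform.
bool ReadBoolAsByte(std::FILE* fp, bool* value) {
  assert(fp != nullptr && value != nullptr);
  std::uint8_t raw = 0;  // always one byte wide
  if (std::fread(&raw, sizeof(raw), 1, fp) != 1) return false;
  *value = (raw != 0);  // any non-zero byte counts as true
  return true;
}
```

A call such as `if (!ReadBoolAsByte(fp, &unichars_sorted_)) return false;` would then take the place of the direct fread.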

Comments from core development team would be highly appreciated!
Best regards
Maxim

Original comment by maximums...@googlemail.com on 5 Apr 2012 at 3:00

GoogleCodeExporter commented 9 years ago

Ping?

@previous_post

Thank you
Best regards

Original comment by maximums...@googlemail.com on 9 Apr 2012 at 9:37

GoogleCodeExporter commented 9 years ago

A patch fixing this issue has been submitted to the developer mailing list.

Waiting for response...

Original comment by maximums...@googlemail.com on 13 Apr 2012 at 11:48

GoogleCodeExporter commented 9 years ago

Well, the mailing list is moderated, so it is not visible. If you have a patch,
you need to attach it to the issue anyway.

Original comment by zde...@gmail.com on 13 Apr 2012 at 3:38

GoogleCodeExporter commented 9 years ago

Ok, @attached_patch.

Below is a short description of the issue:

The size of the bool type can vary based on platform/architecture.
Using sizeof(bool) therefore produces platform-dependent behavior. That can be
undesirable when data files are exchanged between different
platforms/architectures.
In the case of issue 669, a file produced on the x86 arch cannot be read
properly on PowerPC due to the problem described above.

The attached patch removes this dependency by ensuring that the sorted flag 
written to/read from file will be always ONE byte wide.

The patch is a unified diff as produced by "svn diff"...

Please review and commit.
Best regards
Maxim

Original comment by maximums...@googlemail.com on 13 Apr 2012 at 3:53

Attachments:

issue669fix.patch

GoogleCodeExporter commented 9 years ago

please test r722

Original comment by zde...@gmail.com on 17 Apr 2012 at 5:35

GoogleCodeExporter commented 9 years ago

Thank you for the fix.

While Shape::DeSerialize works as expected, Shape::Serialize remains broken on 
PowerPC. Let me explain why.

As stated above, the bool type is stored as a 32-bit integer on PowerPC, where
the least significant byte contains the actual value (true/false) and the three
others are set to zero.

The chart below shows data layout and transfer between memory and register for 
Intel x86(little-endian) and PowerPC(big-endian):

----- Memory------------ Register-------------
x86: 0x01 ZZ YY XX       rEAX = 0xXX YY ZZ 01
ppc: 0xXX YY ZZ 01       r31  = 0xXX YY ZZ 01

XX, YY and ZZ are padding bytes with no meaning (set to zero).

The serialization code

if (fwrite(&unichars_sorted_, 1, 1, fp) != 1) return false;

will ALWAYS write zero into the resulting data stream on PowerPC, because the
first byte in memory is one of the padding bytes (XX). For PowerPC, you have to
modify the code in order to write the 4th byte instead!

Fortunately, the de-serialization code still works. When the flag is true,
fread places "true" (0x01) into the 1st byte in memory. The PowerPC processor
doesn't reverse the bytes in memory and reads the value as 0x01000000, which is
still true because it's != 0.
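
Both effects can be reproduced without PowerPC hardware by modelling the four bytes explicitly. The following is only a sketch: the byte arrays are hard-coded stand-ins for a 4-byte bool, the read model assumes the flag's memory was zeroed beforehand, and the names ByteAt, BigEndianWord and AfterReadingByte are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bytes = std::array<std::uint8_t, 4>;

// Memory images of a 4-byte flag holding the value 1 ("true").
constexpr Bytes kLittleEndianTrue = {0x01, 0x00, 0x00, 0x00};  // x86 layout
constexpr Bytes kBigEndianTrue = {0x00, 0x00, 0x00, 0x01};     // PowerPC layout

// The byte a one-byte fwrite would emit if it started at `offset` within the
// flag; fwrite(&flag, 1, 1, fp) corresponds to offset 0.
std::uint8_t ByteAt(const Bytes& memory, std::size_t offset) {
  assert(offset < memory.size());
  return memory[offset];
}

// Interprets four bytes the way a big-endian CPU loads them into a register.
std::uint32_t BigEndianWord(const Bytes& memory) {
  return (std::uint32_t{memory[0]} << 24) | (std::uint32_t{memory[1]} << 16) |
         (std::uint32_t{memory[2]} << 8) | std::uint32_t{memory[3]};
}

// Models fread(&flag, 1, 1, fp) into a zeroed 4-byte flag: only the byte at
// the lowest address is overwritten.
Bytes AfterReadingByte(std::uint8_t byte_from_file) {
  return Bytes{byte_from_file, 0x00, 0x00, 0x00};
}
```

In this model, ByteAt(kBigEndianTrue, 0) is the padding byte and ByteAt(kBigEndianTrue, 3) is the real value, while BigEndianWord(AfterReadingByte(1)) gives the non-zero word 0x01000000.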

There are two possible solutions:

- change the variable unichars_sorted_ from bool to inT8 OR
- convert the bool to inT8 as I did in my previous patch, that is:

inT8 temp;
temp = !!unichars_sorted_; // truncate the value in the machine-independent way
if (fwrite(&temp, 1, 1, fp) != 1) return false;

It may look a bit ugly but it doesn't impact the underlying code at all. The 1st
solution (change bool to inT8) may be slower on PowerPC due to alignment
constraints.
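
The second solution could also be wrapped in a small helper so that each call site stays a one-liner. Again only a sketch: WriteBoolAsByte is a made-up name, and std::uint8_t stands in for Tesseract's inT8.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical helper (not part of Tesseract): stores a bool as exactly one
// byte with value 0 or 1, independent of sizeof(bool) and of byte order.
bool WriteBoolAsByte(std::FILE* fp, bool value) {
  assert(fp != nullptr);
  const std::uint8_t raw = value ? 1 : 0;  // normalise before writing
  return std::fwrite(&raw, sizeof(raw), 1, fp) == 1;
}
```

A call such as `if (!WriteBoolAsByte(fp, unichars_sorted_)) return false;` would then replace the direct fwrite in the serialization code.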

Best regards
Maxim

Original comment by maximums...@googlemail.com on 18 Apr 2012 at 9:38

GoogleCodeExporter commented 9 years ago

I upgraded Leptonica to 1.68 with MacPorts and then built r722 from svn
using a modified port file. I tested Tesseract on a straight, 1-column scan of
a text file and it worked. The generated text file is very good.

Original comment by mdbec...@gmail.com on 19 Apr 2012 at 6:24

GoogleCodeExporter commented 9 years ago

Thank you for testing. r722 works on my G5 Mac too, as long as I only run
recognition.

Anyway, the serialization code is still broken on PowerPC as described in my
previous post. You cannot test it until you try to create your own trained
data. I patched the code in order to test Shape::Serialize. It doesn't crash
but does produce wrong results.

My intention is just to get this issue properly and completely fixed.

Best regards
maxim

Original comment by maximums...@googlemail.com on 19 Apr 2012 at 7:35

GoogleCodeExporter commented 9 years ago

The fix will be checked in soon and will be in the 3.02 release.
Sorry about the mess-up. No more bools will be serialized directly!
I also revised the types of other serialized data in ShapeTable for when int is
no longer 32 bits.

Original comment by theraysm...@gmail.com on 20 Sep 2012 at 1:21

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

This issue was closed by revision r743.

Original comment by theraysm...@gmail.com on 21 Sep 2012 at 3:20
