tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Invalid characters with Tesseract 5.1.0 and tessdata_fast data (for German version) when using 32-bit Microsoft compiler #3769

Closed krzysiekj94 closed 2 years ago

krzysiekj94 commented 2 years ago

Environment

Current Behavior:

I have the following problem:

  1. I prepared a custom build for Tesseract 5.1.0, so as to generate dlls, which I then use in the project of a 32-bit .exe application.
  2. I prepared the following dependencies with CMake 3.23 (without SW build): a. tesseract 5.1.0, leptonica 1.82.0, libtiff 4.3.0, libjpeg-turbo 2.1.3, zlib 1.2.11, libpng 1.6.37. b. Links to src:
  3. After generating the dependencies, I used them in a wrapper that uses CAPI and generated a dll file (also 32 bit) that I used in the application. The list of all dependencies is as follows: image
  4. In the next step, I ran an OCR test in the application with the German tessdata_fast model (deu.traineddata): https://github.com/tesseract-ocr/tessdata_fast.
  5. At this point, I noticed inferior recognition quality compared to the Tesseract 4.1.1 version, which I used earlier. image a. test file: test_file
  6. I noticed that there is also a problem with slash, for example: It is then changed to "jj" - see: image
  7. I would like to add that I have also prepared a Tesseract 4.1.1 compilation with the dependencies as in point 2b. The quality of OCR did not change then.
  8. I use tessdata_best as a temporary workaround (and it works), but the OCR speed for this model is not satisfactory for me.

Expected Behavior:

I expect Tesseract 5.1.0 to recognize characters correctly, i.e. not to convert "l", "m" or "i" to "j", for example, when using the tessdata_fast models. I would like character recognition to work as it does in Tesseract 4.1.1.

Suggested Fix:

Consideration of an upgrade for deu.traineddata models on the website: https://github.com/tesseract-ocr/tessdata_fast

stweil commented 2 years ago

Tesseract 5 still supports the model files from Tesseract 4 with the "legacy mode", so if you are happy with that, you can use it.

stweil commented 2 years ago

@krzysiekj94, I get a different result:

tesseract https://user-images.githubusercontent.com/12548678/158796308-0e0e8e57-ad24-4eb5-b70a-0c6b99722663.png - -l tessdata_fast/deu
Siegfried Aalfelden
Kurt-Schumacher-Platz 10
13405 Berlin

26.02.2019

Sehr geehrter Herr Aalfelden,

Informationen

Das Internet (von englisch internetwork, zusammengesetzt aus dem Präfix inter und
network „Netzwerk“ oder kurz net ‚Netz‘), umgangssprachlich auch Netz, ist ein
weltweiter Verbund von Rechnernetzwerken, den autonomen Systemen. Es ermöglicht
die Nutzung von Internetdiensten wie WWW, E-Mail, Telnet, SSH, XMPP, MOTT und
FTP. Dabei kann sich jeder Rechner mit jedem anderen Rechner verbinden. Der
Datenaustausch zwischen den über das Internet verbundenen Rechnern erfolgt über
die technisch normierten Internetprotokolle. Die Technik des Internets wird durch die
RFCs der Internet Engineering Task Force (IETF) beschrieben.

Die Verbreitung des Internets hat zu umfassenden Umwälzungen in vielen
Lebensbereichen geführt. Es trug zu einem Modernisierungsschub in vielen
Wirtschaftsbereichen sowie zur Entstehung neuer Wirtschaftszweige bei und hat zu
einem grundlegenden Wandel des Kommunikationsverhaltens und der Mediennutzung
im beruflichen und privaten Bereich geführt. Die kulturelle Bedeutung dieser
Entwicklung wird manchmal mit der Erfindung des Buchdrucks gleichgesetzt.

Die Übertragung von Daten im Internet unabhängig von ihrem Inhalt, dem Absender
und dem Empfänger wird als Netzneutralität bezeichnet.

Mit freundlichen Grüßen

krzysiekj94 commented 2 years ago

Hello @stweil , thanks for response. I have more questions now:

  1. Where exactly to get the tessdata_fast data for "legacy mode"? Based on the documentation https://github.com/tesseract-ocr/tessdata_fast I can see that legacy mode is not supported.
  2. Can legacy support be enabled when building Tesseract from CMakeLists.txt with CMake? If so, where is that option? Here is my CMake configuration for Tesseract: test
  3. On which version of Tesseract did you get the correct result? Where can I get the exact version you used? Is it 32-bit or 64-bit? Can you send a link to this version?
  4. Is it possible that tessdata_fast behaves differently on 32-bit and 64-bit versions of Tesseract 5.1.0?
  5. Could the problems come from using the C API? Perhaps I should switch to the C++ API? Below is an example of how I initialize Tesseract in my wrapper code: image image

Thanks in advance for your answer! Have a nice day.

stweil commented 2 years ago

Please try the OCR with the default tesseract application. If that works fine (like it does in my test) you have to find out what you have to fix in your application.

stweil commented 2 years ago

Please use the Tesseract user forum for questions. The GitHub issues are not a support forum.

You might try the Windows binaries from https://github.com/UB-Mannheim/tesseract/wiki/.

zdenop commented 2 years ago

Legacy model is available only in https://github.com/tesseract-ocr/tessdata.

Shreeshrii commented 2 years ago

I use tessdata_best as a temporary workaround (and it works), but the OCR speed for this model is not satisfactory for me.

Then please try, as suggested above, with model from https://github.com/tesseract-ocr/tessdata which has legacy models as well as the 'fast' version of 'tessdata_best' models. Both are available in the same traineddata file, invoked with different --oem settings.

krzysiekj94 commented 2 years ago

Please try the OCR with the default tesseract application. If that works fine (like it does in my test) you have to find out what you have to fix in your application.

  1. After installing the Mannheim build, the OCR seems to work fine, but I don't know where the difference comes from. When I build the 32-bit console tesseract.exe myself, it works incorrectly with the tessdata_fast data -> see: image. In my opinion it may be something related to the Visual Studio compiler? Is this a good direction? I'll check it out again....

krzysiekj94 commented 2 years ago

I found an issue that seems similar to my problem: https://github.com/tesseract-ocr/tesseract/issues/3283. As a test, I disabled the /O2 optimization option. To my surprise, disabling /O2 made the OCR return almost the same text as Tesseract 4.1.1, which is what I expected. See below for the settings in Visual Studio 2019 and for the differences in the text:

image

image

Attention: Now I have a question: Would it be wise if I compiled tesseractlib 5.1.0 with the /O2 optimization option turned off? Does this have any unexpected consequences? Maybe someone has similar problems and experiences? I feel close to solving the problem.

stweil commented 2 years ago

Related issues: #2898 and #3283.

stweil commented 2 years ago

After installing the Mannheim build, the OCR seems to work fine, but I don't know where the difference comes from.

The UB Mannheim binaries are built with the GNU compiler. Therefore they don't have this issue.

zdenop commented 2 years ago

64-bit works for me:

>tesseract -v
tesseract 5.1.0-7-g0e526
 leptonica-1.83.0 (Jan 26 2022, 19:15:03) [MSC v.1929 LIB Release x64]
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.0.91) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 2019
 Found libarchive 3.5.1 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6 libzstd/1.4.9
 Found libcurl/7.75.0 zlib/1.2.11 libssh2/1.10.1_DEV
>tesseract i3769.png - -l tessdata_fast/deu
Siegfried Aalfelden
Kurt-Schumacher-Platz 10
13405 Berlin

26.02.2019

Sehr geehrter Herr Aalfelden,

Informationen

Das Internet (von englisch internetwork, zusammengesetzt aus dem Präfix inter und
network „Netzwerk“ oder kurz net ‚Netz‘), umgangssprachlich auch Netz, ist ein
weltweiter Verbund von Rechnernetzwerken, den autonomen Systemen. Es ermöglicht
die Nutzung von Internetdiensten wie WWW, E-Mail, Telnet, SSH, XMPP, MOTT und
FTP. Dabei kann sich jeder Rechner mit jedem anderen Rechner verbinden. Der
Datenaustausch zwischen den über das Internet verbundenen Rechnern erfolgt über
die technisch normierten Internetprotokolle. Die Technik des Internets wird durch die
RFCs der Internet Engineering Task Force (IETF) beschrieben.

Die Verbreitung des Internets hat zu umfassenden Umwälzungen in vielen
Lebensbereichen geführt. Es trug zu einem Modernisierungsschub in vielen
Wirtschaftsbereichen sowie zur Entstehung neuer Wirtschaftszweige bei und hat zu
einem grundlegenden Wandel des Kommunikationsverhaltens und der Mediennutzung
im beruflichen und privaten Bereich geführt. Die kulturelle Bedeutung dieser
Entwicklung wird manchmal mit der Erfindung des Buchdrucks gleichgesetzt.

Die Übertragung von Daten im Internet unabhängig von ihrem Inhalt, dem Absender
und dem Empfänger wird als Netzneutralität bezeichnet.

Mit freundlichen Grüßen

zdenop commented 2 years ago

Can you try /Ox instead of /O2?

krzysiekj94 commented 2 years ago

Hello @zdenop .

1). The problem still exists with /Ox. OCR returns the same result as with the /O2 flag.

image

image

2). In the case of the /O1 flag, the results are even worse:

image

krzysiekj94 commented 2 years ago

64-bit works for me:

In my case, unfortunately, I can't use the x64 version because I have a 32-bit application that uses Tesseract's .dll's :(

amitdo commented 2 years ago

Now I have a question: Would it be wise if I compiled tesseractlib 5.1.0 with the /O2 optimization option turned off? Does this have any unexpected consequences?

You tried it yourself with a good result. The expected consequence is much slower program execution.

Which version of MSVC 2019 exactly do you use? If it's not the latest one (16.11.11), can you upgrade to the latest one and retest?

If the issue still exists with the latest MSVC 2019 version, I suggest sending a new bug report to Microsoft, or reusing this one: https://developercommunity2.visualstudio.com/t/1336629.

amitdo commented 2 years ago

This is similar to issue #3283.

I closed this issue because it seems to be an issue with MSVC, not with Tesseract.

If a future version of MSVC solves the issue, let us know.

krzysiekj94 commented 2 years ago

At the moment I'm using VS version 16.9.6 (older version) but I compiled on a different computer with the same VS 2019 x86 version. Interestingly, with /O2 optimization, but without AVX2, OCR works fine. Why? I do not know.

Edit: However, I noticed that after copying the generated Tesseract from a computer without AVX2 support, the problem occurs with copied dll's on a computer that supports AVX2. So I'll have to check on VS 16.11.11 anyway.

image

stweil commented 2 years ago

Interestingly, with /O2 optimization, but without AVX2, OCR works fine.

So the Microsoft compiler creates buggy code with /O2 for intsimdmatrixavx2.cpp.

@krzysiekj94, you could try to add #pragma optimize( "", off ) in that file and test whether that fixes the issue. If that works, you could also try #pragma optimize( "s", on ) as an additional pragma.

krzysiekj94 commented 2 years ago

Interestingly, with /O2 optimization, but without AVX2, OCR works fine.

So the Microsoft compiler creates buggy code with /O2 for intsimdmatrixavx2.cpp.

@krzysiekj94, you could try to add #pragma optimize( "", off ) in that file and test whether that fixes the issue. If that works, you could also try #pragma optimize( "s", on ) as an additional pragma.

Hello @stweil ,

1). It looks like adding only #pragma optimize( "", off ) to intsimdmatrixavx2.cpp works - see the code and compare the results: image image

2). After adding only #pragma optimize( "s", on ) to intsimdmatrixavx2.cpp, the OCR quality is worse:

image image

3). After adding #pragma optimize( "", off ) and #pragma optimize( "s", on ) together, I get the same result as when I added only #pragma optimize( "", off ):

image image

My question is: I understand that by "you could also try #pragma optimize( "s", on ) as an additional pragma" you mean using these two #pragma together - as in step 3?

krzysiekj94 commented 2 years ago

Now I have a question: Would it be wise if I compiled tesseractlib 5.1.0 with the /O2 optimization option turned off? Does this have any unexpected consequences?

You tried it yourself with a good result. The expected consequence is much slower program execution.

Which version of MSVC 2019 exactly do you use? If it's not the latest one (16.11.11), can you upgrade to the latest one and retest?

If the issue still exists with the latest MSVC 2019 version, I suggest sending a new bug report to Microsoft, or reusing this one: https://developercommunity2.visualstudio.com/t/1336629.

@amitdo On version 16.11.11 the problem still recurs. I checked it.

stweil commented 2 years ago

My question is: I understand that by "you could also try #pragma optimize( "s", on ) as an additional pragma" you mean using these two #pragma together - as in step 3?

Yes, that's right. The first pragma disables the optimization options from your build environment. This was expected to work, but disabling all optimizations might result in bad performance. The second pragma therefore enables size optimization (similar to compiler option /Os). See the Microsoft documentation for details.

Now those two pragma statements should be included conditionally, namely only for 32-bit builds and those compiler versions which show the bug. Maybe you can find out how this can be done with preprocessor conditionals. Then those code lines can be added to the official code.
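How the two pragmas scope can be illustrated with a small sketch. This is not the patch itself: DotProductInt8 and SumInt8 are hypothetical stand-ins for code such as that in intsimdmatrixavx2.cpp. The optimize pragma must appear outside a function and applies to the functions defined after it; the form with an empty string and "on" resets the optimizations to those given on the command line. Other compilers skip the pragmas entirely because of the _MSC_VER check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(_MSC_VER)
// Turn off all optimizations for the functions defined below ...
#pragma optimize("", off)
// ... then re-enable only the size-oriented ones ("s").
#pragma optimize("s", on)
#endif

// Hypothetical stand-in for an affected routine: an int8 dot product.
std::int32_t DotProductInt8(const std::int8_t* u, const std::int8_t* v,
                            std::size_t n) {
  assert(n == 0 || (u != nullptr && v != nullptr));
  std::int32_t total = 0;
  for (std::size_t i = 0; i < n; ++i) {
    total += static_cast<std::int32_t>(u[i]) * static_cast<std::int32_t>(v[i]);
  }
  return total;
}

#if defined(_MSC_VER)
// Reset to the optimizations requested on the command line (/O2, /Ox, ...).
#pragma optimize("", on)
#endif

// Defined after the reset, so it is compiled with the normal build settings.
std::int32_t SumInt8(const std::int8_t* u, std::size_t n) {
  std::int32_t total = 0;
  for (std::size_t i = 0; i < n; ++i) {
    total += u[i];
  }
  return total;
}
```

Bracketing only the affected functions this way keeps the rest of the translation unit at full optimization, which may limit the slowdown compared with disabling /O2 for the whole library.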

krzysiekj94 commented 2 years ago

Below is a suggested fix for VS 2019 x86 versions 16.5 - 16.11 (https://docs.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-170).

See: https://pastebin.com/K1q5PzRf

image

stweil commented 2 years ago

That looks good. Do you want to send a pull request? Then just add a comment (and an empty line after line 16).

zdenop commented 2 years ago

The problem occurs with the 32-bit build only, so there should be a check for a 64-bit build (_WIN64), as _WIN32 is defined for both 32-bit and 64-bit builds.

amitdo commented 2 years ago

In #3283, Windows 10 64-bit with VS 2019 32-bit build was used. How can we detect this combination?

stweil commented 2 years ago

Only a build-time check is needed (#if ... defined(_WIN32) && !defined(_WIN64) ...). The resulting 32-bit code fails on both 32-bit and 64-bit Windows.
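The build-time check can be sketched as follows. This is an illustration, not the merged patch: the macro name MSVC_X86_O2_WORKAROUND and the helper WorkaroundEnabled are hypothetical, and the _MSC_VER range 1925 to 1929 is assumed here because it corresponds to the VS 2019 versions 16.5 to 16.11 proposed in this thread. Since _WIN32 is defined for both 32-bit and 64-bit targets, the !defined(_WIN64) term is what narrows the guard to 32-bit builds.

```cpp
#include <cassert>

// 1 only for 32-bit Windows targets built with an assumed-affected MSVC
// version; 0 for 64-bit targets, other MSVC versions and other compilers.
#if defined(_MSC_VER) && defined(_WIN32) && !defined(_WIN64) && \
    _MSC_VER >= 1925 && _MSC_VER <= 1929
#define MSVC_X86_O2_WORKAROUND 1
#else
#define MSVC_X86_O2_WORKAROUND 0
#endif

#if MSVC_X86_O2_WORKAROUND
// Disable all optimizations, then re-enable size optimization only.
#pragma optimize("", off)
#pragma optimize("s", on)
#endif

// Hypothetical helper that reports which branch the preprocessor took,
// so the guard can be checked from a test or a log message.
constexpr bool WorkaroundEnabled() { return MSVC_X86_O2_WORKAROUND != 0; }
```

Because the decision is made by the preprocessor, it depends only on the target of the build, not on the Windows version the binary later runs on.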

krzysiekj94 commented 2 years ago

@amitdo Hmm, from what I can see, switching between x86 and x64 in the platform combo box turns the #pragma section on and off once I add defined(WIN32); this is probably what MS uses to detect the compile mode. Maybe this is a way to solve the problem?

image image

Code below:

#if defined(_MSC_VER) && defined(_WIN32) && defined(WIN32) && _MSC_VER >= 1925 && _MSC_VER <= 1929
#pragma optimize("", off)
#pragma optimize("s", on)
#endif

Article showing differences with using _WIN32 & WIN32: https://accu.org/journals/overload/24/132/wilson_2223/ I hope I understood it correctly.

zdenop commented 2 years ago

WIN32 is defined by the SDK or the build environment, so it does not use the implementation reserved namespace

see: https://stackoverflow.com/questions/662084/whats-the-difference-between-the-win32-and-win32-defines-in-c

The non-underscore WIN32 is not well documented and appears to have no bearing on 32 vs 64 machine type. Standard Visual C++ projects for Windows generally don't appear to use it (it may not be in use at all).

see: https://stackoverflow.com/questions/17380340/win32-preprocessor-definition-in-64bit-windows-platform/51682888#51682888

Also https://docs.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-160 mentions only _WIN32 / _WIN64

stweil commented 2 years ago

That's right, and it should be sufficient to use only those two official macros (see my previous comment).

BJungmann commented 2 years ago

I noticed that after copying the generated Tesseract from a computer without AVX2 support, the problem occurs with copied dll's on a computer that supports AVX2

When I did my tests for #3283, I also tried to disable AVX2 usage with the statement avx2_available_ = false; in src/arch/simddetect.cpp, line 200. This is where the decision is made at runtime, so it should work on machines that have AVX2 available. This gave good results, and did not slow things down that much. So I suggest enclosing this statement in the proper _MSC_VER macros, rather than turning off the optimizations with #pragma optimize.
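The runtime approach described here can be sketched like this. All names (Kernel, CpuFeatures, ChooseKernel, kForceNoAvx2) are hypothetical and are not the identifiers used in src/arch/simddetect.cpp, the fallback order is an assumption, and the _MSC_VER range is the one proposed earlier in this thread. The sketch only shows the idea: override the detected AVX2 flag at build time so that a slower but correct kernel is selected, even on machines that report AVX2.

```cpp
#include <cassert>

// Assumed-affected configuration: 32-bit target, VS 2019 16.5 to 16.11.
#if defined(_MSC_VER) && defined(_WIN32) && !defined(_WIN64) && \
    _MSC_VER >= 1925 && _MSC_VER <= 1929
constexpr bool kForceNoAvx2 = true;
#else
constexpr bool kForceNoAvx2 = false;
#endif

// Hypothetical set of kernels a dispatcher could choose from.
enum class Kernel { kGeneric, kSse41, kAvx2 };

// Hypothetical result of a CPU feature probe done at program start.
struct CpuFeatures {
  bool sse41;
  bool avx2;
};

// Picks the fastest permitted kernel. Clearing the AVX2 flag here has the
// same intent as the "avx2_available_ = false;" statement mentioned above.
Kernel ChooseKernel(CpuFeatures cpu, bool force_no_avx2 = kForceNoAvx2) {
  if (force_no_avx2) {
    cpu.avx2 = false;
  }
  if (cpu.avx2) {
    return Kernel::kAvx2;
  }
  if (cpu.sse41) {
    return Kernel::kSse41;
  }
  return Kernel::kGeneric;
}
```

Unlike the pragma approach, this leaves the compiler's optimization settings untouched and simply avoids running the miscompiled code path.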

On version 16.11.11 the problem still recurs. I checked it.

Thank you for checking that. So my preferred workaround is still using VS 2019 with platform toolset v141 (which belongs to VS 2017) - you need no code patch then.

krzysiekj94 commented 2 years ago

Hi @BJungmann,

1). Thanks for the suggestion to use the VS 2017 toolset. I made a sample build with version 15.9.45 Community - see below: image

2). I ran performance tests. Compared with the build that disables optimization in intsimdmatrixavx2.cpp, the difference is dramatic, in favor of VS 2017 - see the test below for a 7-page TIFF. I haven't checked your patch, but I think it will be above that time as well. The OCR results are identical. image

3). IMO, it seems that any change made with #pragma will increase the OCR execution time... I saw that you have already reported this problem to the Microsoft team, but it has the status "Closed - Not Enough Info" - https://developercommunity2.visualstudio.com/t/1336629. Will you report this issue to Microsoft again? I was wondering whether to do it myself, but maybe you already have it in your plans?

BJungmann commented 2 years ago

Indeed execution time with the avx2available patch is increased, but considerably less than with all optimizations turned off. This is the reason why I still recommend platform toolset v141. I have no current plans to start a new effort with Microsoft. They like very short demonstration code for bug reports. A short main program and data set that shows wrong results would be feasible. But I do not understand enough details in the tesseract code using the MatrixDotVector functions, to see which call to which function with which data produces different results if executed with AVX2 hardware.
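A harness of the kind described here might look like the sketch below. It rests on assumptions: MatVecFn, ReferenceMatVec and CountMismatches are hypothetical names and do not reproduce Tesseract's actual MatrixDotVector signatures. The idea is to run a plain scalar reference and a suspect build of a kernel over the same deterministic int8 data and count the outputs that differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical kernel signature: out[r] = sum over c of weights[r][c] * input[c].
using MatVecFn = void (*)(const std::vector<std::int8_t>& weights,
                          std::size_t rows, std::size_t cols,
                          const std::vector<std::int8_t>& input,
                          std::vector<std::int32_t>& out);

// Plain scalar reference, written to be obviously correct rather than fast.
void ReferenceMatVec(const std::vector<std::int8_t>& weights, std::size_t rows,
                     std::size_t cols, const std::vector<std::int8_t>& input,
                     std::vector<std::int32_t>& out) {
  out.assign(rows, 0);
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      out[r] += static_cast<std::int32_t>(weights[r * cols + c]) *
                static_cast<std::int32_t>(input[c]);
    }
  }
}

// Runs both kernels on the same generated data and returns the number of
// output rows that differ (0 means the candidate matches the reference).
std::size_t CountMismatches(MatVecFn candidate, std::size_t rows,
                            std::size_t cols) {
  std::vector<std::int8_t> weights(rows * cols);
  std::vector<std::int8_t> input(cols);
  // Small linear congruential generator so every run uses identical data.
  std::uint32_t state = 12345u;
  auto next = [&state]() {
    state = state * 1664525u + 1013904223u;
    return static_cast<std::int8_t>(static_cast<int>((state >> 24) & 0xFFu) - 128);
  };
  for (auto& w : weights) w = next();
  for (auto& x : input) x = next();

  std::vector<std::int32_t> expected;
  std::vector<std::int32_t> actual;
  ReferenceMatVec(weights, rows, cols, input, expected);
  candidate(weights, rows, cols, input, actual);

  // A wrong output size counts as every row being wrong.
  if (actual.size() != expected.size()) {
    return std::max(actual.size(), expected.size());
  }
  std::size_t mismatches = 0;
  for (std::size_t i = 0; i < expected.size(); ++i) {
    if (expected[i] != actual[i]) ++mismatches;
  }
  return mismatches;
}
```

To use it as a compiler bug report, the candidate would be the same routine compiled with the suspect flags in a separate translation unit, and a nonzero return value on AVX2 hardware would be the short demonstration that is asked for.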

amitdo commented 2 years ago

@stweil, can you push a workaround for this issue?

stweil commented 2 years ago

@stweil, can you push a workaround for this issue?

Something like #3778?

amitdo commented 2 years ago

Yes :-)