Closed krzysiekj94 closed 2 years ago
Tesseract 5 still supports the model files from Tesseract 4 with the "legacy mode", so if you are happy with that, you can use it.
@krzysiekj94, I get a different result:
tesseract https://user-images.githubusercontent.com/12548678/158796308-0e0e8e57-ad24-4eb5-b70a-0c6b99722663.png - -l tessdata_fast/deu
Siegfried Aalfelden
Kurt-Schumacher-Platz 10
13405 Berlin
26.02.2019
Sehr geehrter Herr Aalfelden,
Informationen
Das Internet (von englisch internetwork, zusammengesetzt aus dem Präfix inter und
network „Netzwerk“ oder kurz net ‚Netz‘), umgangssprachlich auch Netz, ist ein
weltweiter Verbund von Rechnernetzwerken, den autonomen Systemen. Es ermöglicht
die Nutzung von Internetdiensten wie WWW, E-Mail, Telnet, SSH, XMPP, MOTT und
FTP. Dabei kann sich jeder Rechner mit jedem anderen Rechner verbinden. Der
Datenaustausch zwischen den über das Internet verbundenen Rechnern erfolgt über
die technisch normierten Internetprotokolle. Die Technik des Internets wird durch die
RFCs der Internet Engineering Task Force (IETF) beschrieben.
Die Verbreitung des Internets hat zu umfassenden Umwälzungen in vielen
Lebensbereichen geführt. Es trug zu einem Modernisierungsschub in vielen
Wirtschaftsbereichen sowie zur Entstehung neuer Wirtschaftszweige bei und hat zu
einem grundlegenden Wandel des Kommunikationsverhaltens und der Mediennutzung
im beruflichen und privaten Bereich geführt. Die kulturelle Bedeutung dieser
Entwicklung wird manchmal mit der Erfindung des Buchdrucks gleichgesetzt.
Die Übertragung von Daten im Internet unabhängig von ihrem Inhalt, dem Absender
und dem Empfänger wird als Netzneutralität bezeichnet.
Mit freundlichen Grüßen
Hello @stweil , thanks for response. I have more questions now:
Thanks in advance for your answer! Have a nice day.
Please try the OCR with the default tesseract application. If that works fine (like it does in my test), you have to find out what you have to fix in your application.
Please use the Tesseract user forum for questions. The GitHub issues are not a support forum.
You might try the Windows binaries from https://github.com/UB-Mannheim/tesseract/wiki/.
The legacy model is available only in https://github.com/tesseract-ocr/tessdata.
I use tessdata_best as a temporary workaround (and it works), but the OCR speed of this model is not satisfactory for me.
Then please try, as suggested above, the model from https://github.com/tesseract-ocr/tessdata, which contains the legacy models as well as the 'fast' version of the 'tessdata_best' models. Both are available in the same traineddata file and are selected with different --oem settings.
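As a rough illustration of the --oem selection mentioned above, here is a minimal C++ sketch that assembles the command line for one OCR run. The helper name BuildTesseractArgs and the paths passed to it are hypothetical; the numeric values follow what `tesseract --help-oem` prints (0 legacy only, 1 LSTM only, 2 both, 3 default).

```cpp
#include <string>
#include <vector>

// Engine modes accepted by the tesseract CLI option --oem
// (as listed by `tesseract --help-oem`).
enum class EngineMode {
  kLegacyOnly = 0,     // legacy engine only
  kLstmOnly = 1,       // neural network (LSTM) engine only
  kLegacyAndLstm = 2,  // both engines combined
  kDefault = 3         // whatever the traineddata file provides
};

// Build the argument list for one OCR run that prints the text to stdout.
// Hypothetical helper: image, tessdata_dir and lang come from the caller.
std::vector<std::string> BuildTesseractArgs(const std::string &image,
                                            const std::string &tessdata_dir,
                                            const std::string &lang,
                                            EngineMode mode) {
  return {"tesseract", image, "-",
          "--tessdata-dir", tessdata_dir,
          "-l", lang,
          "--oem", std::to_string(static_cast<int>(mode))};
}
```

With the combined tessdata repository checked out, calling the helper once with EngineMode::kLegacyOnly and once with EngineMode::kLstmOnly would give two command lines that differ only in the --oem value, which makes it easy to compare both engines on the same image.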
I found an issue that seems similar to my problem: https://github.com/tesseract-ocr/tesseract/issues/3283. As a test, I disabled the /O2 optimization option. To my surprise, disabling /O2 caused the OCR to return almost the same text as Tesseract 4.1.1, which is what I expected. See below for the settings in Visual Studio 2019 and for the differences in the text:
Attention: Now I have a question: Would it be wise if I compiled tesseractlib 5.1.0 with the /O2 optimization option turned off? Does this have any unexpected consequences? Maybe someone has similar problems and experiences? I feel close to solving the problem.
Related issues: #2898 and #3283.
After installing the UB Mannheim build, the OCR seems to work fine, but I don't know why.
The UB Mannheim binaries are built with the GNU compiler. Therefore they don't have this issue.
64-bit works for me:
>tesseract -v
tesseract 5.1.0-7-g0e526
leptonica-1.83.0 (Jan 26 2022, 19:15:03) [MSC v.1929 LIB Release x64]
libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.0.91) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 2019
Found libarchive 3.5.1 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6 libzstd/1.4.9
Found libcurl/7.75.0 zlib/1.2.11 libssh2/1.10.1_DEV
>tesseract i3769.png - -l tessdata_fast/deu
Siegfried Aalfelden
Kurt-Schumacher-Platz 10
13405 Berlin
26.02.2019
Sehr geehrter Herr Aalfelden,
Informationen
Das Internet (von englisch internetwork, zusammengesetzt aus dem Präfix inter und
network „Netzwerk“ oder kurz net ‚Netz‘), umgangssprachlich auch Netz, ist ein
weltweiter Verbund von Rechnernetzwerken, den autonomen Systemen. Es ermöglicht
die Nutzung von Internetdiensten wie WWW, E-Mail, Telnet, SSH, XMPP, MOTT und
FTP. Dabei kann sich jeder Rechner mit jedem anderen Rechner verbinden. Der
Datenaustausch zwischen den über das Internet verbundenen Rechnern erfolgt über
die technisch normierten Internetprotokolle. Die Technik des Internets wird durch die
RFCs der Internet Engineering Task Force (IETF) beschrieben.
Die Verbreitung des Internets hat zu umfassenden Umwälzungen in vielen
Lebensbereichen geführt. Es trug zu einem Modernisierungsschub in vielen
Wirtschaftsbereichen sowie zur Entstehung neuer Wirtschaftszweige bei und hat zu
einem grundlegenden Wandel des Kommunikationsverhaltens und der Mediennutzung
im beruflichen und privaten Bereich geführt. Die kulturelle Bedeutung dieser
Entwicklung wird manchmal mit der Erfindung des Buchdrucks gleichgesetzt.
Die Übertragung von Daten im Internet unabhängig von ihrem Inhalt, dem Absender
und dem Empfänger wird als Netzneutralität bezeichnet.
Mit freundlichen Grüßen
Can you try /Ox instead of /O2?
Hello @zdenop .
1). The problem still exists with /Ox. The OCR returns the same result as with the /O2 flag.
2). With the /O1 flag, the results are even worse:
In my case, unfortunately, I can't use the x64 version because I have a 32-bit application that uses Tesseract's DLLs :(
Now I have a question: Would it be wise if I compiled tesseractlib 5.1.0 with the /O2 optimization option turned off? Does this have any unexpected consequences?
You tried it yourself with a good result. The expected consequence is much slower program execution.
Which version of MSVC 2019 exactly do you use? If it's not the latest one (16.11.11), can you upgrade to the latest one and retest?
If the issue still exists with the latest MSVC 2019 version, I suggest sending a new bug report to Microsoft, or reusing this one: https://developercommunity2.visualstudio.com/t/1336629.
This is similar to issue #3283.
I closed this issue because it seems to be an issue with MSVC, not with Tesseract.
If a future version of MSVC solves the issue, let us know.
At the moment I'm using VS version 16.9.6 (older version) but I compiled on a different computer with the same VS 2019 x86 version. Interestingly, with /O2 optimization, but without AVX2, OCR works fine. Why? I do not know.
Edit: However, I noticed that after copying the Tesseract DLLs built on a computer without AVX2 support, the problem occurs with the copied DLLs on a computer that supports AVX2. So I'll have to check on VS 16.11.11 anyway.
Interestingly, with /O2 optimization, but without AVX2, OCR works fine.
So the Microsoft compiler creates buggy code with /O2 for intsimdmatrixavx2.cpp.
@krzysiekj94, you could try to add #pragma optimize( "", off ) in that file and test whether that fixes the issue. If that works, you could also try #pragma optimize( "s", on ) as an additional pragma.
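For readers unfamiliar with these pragmas, the following is a minimal sketch of how they could be placed in a source file. DotProduct is a hypothetical stand-in, not the actual code in intsimdmatrixavx2.cpp, and the guard on _MSC_VER is an assumption added so that other compilers ignore the MSVC-specific lines.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(_MSC_VER)
// MSVC only: ignore /O2 (or /Ox) for the functions defined after this line.
#  pragma optimize("", off)
// Optional second step: favor small code instead of no optimization at all.
#  pragma optimize("s", on)
#endif

// Hypothetical stand-in for the integer dot product code in the affected file.
std::int32_t DotProduct(const std::int8_t *u, const std::int8_t *v,
                        std::size_t n) {
  std::int32_t total = 0;
  for (std::size_t i = 0; i < n; ++i) {
    total += static_cast<std::int32_t>(u[i]) * static_cast<std::int32_t>(v[i]);
  }
  return total;
}

#if defined(_MSC_VER)
// Restore the optimization settings given on the compiler command line.
#  pragma optimize("", on)
#endif
```

The pragma has to appear outside of any function and applies from the next function definition onwards, so placing it near the top of the file affects everything defined in that translation unit.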
Hello @stweil ,
1). Adding only #pragma optimize( "", off ) to intsimdmatrixavx2.cpp seems to work - see the code and the compared results:
2). After adding only #pragma optimize( "s", on ) to intsimdmatrixavx2.cpp, the OCR quality is worse.
3). After adding #pragma optimize( "", off ) and #pragma optimize( "s", on ) together, I get the same result as when I added only #pragma optimize( "", off ).
My question is: I understand that by "you could also try #pragma optimize( "s", on ) as an additional pragma" you mean using these two pragmas together, as in step 3?
@amitdo On version 16.11.11 the problem still recurs. I checked it.
My question is: I understand that by "you could also try #pragma optimize( "s", on ) as an additional pragma" you mean using these two pragmas together, as in step 3?
Yes, that's right. The first pragma disables the optimization options from your build environment. This was expected to work, but disabling all optimizations might result in bad performance. The second pragma therefore enables size optimization (similar to compiler option /Os). See the Microsoft documentation for details.
Now those two pragma statements should be included conditionally, namely only for 32-bit builds and those compiler versions which show the bug. Maybe you can find out how this can be done with preprocessor conditionals. Then those code lines can be added to the official code.
Below I have added a suggested fix for VS x86 versions 16.5 - 16.11 (https://docs.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-170).
That looks good. Do you want to send a pull request? Then just add a comment (and an empty line after line 16).
The problem is with the 32-bit build only, so there should be a check for a 64-bit (_WIN64) build, as _WIN32 is defined for both 32-bit and 64-bit builds.
In #3283, Windows 10 64-bit with VS 2019 32-bit build was used. How can we detect this combination?
Only a build-time check is needed (#if ... defined(_WIN32) && !defined(_WIN64) ...). The resulting 32-bit code fails on both 32-bit and 64-bit Windows.
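A possible shape for such a build-time check is sketched below. The macro name AFFECTED_MSVC_X86_BUILD and the function IsAffectedBuild are hypothetical, and the _MSC_VER range 1925 to 1929 is an assumption based on Microsoft's predefined-macros table for Visual Studio 2019 versions 16.5 to 16.11; it is not taken from the patch discussed in this thread.

```cpp
// Hypothetical macro name, chosen for this sketch only.
// Assumption: _MSC_VER 1925..1929 covers Visual Studio 2019 16.5 to 16.11.
// _WIN32 is also defined for 64-bit targets, hence the extra !_WIN64 test.
#if defined(_MSC_VER) && _MSC_VER >= 1925 && _MSC_VER <= 1929 && \
    defined(_WIN32) && !defined(_WIN64)
#  define AFFECTED_MSVC_X86_BUILD 1
#else
#  define AFFECTED_MSVC_X86_BUILD 0
#endif

// The same rule as a plain function, so it can be checked on any compiler.
constexpr bool IsAffectedBuild(int msc_ver, bool win32, bool win64) {
  return msc_ver >= 1925 && msc_ver <= 1929 && win32 && !win64;
}

#if AFFECTED_MSVC_X86_BUILD
// Only affected 32-bit MSVC builds give up /O2 for the rest of this file.
#  pragma optimize("", off)
#  pragma optimize("s", on)
#endif
```

Because the check runs in the preprocessor, it depends only on the target of the build, not on the Windows version the compiler or the resulting binary runs on.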
@amitdo Hmm, from what I can see, after switching between x86 and x64 in the platform combo box, the #pragma section turns on and off once I add defined(WIN32) - this is probably what MS uses to detect the compile mode. Maybe this is a way to solve the problem?
Code below:
Article showing differences with using _WIN32 & WIN32: https://accu.org/journals/overload/24/132/wilson_2223/ I hope I understood it correctly.
WIN32 is defined by the SDK or the build environment, so it does not use the implementation reserved namespace
The non-underscore WIN32 is not well documented and appears to have no bearing on 32 vs 64 machine type. Standard Visual C++ projects for Windows generally don't appear to use it (it may not be in use at all).
Also https://docs.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-160 mentions only _WIN32 / _WIN64.
That's right, and it should be sufficient to use only those two official macros (see my previous comment).
I noticed that after copying the generated Tesseract from a computer without AVX2 support, the problem occurs with copied dll's on a computer that supports AVX2
When I did my tests for #3283, I also tried to disable AVX2 usage with the statement
avx2_available_ = false;
in src/arch/simddetect.cpp, line 200. This is where the decision is made at runtime, so it should work on machines that have avx2 available.
This gave good results and did not slow things down that much. So I suggest enclosing this statement in the proper _MSC_VER macros, rather than turning off the optimizations with #pragma optimize.
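The idea of switching off AVX2 at run time only for the affected compiler could look roughly like the sketch below. SimdFlags and ApplyCompilerWorkaround are hypothetical, simplified stand-ins and do not reproduce the real code in src/arch/simddetect.cpp; the version range in the guard is likewise an assumption.

```cpp
// Hypothetical, simplified stand-in for the feature flags set during CPU detection.
struct SimdFlags {
  bool avx2_available = false;
  bool sse41_available = false;
};

// Assumption: 32-bit builds with MSVC 2019 (16.5 to 16.11) are the affected ones.
#if defined(_MSC_VER) && _MSC_VER >= 1925 && _MSC_VER <= 1929 && \
    defined(_WIN32) && !defined(_WIN64)
constexpr bool kDisableAvx2 = true;
#else
constexpr bool kDisableAvx2 = false;
#endif

// Apply the workaround after detection: the CPU may support AVX2, but the
// AVX2 code path is not selected when the build is known to be affected.
SimdFlags ApplyCompilerWorkaround(SimdFlags detected,
                                  bool disable_avx2 = kDisableAvx2) {
  if (disable_avx2) {
    detected.avx2_available = false;
  }
  return detected;
}
```

Only the AVX2 flag is cleared, so other detected instruction sets such as SSE4.1 remain usable, which would be consistent with the smaller slowdown reported for this approach.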
On version 16.11.11 the problem still recurs. I checked it.
Thank you for checking that. So my preferred workaround is still using VS 2019 with platform toolset v141 (which belongs to VS 2017) - you need no code patch then.
Hi @BJungmann,
1). Thanks for the suggestion to use the VS 2017 toolset. I made a sample build with version 15.9.45 Community - see below:
2). I ran performance tests. Compared with disabling optimization in intsimdmatrixavx2.cpp, the difference is large and in favor of VS 2017 - see the test below for a 7-page TIFF. I haven't checked your patch, but I think its run time will also be above that of the VS 2017 build. The OCR results are identical.
3). IMO, it seems that any change made via #pragma will increase the OCR execution time... I saw that you have already reported this problem to the Microsoft team, but it has the status "Closed - Not Enough Info" - https://developercommunity2.visualstudio.com/t/1336629. Will you report this issue to Microsoft again? I was wondering whether to do it myself, but maybe you already have it in your plans?
Indeed, execution time with the avx2_available_ patch is increased, but considerably less than with all optimizations turned off. This is the reason why I still recommend platform toolset v141. I have no current plans to start a new effort with Microsoft. They like very short demonstration code for bug reports. A short main program and data set that shows wrong results would be feasible. But I do not understand enough details of the Tesseract code using the MatrixDotVector functions to see which call to which function with which data produces different results when executed on AVX2 hardware.
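One way to approach such a short demonstration is a harness that compares a candidate routine against a plain scalar reference on repeatable data. The sketch below is only an outline under that assumption: ReferenceDot, MakeData and CountMismatches are hypothetical names, and the candidate would have to be replaced by the real AVX2 routine from Tesseract, which is not reproduced here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using DotFn = std::int32_t (*)(const std::int8_t *, const std::int8_t *,
                               std::size_t);

// Plain scalar reference that any optimized routine should agree with.
std::int32_t ReferenceDot(const std::int8_t *u, const std::int8_t *v,
                          std::size_t n) {
  std::int32_t total = 0;
  for (std::size_t i = 0; i < n; ++i) {
    total += static_cast<std::int32_t>(u[i]) * static_cast<std::int32_t>(v[i]);
  }
  return total;
}

// Repeatable pseudo-random int8 data from a simple linear congruential
// generator, so a failing case can be reproduced exactly.
std::vector<std::int8_t> MakeData(std::size_t n, std::uint32_t seed) {
  std::vector<std::int8_t> data(n);
  for (auto &x : data) {
    seed = seed * 1664525u + 1013904223u;
    x = static_cast<std::int8_t>(static_cast<int>((seed >> 24) & 0xFFu) - 128);
  }
  return data;
}

// Count the vector lengths 1..max_len for which candidate and reference differ.
std::size_t CountMismatches(DotFn candidate, std::size_t max_len) {
  std::size_t mismatches = 0;
  for (std::size_t n = 1; n <= max_len; ++n) {
    const auto u = MakeData(n, 1u + static_cast<std::uint32_t>(n));
    const auto v = MakeData(n, 1000u + static_cast<std::uint32_t>(n));
    if (candidate(u.data(), v.data(), n) != ReferenceDot(u.data(), v.data(), n)) {
      ++mismatches;
    }
  }
  return mismatches;
}
```

A non-zero count for the optimized routine in an /O2 build, together with a zero count when optimization is disabled, would be the kind of compact evidence a compiler bug report usually needs.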
@stweil, can you push a workaround for this issue?
Something like #3778?
Yes :-)
Environment
Current Behavior:
I have the following problem:
Expected Behavior:
I expect Tesseract 5.1.0 to recognize characters correctly, i.e. not to convert "l", "m" to "j" or "i" to "j", for example, in the tessdata_fast mode. I would like character recognition to work similarly to Tesseract 4.1.1.
Suggested Fix:
Consideration of an upgrade for deu.traineddata models on the website: https://github.com/tesseract-ocr/tessdata_fast