Closed jwnsu closed 4 years ago
Set the locale to "C".
Thx. Are there any side effects of forcing the locale to "C"?
Sure. Setting the locale has lots of side effects. My default locale for Python is de_DE.UTF-8, so the default can be different. You have to find out whether "C" works with your Python code or whether you must restore the original locale after calling the Tesseract API.
Tesseract currently requires "C" locale because otherwise some functions can give bad results or fail.
Thanks. I have added info to a new wiki page https://github.com/tesseract-ocr/tesseract/wiki/4.0x-Common-Errors-and-Resolutions
This is going to cause huge problems for people who are running Tesseract as a library. Setting locale="C" will probably cause various unwanted side-effects throughout the application.
Setting/resetting locale for the duration of Tesseract API calls is also problematic in multithreaded applications, for example.
Instead of requiring locale="C", I suggest changing Tesseract to use something other than sscanf(), so that strings are parsed in a locale-independent way.
@laurikari, I agree. As soon as all *scanf code is replaced by code which does not depend on the locale, the assertions can be removed. We just had to make sure now that people don't get wrong results without any notice.
Even for C/C++ I usually call
setlocale(LC_CTYPE, "");
as the first thing in main, which sets the locale to the value specified in the environment.
Depending on the "C" locale seems quite bad to me.
Related issues which are the reason why we currently enforce the "C" locale: #1250, #1532.
Here is a (potentially incomplete) list of function calls which have to be replaced to get a Tesseract library which does not depend on the locale: atoi, isspace, strtod, strtof, strtol, sscanf.
2018-10-08: printf, fprintf and other *printf functions need fixes for formatting of float and double values.
So currently there's no other solution except setting LC_ALL=C?
I am using a pattern like the one below to temporarily set the locale to C while initiating the engine:
char *old_ctype = strdup(setlocale(LC_ALL, NULL));
setlocale(LC_ALL, "C");
tesseract::TessBaseAPI api;
api.InitForAnalysePage();
setlocale(LC_ALL, old_ctype);
free(old_ctype);
Is this correct or does it only bypass the assertion?
It avoids the assertion, but the problem which was the reason why this assertion was added remains, so users risk getting wrong results or crashes later.
Oh, that's not good. In my experiments it seemed to solve the problems in https://github.com/tesseract-ocr/tesseract/issues/1532 and I was able to OCR Japanese/Korean text, which I could not do before. I was hopeful that the locale-sensitive operations were done during init.
In my case, tesseract is called by the user via language bindings, so I cannot permanently change the locale of the process. The only solution is to temporarily set the locale in the bindings when calling the tesseract API.
Our full bindings are pretty minimal. Where else do we need to temporarily set the locale to C? The OCR happens here:
api->ClearAdaptiveClassifier();
api->SetImage(image);
if(api->GetSourceYResolution() < 70)
  api->SetSourceResolution(300);
char *outText = HOCR ? api->GetHOCRText(0) : api->GetUTF8Text();
pixDestroy(&image);
api->Clear();
What about adding Jeroen's code to the Tesseract API init?
That would not be a safe solution. See my previous answer.
IMO, the right solution is here: https://github.com/tesseract-ocr/tesseract/issues/1670#issuecomment-412089053
Here is a (potentially incomplete) list of function calls which have to be replaced to get a Tesseract library which does not depend on the locale: atoi, isspace, strtod, strtof, strtol, sscanf.
isspace is addressed by pull request #1965.
I currently don't know a locale which influences atoi or strtol.
strtof, strtod cannot read values with a decimal point when the locale uses a decimal comma (like locale de_DE.UTF-8).
sscanf and other *scanf functions can have problems with bytes wrongly interpreted as space (see https://github.com/tesseract-ocr/tesseract/issues/1250#issuecomment-354415251) and can also misinterpret float or double values with a decimal point.
*printf functions write float and double using the decimal separator defined by the locale, but Tesseract always expects a decimal point.
So those last three groups have to be fixed / replaced before we can remove the assertion.
Instead of strtof and strtod, strtof_l and strtod_l (or _strtof_l and _strtod_l for Windows) can be used.
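To see which decimal separator the C library would use under a given locale, one can query it from Python. This is only an illustrative sketch: decimal_point_for is a name made up here, and it assumes the locale in question (for example de_DE.UTF-8) is installed on the system; if it is not, the function returns None.

```python
import locale


def decimal_point_for(name):
    """Return the LC_NUMERIC decimal separator for locale `name`,
    or None if that locale is not available on this system."""
    # Query the current LC_NUMERIC setting so it can be restored afterwards.
    saved = locale.setlocale(locale.LC_NUMERIC)
    try:
        locale.setlocale(locale.LC_NUMERIC, name)
        return locale.localeconv()["decimal_point"]
    except locale.Error:
        return None
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)


if __name__ == "__main__":
    for name in ("C", "de_DE.UTF-8"):
        print(name, repr(decimal_point_for(name)))
```

When the separator reported for the active locale is not ".", locale-dependent C functions such as strtod and printf in the same process follow that separator.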
My current workaround for this looks like this:
import locale
from contextlib import contextmanager

@contextmanager
def c_locale(reset_to="C.UTF-8"):
    # Switch LC_CTYPE to "C" for the duration of the with-block.
    locale.setlocale(locale.LC_CTYPE, "C")
    try:
        yield
    finally:
        # Reset the locale even if the block raised an exception.
        locale.setlocale(locale.LC_CTYPE, reset_to)

with c_locale():
    from tesserocr import PyTessBaseAPI
    with PyTessBaseAPI() as api:
        api.Init(lang="deu")
        api.SetImage(box_image)
        ocr_result = api.GetUTF8Text()
        print(ocr_result)
Has anyone an idea how to set the C locale for a JNA library when calling it from Java ?
I tried to set Locale.setDefaultLocale(Locale.ROOT), but this didn't help.
We are using tess4j, a JNA wrapper to use tesseract from Java, and using tesseract4 does not work because of the assertion.
And by the way, this also wouldn't work in a web environment, because this setting is VM-wide, so it would affect everything else that is happening in parallel as well.
I found a way to set the locale to "C" from Java (using JNA). See here for a discussion: https://github.com/nguyenq/tess4j/issues/106#issuecomment-437361950
It works, but I am not sure about any side effects of this.
Pull request #2420 replaces strtof and strtod, which fixes more dependencies on the locale settings. The critical sscanf calls were already replaced by earlier commits.
I think we can now consider removing the assertion as soon as we have tested that the issues #1250 and #1532 are still fixed.
C++17 has to_chars() and from_chars().
https://en.cppreference.com/w/cpp/utility/to_chars https://en.cppreference.com/w/cpp/utility/from_chars
Compiler support is currently partial.
After typing export LC_ALL=C in the terminal, run the Python code in the same terminal. Running the Python code in a different terminal window won't work. If using an IDE, open the IDE from the terminal where export LC_ALL=C was entered.
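When the OCR step runs as a separate process, the same effect can be had without exporting anything in the shell, by passing LC_ALL=C only to the child. A minimal sketch, assuming the work is started through subprocess; run_with_c_locale is a name made up for this example, and the command shown just echoes the variable back.

```python
import os
import subprocess
import sys


def run_with_c_locale(cmd):
    """Run `cmd` with LC_ALL=C in the child's environment only."""
    env = dict(os.environ)  # copy, so the parent's environment stays untouched
    env["LC_ALL"] = "C"
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


if __name__ == "__main__":
    # Stand-in for the real script: the child prints the LC_ALL value it received.
    child = [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
    print(run_with_c_locale(child).stdout.strip())
```

This does not help when Tesseract is loaded as a library inside the current process, since the parent's locale is left as it was.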
Tesseract 4.1 and 5.0 no longer depend on the locale settings.
Thanks. My bindings need to compile with any current version of Tesseract. So to summarize, I only need to set the locale for Tesseract 4.x < 4.1, and not for 3.x and also not for 4.1+?
That's right.
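The rule confirmed above can be written down as a small check. This is a sketch only: needs_c_locale is a hypothetical helper, and it assumes the bindings can obtain a version string that starts with "major.minor" (for example from TessBaseAPI::Version()).

```python
import re


def needs_c_locale(version):
    """Return True if this Tesseract version is a 4.0.x release,
    i.e. 4.x older than 4.1, where the locale has to be "C"."""
    match = re.match(r"\s*(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # 3.x does not need the switch and 4.1+ no longer depends on the locale.
    return (major, minor) == (4, 0)


if __name__ == "__main__":
    for v in ("3.05.02", "4.0.0-beta.1", "4.1.1", "5.0.0"):
        print(v, needs_c_locale(v))
```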
Can we close this issue?
There was no recent activity and I think everything was answered, so I am closing it now.
Workaround for python users
import locale
locale.setlocale(locale.LC_CTYPE, 'C') # set locale to C
import tesserocr
locale.setlocale(locale.LC_CTYPE, '') # set locale back
@wd: which tesseract version are you using? AFAIK this problem is solved in recent tesseract version.
@zdenop I know it's solved in 4.1. But the Python Docker image uses Debian stable (buster) as the base image, which only includes Tesseract 4.0.
Use add-apt-repository -y ppa:alex-p/tesseract-ocr before installing Tesseract in your Dockerfile to get a newer release.
@stweil I'm using Debian. First I tried to add the PPA with add-apt-repository and ran apt update, but got a 404 error.
Ign:3 http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu focal InRelease
Err:6 http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu focal Release
404 Not Found [IP: 91.189.95.83 80]
I checked the source list file.
deb http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu focal main
I also checked the URL http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu/dists/; there isn't a dist named focal.
After doing some research, I noticed that Ubuntu 18.10 (Cosmic) has the same version of libc6 (2.28) as Debian buster. But I think there isn't a binary version for Cosmic; http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu/dists/cosmic/main/binary-amd64/Packages.gz is an empty file.
Is there anything I missed? I still can't install tesseract-ocr 4.1 on Debian buster.
ping @AlexanderP
buster:
deb https://notesalexp.org/tesseract-ocr/buster/ buster main
cosmic:
deb https://notesalexp.org/tesseract-ocr/cosmic/ cosmic main
Fetch and install the GnuPG key
sudo apt-get update -oAcquire::AllowInsecureRepositories=true
sudo apt-get install notesalexp-keyring -oAcquire::AllowInsecureRepositories=true
sudo apt-get update
Thanks, I have finally upgraded tesseract-ocr to 4.1. I also added more notes to the wiki for users who want to install 4.1 on stable and other versions. Previously it was just a link; I didn't realize it has instructions on how to install it on Debian stable.
Ubuntu 16.04, default locale is "en_US.UTF-8". I invoke the tesseract library via cffi. It now fails with the following error:
!strcmp(locale, "C"):Error:Assert failed:in file baseapi.cpp, line 192
It worked fine before the baseapi.cpp locale assertion was introduced in commit 3292484f67af8bdda23aa5e510918d0115785291 on 06/07/18.
Any suggestions to get around this issue? Thx.
A C or C++ program seems to start with the default locale "C"; however, that's not the case for Python, where the default is "en_US.UTF-8".