tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

recent change setlocale in baseapi.c causes Python loaded tesseract library to fail #1670

Closed jwnsu closed 4 years ago

jwnsu commented 6 years ago

Ubuntu 16.04, default locale is "en_US.UTF-8". I invoke the Tesseract library via cffi. It now fails with the following error: !strcmp(locale, "C"):Error:Assert failed:in file baseapi.cpp, line 192

It worked fine before the baseapi.cpp locale assertion was introduced in commit 3292484f67af8bdda23aa5e510918d0115785291 on 06/07/18.

Any suggestion to get around this issue? Thx.

A C or C++ program seems to start with the default locale "C"; however, that is not the case for Python, where the default is "en_US.UTF-8".

Shreeshrii commented 6 years ago

Set the locale to "C".


jwnsu commented 6 years ago

Thx. Are there any side effects from forcing the locale to "C"?

stweil commented 6 years ago

Sure. Setting the locale has lots of side effects. My default locale for Python is de_DE.UTF-8, so the default can be different. You have to find out whether "C" works with your Python code or whether you must restore the original locale after calling the Tesseract API.

Tesseract currently requires "C" locale because otherwise some functions can give bad results or fail.

Shreeshrii commented 6 years ago

Thanks. I have added info to a new wiki page https://github.com/tesseract-ocr/tesseract/wiki/4.0x-Common-Errors-and-Resolutions


laurikari commented 6 years ago

This is going to cause huge problems for people who are running Tesseract as a library. Setting locale="C" will probably cause various unwanted side-effects throughout the application.

Setting/resetting locale for the duration of Tesseract API calls is also problematic in multithreaded applications, for example.

Instead of requiring locale="C", I suggest changing Tesseract to use something other than sscanf() so that strings are parsed in a locale-independent way.
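The multithreading concern above can be sketched from the Python side. This is a minimal sketch, not something proposed in this thread: `c_locale_locked` and `_locale_lock` are hypothetical names, and the lock only serializes code that goes through this helper. Any other thread that reads the locale while the block runs still sees "C", which is exactly the problem described.

```python
import locale
import threading
from contextlib import contextmanager

# Hypothetical module-level lock; it only protects code that also uses it.
_locale_lock = threading.Lock()


@contextmanager
def c_locale_locked():
    """Switch LC_ALL to "C" for the duration of the block, then restore it."""
    with _locale_lock:
        saved = locale.setlocale(locale.LC_ALL)  # query without changing
        locale.setlocale(locale.LC_ALL, "C")
        try:
            yield
        finally:
            # Restore the previous setting even if the block raises.
            locale.setlocale(locale.LC_ALL, saved)


with c_locale_locked():
    # Calls into the Tesseract binding would go here.
    print(locale.setlocale(locale.LC_ALL))
```

The locale is process-wide, so the lock reduces races between callers of this helper but does not make the switch invisible to the rest of the application.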

stweil commented 6 years ago

@laurikari, I agree. As soon as all *scanf code is replaced by code which does not depend on the locale, the assertions can be removed. We just had to make sure now that people don't get wrong results without any notice.

troplin commented 6 years ago

Even for C/C++ I usually call

setlocale(LC_CTYPE, "");

as the first thing in main, which sets the locale to the value specified in the environment.

Depending on "C" locale seems quite bad to me.

stweil commented 6 years ago

Related issues which are the reason why we currently enforce the "C" locale: #1250, #1532.

stweil commented 6 years ago

Here is a (potentially incomplete) list of function calls which have to be replaced to get a Tesseract library which does not depend on the locale: atoi, isspace, strtod, strtof, strtol, sscanf.

2018-10-08: printf, fprintf and other *printf functions need fixes for the formatting of float and double values.

zindarod commented 6 years ago

So currently there's no other solution except setting: LC_ALL=C?

jeroen commented 6 years ago

I am using a pattern like the one below to temporarily set the locale to C while initializing the engine:

char *old_ctype = strdup(setlocale(LC_ALL, NULL));
setlocale(LC_ALL, "C");
tesseract::TessBaseAPI api;
api.InitForAnalysePage();
setlocale(LC_ALL, old_ctype);
free(old_ctype);

Is this correct or does it only bypass the assertion?

stweil commented 6 years ago

It avoids the assertion, but the problem which was the reason why this assertion was added remains, so users risk getting wrong results or crashes later.

jeroen commented 6 years ago

Oh, that's not good. In my experiments it seemed to solve the problems in https://github.com/tesseract-ocr/tesseract/issues/1532 and I was able to OCR Japanese/Korean text, which I could not do before. I was hopeful that the locale-sensitive operations were done during init.

In my case, Tesseract is called by the user via language bindings, so I cannot permanently change the locale of the process. The only solution is to temporarily set the locale in the bindings when calling the Tesseract API.

Our full bindings are pretty minimal. Where else do we need to temporarily set the locale to C? The OCR happens here:

  api->ClearAdaptiveClassifier();
  api->SetImage(image);
  if(api->GetSourceYResolution() < 70)
    api->SetSourceResolution(300);
  char *outText = HOCR ? api->GetHOCRText(0) : api->GetUTF8Text();
  pixDestroy(&image);
  api->Clear();

amitdo commented 6 years ago

@stweil

For POSIX: https://stackoverflow.com/a/13919957

For Windows: ~https://docs.microsoft.com/en-us/windows/desktop/api/winnls/nf-winnls-setthreadlocale~ https://docs.microsoft.com/en-us/cpp/parallel/multithreading-and-locales

zdenop commented 6 years ago

What about adding jeroen's code to the Tesseract API init?

stweil commented 6 years ago

That would not be a safe solution. See my previous answer.

amitdo commented 6 years ago

IMO, the right solution is here: https://github.com/tesseract-ocr/tesseract/issues/1670#issuecomment-412089053

stweil commented 6 years ago

Here is a (potentially incomplete) list of function calls which have to be replaced to get a Tesseract library which does not depend on the locale: atoi, isspace, strtod, strtof, strtol, sscanf.

isspace is addressed by pull request #1965.

I currently don't know a locale which influences atoi or strtol.

strtof, strtod cannot read values with a decimal point when the locale uses a decimal comma (like locale de_DE.UTF-8).

sscanf and other *scanf functions can have problems with bytes wrongly interpreted as space (see https://github.com/tesseract-ocr/tesseract/issues/1250#issuecomment-354415251) and can also misinterpret float or double values with a decimal point.

*printf functions write float and double using the decimal separator defined by the locale, but Tesseract always expects a decimal point.

So those last three groups have to be fixed / replaced before we can remove the assertion.

Instead of strtof and strtod, strtof_l and strtod_l (or _strtof_l and _strtod_l for Windows) can be used.
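The decimal-comma behaviour of strtod described above can be reproduced from Python by calling the C library directly. This is a minimal sketch under stated assumptions: it assumes a POSIX system where `ctypes.CDLL(None)` exposes the C library, and `c_strtod` and `strtod_under` are hypothetical helper names. If the de_DE.UTF-8 locale is not installed, the helper returns None instead of a number.

```python
import ctypes
import locale

# Assumption: POSIX system where CDLL(None) gives access to the C library.
libc = ctypes.CDLL(None)
libc.strtod.restype = ctypes.c_double
libc.strtod.argtypes = [ctypes.c_char_p, ctypes.c_void_p]


def c_strtod(text: bytes) -> float:
    """Parse text with the C library's strtod under the current LC_NUMERIC."""
    return libc.strtod(text, None)


def strtod_under(locale_name: str, text: bytes):
    """Return strtod's result under locale_name, or None if it is unavailable."""
    saved = locale.setlocale(locale.LC_NUMERIC)  # query without changing
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
    except locale.Error:
        return None  # locale not installed on this system
    try:
        return c_strtod(text)
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)


print(strtod_under("C", b"3.14"))
print(strtod_under("de_DE.UTF-8", b"3.14"))
```

Where de_DE.UTF-8 is available, the second call is expected to stop parsing at the decimal point and return 3.0, which is the kind of silent wrong result the assertion was meant to prevent.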

ephes commented 6 years ago

My current workaround for this looks like this:

import locale
from contextlib import contextmanager

@contextmanager
def c_locale(reset_to="C.UTF-8"):
    # Switch LC_CTYPE to "C" while Tesseract runs.
    locale.setlocale(locale.LC_CTYPE, "C")
    try:
        yield
    finally:
        # Always switch back, even if the OCR call raises.
        locale.setlocale(locale.LC_CTYPE, reset_to)

with c_locale():
    from tesserocr import PyTessBaseAPI
    with PyTessBaseAPI() as api:
        api.Init(lang="deu")
        api.SetImage(box_image)  # box_image: image loaded elsewhere
        ocr_result = api.GetUTF8Text()
        print(ocr_result)

martin-huber commented 6 years ago

Does anyone have an idea how to set the C locale for a JNA library when calling it from Java? I tried to set Locale.setDefaultLocale(Locale.ROOT), but this didn't help. We are using tess4j, a JNA wrapper to use Tesseract from Java, and using Tesseract 4 does not work because of the assertion.

martin-huber commented 6 years ago

I tried to set Locale.setDefaultLocale(Locale.ROOT), but this didn't help.

And by the way, this also wouldn't work in a web environment, because this setting is VM-wide, so it would affect everything else that is happening in parallel as well.

martin-huber commented 6 years ago

I found a way to set the locale to "C" from Java (using JNA). See here for a discussion: https://github.com/nguyenq/tess4j/issues/106#issuecomment-437361950

It works, but I am not sure about any side effects of this.

stweil commented 5 years ago

Pull request #2420 replaces strtof and strtod which fixes more dependencies on the locale settings. The critical sscanf calls were already replaced by earlier commits.

I think we can now consider removing the assertion as soon as we have tested that the issues #1250 and #1532 are still fixed.

amitdo commented 5 years ago

C++17 has to_chars() and from_chars().

https://en.cppreference.com/w/cpp/utility/to_chars https://en.cppreference.com/w/cpp/utility/from_chars

Compiler support is currently partial.

agnelvishal commented 5 years ago

After typing export LC_ALL=C in the terminal, run the Python code in the same terminal. Running the Python code in a different terminal window won't work. If using an IDE, open the IDE from the terminal where export LC_ALL=C was entered.
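The reason the same terminal matters is that environment variables are inherited only by child processes. A minimal sketch of that, assuming a POSIX system: the parent starts a child Python process with LC_ALL=C set in the child's environment only, and the child reports the locale it ends up with. The names `child_code` and `child_locale` are illustrative.

```python
import os
import subprocess
import sys

# Child process: adopt the locale from its environment and report the result.
child_code = "import locale; print(locale.setlocale(locale.LC_ALL, ''))"

# Copy the current environment and override LC_ALL for the child only.
env = dict(os.environ, LC_ALL="C")

result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
child_locale = result.stdout.strip()
print(child_locale)
```

The parent's own locale is untouched, so this is also one way to run only the OCR step in a "C" locale process without changing the rest of the application.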

stweil commented 5 years ago

Tesseract 4.1 and 5.0 no longer depend on the locale settings.

jeroen commented 5 years ago

Thanks. My bindings need to compile with any current version of Tesseract. So to summarize, I only need to set the locale for Tesseract 4.x < 4.1, and not for 3.x and also not for 4.1+?

stweil commented 5 years ago

That's right.
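The rule confirmed above can be written as a small check. This is a sketch under stated assumptions: `needs_c_locale` is a hypothetical helper, the version strings below are illustrative inputs, and how a binding obtains the version string is left open.

```python
import re


def needs_c_locale(version: str) -> bool:
    """Return True only for the 4.0.x series (4.x before 4.1)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # 3.x and 4.1+ (including 5.x) do not need the "C" locale workaround.
    return (major, minor) == (4, 0)


for example in ("3.05.02", "4.0.0-beta.1", "4.1.0", "5.0.0-alpha"):
    print(example, needs_c_locale(example))
```

Only the leading major and minor numbers are inspected, so suffixes such as "-beta.1" do not affect the result.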

amitdo commented 4 years ago

Can we close this issue?

stweil commented 4 years ago

There was no recent activity and I think everything was answered, so I close it now.

wd commented 4 years ago

Workaround for python users

import locale
locale.setlocale(locale.LC_CTYPE, 'C')  # set locale to C
import tesserocr
locale.setlocale(locale.LC_CTYPE, '')  # set locale back

zdenop commented 4 years ago

@wd: which Tesseract version are you using? AFAIK this problem is solved in recent Tesseract versions.

wd commented 4 years ago

@zdenop I know it's solved in 4.1. But the Python Docker image uses Debian stable (buster) as the base image, which only includes Tesseract 4.0.

stweil commented 4 years ago

Use add-apt-repository -y ppa:alex-p/tesseract-ocr before installing Tesseract in your Dockerfile to get a newer release.

wd commented 4 years ago

@stweil I'm using Debian. First I tried to add the PPA with the add-apt-repository command and ran apt update, but got a 404 error.

Ign:3 http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu focal InRelease
Err:6 http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu focal Release
  404  Not Found [IP: 91.189.95.83 80]

I checked the source list file.

deb http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu focal main

I also checked the URL http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu/dists/, and there isn't a dist named focal.

After doing some research, I noticed that Ubuntu 18.10 (Cosmic) has the same version of libc6 (2.28) as Debian buster. But I think there isn't a binary version for Cosmic: http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu/dists/cosmic/main/binary-amd64/Packages.gz is an empty file.

Is there anything I missed? I still can't install tesseract-ocr 4.1 on Debian buster.

Shreeshrii commented 4 years ago

ping @AlexanderP

AlexanderP commented 4 years ago

buster: deb https://notesalexp.org/tesseract-ocr/buster/ buster main
cosmic: deb https://notesalexp.org/tesseract-ocr/cosmic/ cosmic main

Fetch and install the GnuPG key

sudo apt-get update -oAcquire::AllowInsecureRepositories=true
sudo apt-get install notesalexp-keyring -oAcquire::AllowInsecureRepositories=true
sudo apt-get update

amitdo commented 4 years ago

https://github.com/tesseract-ocr/tesseract/wiki#tesseract-4-packages-with-lstm-engine-and-related-traineddata

wd commented 4 years ago

Thanks, I have finally upgraded tesseract-ocr to 4.1. I also added more notes to the wiki for users who want to install 4.1 on stable and other versions. Previously it was just a link, and I didn't realize it has instructions on how to install it on Debian stable.