tesseract failed loading non-english language.traineddata

gbolin commented 6 years ago

Environment

Tesseract Version: <tesseract 4.00.00dev-690-g1b0379c2>
Platform: <mac OS 64bit, High Sierra, 10.13.1>

Current Behavior:

My situation is a little complicated. I built Tesseract into a library for other applications to call. Loading chi_sim in "api->init" fails, but ONLY in the IDE (PyCharm) environment. After debugging, I traced it to the function "load_via_fgets" in the file "tesseract/ccutil/unicharset.cpp": from row 825, sscanf returns 1/1/1/1/1/1/1 rather than 17/16/10/8/4/3/2, so the function returns 'false' to "bool UNICHARSET::load_from_file(tesseract::TFile *file, bool skip_fragments)" in row 781. ATTENTION: this does not happen on the terminal command line, only in the IDE. I also found the same problem reported for tess4j, link:. Looking forward to hearing from you, thanks so much.

Expected Behavior:

Suggested Fix:

gbolin commented 6 years ago

If I load 'eng.traineddata', it works fine. Loading a self-trained data file also works fine.

amitdo commented 6 years ago

Seems like an issue related to locale settings. https://www.google.co.il/search?q=osx+%22pycharm%22+%22locale%22

gbolin commented 6 years ago

@amitdo Hi, I tried that, but it does not seem to work. Any other ideas?

amitdo commented 6 years ago

any other ideas?

No. Try the forum.

ITCoolie commented 6 years ago

Hi all,

I want Tesseract to write its temporary image files to local disk, so that I can see the result of each step. I compiled Tesseract with "--enable-debug", but after recognizing the image I cannot find the temporary image files. Has anyone met a similar problem? Thanks.

gbolin commented 6 years ago

@amitdo I finally found the reason: it is related to the locale. Here is the explanation.

Running "combine_tessdata -U chi_sim.traineddata ./chi_sim." generates a file named "chi_sim.unicharset". This file is the key reason why non-English traineddata files sometimes cannot be loaded. The function "bool UNICHARSET::load_via_fgets" in "tesseract/ccutil/unicharset.cpp:789" reads that unicharset file row by row. The failure occurs at this check:

(v = sscanf(buffer, "%s %x %d,%d,%d,%d,%g,%g,%g,%g,%g,%g %63s %d %d %d %63s", unichar, &properties, &min_bottom, &max_bottom, &min_top, &max_top, &width, &width_sd, &bearing, &bearing_sd, &advance, &advance_sd, script, &other_case, &direction, &mirror, normed)) != 17

Let's say buffer is "格 1 63,69,255,255,192,220,0,9,205,233 Han 7 0 7 格 # 格 [683c ]x". sscanf calls isspace, and the UTF-8 encoding of "格" is 0xE6 0xA0 0xBC. The 0xA0 byte is recognized as a space, so parsing of the buffer is interrupted there. That is the key reason!

Tesseract calls std::locale to get the default locale setting, and in unicharset.cpp that setting makes sscanf fail. This affects not only Chinese but other languages too: under a UTF-8-based locale, a character whose encoding contains a special byte value such as 0xA0 or 0x85 will fail, especially on a non-English operating system.

How to solve:
1: Change the system language to English. Maybe not a good idea, but it works for me.
2: Change the unicharset.cpp source code. I tried this on my own macOS, like this:

..... from row 823
char normed[64];
int v = -1;
// ---- added code: force the "C" locale before parsing (needs <locale>) ----
std::locale lc("C");
std::locale::global(lc);
// ---- end of added code ----
if (fgets_cb->Run(buffer, sizeof (buffer)) == NULL ||
.....continue

@amitdo thanks a lot for your reading. regards, GS.
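
To make the byte-level explanation above concrete, here is a minimal Python sketch (Python because the failure surfaced under PyCharm). It only inspects UTF-8 bytes; it does not call Tesseract or sscanf, and whether isspace() really accepts these bytes depends on the platform and the active LC_CTYPE. The helper name risky_bytes and the set RISKY are illustrative assumptions, limited to the two byte values named above.

```python
# Illustrative only: list the UTF-8 bytes of a string that fall on the
# values named in the explanation above (0x85 and 0xA0). This does not
# reproduce sscanf; it just shows where such bytes sit in the input.
RISKY = {0x85, 0xA0}  # assumption: only the two bytes mentioned in the thread


def risky_bytes(text):
    """Return (offset, byte) pairs of the UTF-8 encoding that are in RISKY."""
    return [(i, b) for i, b in enumerate(text.encode("utf-8")) if b in RISKY]


if __name__ == "__main__":
    line = "格 1 63,69,255,255,192,220,0,9,205,233 Han 7 0 7 格"
    print("格".encode("utf-8").hex(" "))  # bytes of the first field
    print(risky_bytes(line))              # offsets a byte-wise scanner may trip on
```

Running risky_bytes over a whole .unicharset file would be one way to see which entries could be affected, under the same assumption about the byte set.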

amitdo commented 6 years ago

@stweil, your thoughts on the suggested change?

amitdo commented 6 years ago

@ITCoolie, The right place to ask general questions is the forum.

stweil commented 6 years ago

@GitHubGS, which locale did you use when it failed?

gbolin commented 6 years ago

@amitdo Sorry for the late reply. The following two pictures show the output of the 'env' command in the terminal. In the first one the OS language is set to Chinese; in the second it is set to English.

It seems that the locale value only shows up in the English environment. I hope this is helpful to you.

wxs commented 6 years ago

Hey all, if you're just coming across this issue, I solved it by setting the locale in Python thus at the top of my script:

import locale
locale.setlocale(locale.LC_ALL, "C")

jeroen commented 6 years ago

I confirm that I also ran into this problem with the R bindings. All is fine for most languages; however, Asian languages like jpn and kor would not load with en_US.UTF-8.

A workaround is to set Sys.setlocale('LC_CTYPE', 'C') and then it works. However it is unclear to me if I can set it back to en_US.UTF-8 afterwards.

This is 3.05.01 by the way.
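
On the question of setting the locale back afterwards: in Python, which was used for the earlier workaround in this thread, the standard locale module can save and restore the setting. The sketch below is one possible pattern, not something taken from Tesseract or its bindings; the name c_locale is hypothetical. Note that setlocale changes the locale for the whole process, so this is only safe when no other thread depends on the locale in the meantime.

```python
import contextlib
import locale


@contextlib.contextmanager
def c_locale(category=locale.LC_ALL):
    """Temporarily switch `category` to the "C" locale, then restore it."""
    saved = locale.setlocale(category)     # query only: returns current setting
    locale.setlocale(category, "C")
    try:
        yield
    finally:
        locale.setlocale(category, saved)  # put the previous setting back


if __name__ == "__main__":
    before = locale.setlocale(locale.LC_CTYPE)
    with c_locale():
        # Hypothetical placement: call into the Tesseract binding here.
        print(locale.setlocale(locale.LC_CTYPE))
    print(locale.setlocale(locale.LC_CTYPE) == before)
```

The try/finally ensures the previous locale is restored even if the wrapped call raises.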

delonzhou commented 6 years ago

I have the same issue in IDEA with the following environment:

macOS 10.13.4
tesseract 4.00.00alpha
 leptonica-1.76.0
  libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
 Found AVX2
 Found AVX
 Found SSE

My workaround is to add the environment variable LC_CTYPE=C in IDEA; it works.
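
The same environment-variable workaround can be applied when a script launches the Tesseract-using program as a child process. The following Python sketch shows the idea under stated assumptions: env_with_c_ctype is a hypothetical helper, and removing LC_ALL is an addition of this sketch (LC_ALL, when set, takes precedence over LC_CTYPE), not part of the workaround reported above.

```python
import os
import subprocess
import sys


def env_with_c_ctype(base=None):
    """Copy an environment mapping and force LC_CTYPE=C for a child process."""
    env = dict(os.environ if base is None else base)
    env.pop("LC_ALL", None)  # LC_ALL, if set, would override LC_CTYPE
    env["LC_CTYPE"] = "C"
    return env


if __name__ == "__main__":
    # Stand-in child that only reports its LC_CTYPE variable; replace the
    # command with the program that actually loads Tesseract.
    child = [sys.executable, "-c", "import os; print(os.environ['LC_CTYPE'])"]
    subprocess.run(child, env=env_with_c_ctype(), check=True)
```

The helper copies the mapping instead of editing os.environ, so the parent process keeps its own locale settings.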

stweil commented 6 years ago

Pull request #1649 makes Tesseract initialization fail if the locale settings are wrong. Users who get that failure must set the "C" locale in their code.

stweil commented 6 years ago

Technically this issue was closed by enforcing the "C" locale in pull request #1649, but that causes problems requiring ugly workarounds in projects which use the Tesseract API from Python, Java or other languages which typically don't set the "C" locale.

Therefore I suggest keeping it open.

zdenop commented 6 years ago

@stweil : what about my suggestion to implement jeroen code into TessBaseAPI::Init?

iseegr8tfuldeadppl commented 5 years ago

Make sure the environment variable TESSDATA_PREFIX is set to your tessdata directory! (for ex. C:\msys64\mingw32\share\tessdata).

zdenop commented 5 years ago

@stweil : Is assert still needed for non "C" LC_ALL?

datalogics-kam commented 5 years ago

It turns out that we're having this problem as well, on macOS with 3.05.01. I'm considering a patch to the load_via_fgets code to use sscanf_l where available, which allows passing in a locale for the call rather than modifying the global locale.

Another alternative, which might make cleaner code, would be to use uselocale if it's available. That sets the locale for only the current thread, and then it can be set back to the previous locale at function exit. I might try this one first.

Thoughts welcome, and of course I'll contribute back patches.

amitdo commented 5 years ago

See #1670

stweil commented 5 years ago

Is assert still needed for non "C" LC_ALL?

The problem in load_via_fgets which was mentioned by @datalogics-kam still exists: sscanf has to be replaced by C++ stringstream like in the other places.

stweil commented 5 years ago

@datalogics-kam, which locale settings failed in your test? I'd like to reproduce your problem to see whether it is fixed by new code.

stweil commented 5 years ago

Function ReadParamDesc also still needs a replacement for sscanf.

2019-05-12: This was now done in pull request #2430. While implementing this, an unrelated bug was found and fixed, too.

datalogics-kam commented 5 years ago

@stweil I've been able to reproduce it with LC_ALL, LC_CTYPE, and LC_NUMERIC set to "en_US.UTF-8". That's what the JVM was setting. In the unit tests for our OCR wrapper, I added a fixture to recreate that locale setting.

Since we're still on 3.05.01 here, and I see that version 4 asserts that the locale must be "C", I'm going to put a fix in our code that uses the C locale when calling Tesseract and restores the locale afterwards.

Since setlocale is global, if uselocale is available, our code will use that as it is thread-specific. I also found this code to set the locale on a thread basis on Windows: https://stackoverflow.com/a/17173977

Thanks for the insights!

stweil commented 5 years ago

How to test whether Tesseract code works with your locale:

The following patch disables the assertions which check for the right locale and enables the current locale for all Tesseract code:

diff --git a/src/api/baseapi.cpp b/src/api/baseapi.cpp
index 61b38f8e..72e892b8 100644
--- a/src/api/baseapi.cpp
+++ b/src/api/baseapi.cpp
@@ -209,6 +209,9 @@ TessBaseAPI::TessBaseAPI()
       rect_height_(0),
       image_width_(0),
       image_height_(0) {
+#if 1
+  setlocale(LC_ALL, "");
+#else
   const char *locale;
   locale = std::setlocale(LC_ALL, nullptr);
   ASSERT_HOST(!strcmp(locale, "C") || !strcmp(locale, "C.UTF-8"));
@@ -216,6 +219,7 @@ TessBaseAPI::TessBaseAPI()
   ASSERT_HOST(!strcmp(locale, "C") || !strcmp(locale, "C.UTF-8"));
   locale = std::setlocale(LC_NUMERIC, nullptr);
   ASSERT_HOST(!strcmp(locale, "C") || !strcmp(locale, "C.UTF-8"));
+#endif
 }

 TessBaseAPI::~TessBaseAPI() {

With this patch, not only tesseract but also all other command line tools and the tests use the current locale. Run make check and see that several tests will fail depending on your locale.
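
For callers of an unpatched build who would rather check the locale up front than hit the assertion, a pre-flight check could look like the Python sketch below. It assumes the binding is loaded into the same process, so Python's locale module reports the settings the library would see. The names is_accepted and locale_report are hypothetical, and including LC_CTYPE is an assumption: the diff context above visibly queries only LC_ALL and LC_NUMERIC.

```python
import locale

ACCEPTED = ("C", "C.UTF-8")  # the two names the assertions in the diff allow


def is_accepted(name):
    """True if a locale name is one the diff's assertions would accept."""
    return name in ACCEPTED


def locale_report():
    """Map each checked category to (current setting, accepted?)."""
    categories = {
        "LC_ALL": locale.LC_ALL,
        "LC_CTYPE": locale.LC_CTYPE,      # assumption: not visible in the diff
        "LC_NUMERIC": locale.LC_NUMERIC,
    }
    report = {}
    for label, category in categories.items():
        current = locale.setlocale(category)  # query without changing anything
        report[label] = (current, is_accepted(current))
    return report


if __name__ == "__main__":
    for label, (current, ok) in locale_report().items():
        print(f"{label}: {current!r} {'ok' if ok else 'would assert'}")
```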

stweil commented 5 years ago

Failing test on macOS with LANG=de_DE.UTF-8:

$ unittest/apiexample_test 
Running main() from ../../../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 4 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from EuroText
[ RUN      ] EuroText.FastLatinOCR
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../../../../src/ccutil/unicharset.h, line 874

2019-05-16: Fixed in pull request #2437

stweil commented 5 years ago

Failing test on macOS with LANG=de_DE.UTF-8:

$ unittest/baseapi_test
Running main() from ../../../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 12 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 10 tests from TesseractTest
[ RUN      ] TesseractTest.ArraySizeTest
[       OK ] TesseractTest.ArraySizeTest (0 ms)
[ RUN      ] TesseractTest.BasicTesseractTest
[       OK ] TesseractTest.BasicTesseractTest (1251 ms)
[ RUN      ] TesseractTest.IteratesParagraphsEvenIfNotDetected
[       OK ] TesseractTest.IteratesParagraphsEvenIfNotDetected (347 ms)
[ RUN      ] TesseractTest.HOCRWorksWithoutSetInputName
[       OK ] TesseractTest.HOCRWorksWithoutSetInputName (403 ms)
[ RUN      ] TesseractTest.HOCRContainsBaseline
[       OK ] TesseractTest.HOCRContainsBaseline (389 ms)
[ RUN      ] TesseractTest.RickSnyderNotFuckSnyder
[       OK ] TesseractTest.RickSnyderNotFuckSnyder (346 ms)
[ RUN      ] TesseractTest.AdaptToWordStrTest
Trying to adapt "136
" to "1 3 6"
Trying to adapt "256
" to "2 5 6"
Trying to adapt "410
" to "4 1 0"
Trying to adapt "432
" to "4 3 2"
Trying to adapt "540
" to "5 4 0"
Trying to adapt "692
" to "6 9 2"
Trying to adapt "779
" to "7 7 9"
Trying to adapt "793
" to "7 9 3"
Trying to adapt "808
" to "8 0 8"
Trying to adapt "815
" to "8 1 5"
Trying to adapt "12
" to "1 2"
Trying to adapt "12
" to "1 2"
[       OK ] TesseractTest.AdaptToWordStrTest (788 ms)
[ RUN      ] TesseractTest.BasicLSTMTest
[       OK ] TesseractTest.BasicLSTMTest (4525 ms)
[ RUN      ] TesseractTest.LSTMGeometryTest
[       OK ] TesseractTest.LSTMGeometryTest (615 ms)
[ RUN      ] TesseractTest.InitConfigOnlyTest
Error: unichar ? in normproto file is not in unichar set.
Error: unichar 0.232621 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar 0.231864 in normproto file is not in unichar set.
[...]
Error: unichar ? in normproto file is not in unichar set.
Error: unichar 0.233915 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar 0.221755 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar ? in normproto file is not in unichar set.
baseapi_test(21845,0x1134c45c0) malloc: *** error for object 0x927f96c28005e0: pointer being freed was not allocated
baseapi_test(21845,0x1134c45c0) malloc: *** set a breakpoint in malloc_error_break to debug
[INFO]  Lang eng took 327ms in regular init[INFO]  Lang chi_tra took 1422ms in regular initAbort trap: 6

2019-05-18: Fixed in commit 36ed6da3499c93c2d04de29ee2f02f6d9975a1fe.
2019-05-18: malloc/free issue fixed in commit 09edd1a6048029f1578d5addaaaa065c1594a7d4.

stweil commented 5 years ago

@GitHubGS, this issue should be fixed now in branch 4.1 and in Git master. Can we close it?
