Closed gbolin closed 5 years ago
If I load 'eng.traineddata', it works fine; loading a self-trained data file also works fine.
Seems like an issue related to locale settings. https://www.google.co.il/search?q=osx+%22pycharm%22+%22locale%22
@amitdo Hi, I tried that, but it doesn't seem to work. Any other ideas?
Hi all,
I want Tesseract to write its temporary image files to the local disk so that I can see the result of each step. I compiled Tesseract with "--enable-debug", but after recognizing an image I cannot find the temporary image files. Has anyone run into a similar problem? Thanks.
@amitdo I finally found the cause: it is related to the locale. Here is the explanation.
Running "combine_tessdata -U chi_sim.traineddata ./chi_sim." generates a file named "chi_sim.unicharset" (this file is the key reason why non-eng traineddata files sometimes cannot be loaded). The function "bool UNICHARSET::load_via_fgets" in "tesseract/ccutil/unicharset.cpp:789" reads that unicharset file line by line. The problem appears when it reaches this call:
(v = sscanf(buffer, "%s %x %d,%d,%d,%d,%g,%g,%g,%g,%g,%g %63s %d %d %d %63s", unichar, &properties, &min_bottom, &max_bottom, &min_top, &max_top, &width, &width_sd, &bearing, &bearing_sd, &advance, &advance_sd, script, &other_case, &direction, &mirror, normed)) != 17
let's say buffer is "格 1 63,69,255,255,192,220,0,9,205,233 Han 7 0 7 格 # 格 [683c ]x"
sscanf calls isspace. The UTF-8 encoding of the character "格" is 0xE6 0xA0 0xBC,
and the 0xA0 byte is treated as whitespace, so the scan of the buffer is cut short at that point. That is the key reason!
Tesseract calls std::locale to get the default locale setting, and in unicharset.cpp that setting causes the sscanf call to fail.
This is not limited to Chinese. With a UTF-8-based locale, any character whose encoding contains certain byte values, such as 0xA0 or 0x85, can fail in the same way, especially on a non-English operating system.
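The byte-level claim above can be checked without Tesseract. This is a minimal Python sketch using only the standard library; the helper names utf8_bytes and has_suspect_byte are made up for illustration and are not part of Tesseract. It only inspects the encoding and does not reproduce the sscanf behaviour itself.

```python
# Bytes that the explanation above names as being misread as whitespace.
SUSPECT_BYTES = (0xA0, 0x85)


def utf8_bytes(ch):
    """Return the UTF-8 encoding of ch as a list of hex strings."""
    return [f"0x{b:02X}" for b in ch.encode("utf-8")]


def has_suspect_byte(ch, suspects=SUSPECT_BYTES):
    """True if the UTF-8 encoding of ch contains one of the suspect bytes."""
    return any(b in suspects for b in ch.encode("utf-8"))


print(utf8_bytes("格"))        # ['0xE6', '0xA0', '0xBC']
print(has_suspect_byte("格"))  # True
print(has_suspect_byte("A"))   # False
```

A helper like this could be run over the first column of a unicharset file to see which entries are at risk, assuming the file is read as UTF-8.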
How to solve:
1: Change the system language to English. Maybe not a good idea, but it works for me.
2: Change the unicharset.cpp source code. I tried this on my own macOS machine, like this:
..... from row 823
char normed[64];
int v = -1;
************Add code ************
locale lc("C");
locale::global(lc);
************************
if (fgets_cb->Run(buffer, sizeof (buffer)) == NULL ||
.....continue
@amitdo thanks a lot for your reading. regards, GS.
@stweil, your thoughts on the suggested change?
@GitHubGS, which locale did you use when it failed?
@amitdo Sorry for the late reply. After running the 'env' command in a terminal, the following two pictures show what you need. In the first one the OS language is set to Chinese, in the second one to English.
It seems that the locale value only shows up in the English environment. Hope that's helpful to you.
Hey all, if you're just coming across this issue, I solved it by setting the locale in Python thus at the top of my script:
import locale
locale.setlocale(locale.LC_ALL, "C")
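If the rest of the script depends on the original locale, the same workaround can be scoped and undone afterwards. The following is a sketch only: c_locale is a hypothetical helper name, and because setlocale changes the locale for the whole process, it is not safe when other threads rely on the locale at the same time.

```python
import contextlib
import locale


@contextlib.contextmanager
def c_locale(category=locale.LC_ALL):
    """Temporarily switch the given locale category to "C", then restore it."""
    saved = locale.setlocale(category)  # no second argument: query only
    locale.setlocale(category, "C")
    try:
        yield
    finally:
        # Restore whatever was active before entering the block.
        locale.setlocale(category, saved)
```

Usage would be to wrap only the Tesseract initialization and recognition calls in "with c_locale():" and leave the rest of the script untouched.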
I confirm that I also ran into this problem with the R bindings. All is fine for most languages, however Asian languages like jpn and kor would not load with en_US.UTF-8.
A workaround is to set Sys.setlocale('LC_CTYPE', 'C') and then it works. However it is unclear to me if I can set it back to en_US.UTF-8 afterwards.
This is 3.05.01 by the way.
I have the same issue in IDEA with the following environment:
macOS 10.13.4 tesseract 4.00.00alpha leptonica-1.76.0 libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 Found AVX2 Found AVX Found SSE
My workaround is adding an environment variable LC_CTYPE=C in IDEA; it works.
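When Tesseract is started as a separate process rather than from an IDE run configuration, the same environment-variable workaround can be applied to the child process only. This is a sketch under assumptions: c_locale_env is a hypothetical helper, and it also sets LC_ALL because in POSIX environments LC_ALL takes precedence over LC_CTYPE, so LC_CTYPE=C alone has no effect if LC_ALL is already set to something else.

```python
import os


def c_locale_env(base=None):
    """Return a copy of the environment with the locale forced to "C".

    The calling process's own environment is left unchanged.
    """
    env = dict(os.environ if base is None else base)
    env["LC_ALL"] = "C"    # overrides every LC_* category and LANG
    env["LC_CTYPE"] = "C"  # the variable used in the workaround above
    return env


# Hypothetical usage (file names are placeholders):
#   import subprocess
#   subprocess.run(["tesseract", "input.png", "out", "-l", "chi_sim"],
#                  env=c_locale_env(), check=True)
```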
Pull request #1649 makes Tesseract initialization fail if the locale settings are wrong. Users who get that failure must set the "C" locale in their code.
Technically this issue was closed by enforcing the "C" locale in pull request #1649, but that causes problems requiring ugly workarounds in projects which use the Tesseract API from Python, Java or other languages which typically don't set the "C" locale.
Therefore I suggest to keep it open.
@stweil : what about my suggestion to implement jeroen code into TessBaseAPI::Init?
Make sure the environment variable TESSDATA_PREFIX is set to your tessdata directory! (for ex. C:\msys64\mingw32\share\tessdata).
@stweil : Is assert still needed for non "C" LC_ALL?
It turns out that we're having this problem as well, on macOS with 3.05.01. I'm considering a patch to the load_via_fgets code to use sscanf_l where available, which will allow passing in a locale for the call, rather than modifying locale.
Another alternative, which might make cleaner code, would be to use uselocale if it's available. That sets the locale for only the current thread, and then it can be set back to the previous locale at function exit. I might try this one first.
Thoughts welcome, and of course I'll contribute back patches.
See #1670
Is assert still needed for non "C" LC_ALL?
The problem in load_via_fgets which was mentioned by @datalogics-kam still exists: sscanf has to be replaced by C++ stringstream like in the other places.
@datalogics-kam, which locale settings failed in your test? I'd like to reproduce your problem to see whether it is fixed by new code.
Function ReadParamDesc also still needs a replacement for sscanf.
2019-05-12: This was now done in pull request #2430. While implementing this, an unrelated bug was found and fixed, too.
@stweil I've been able to reproduce it with LC_ALL, LC_CTYPE, and LC_NUMERIC set to "en_US.UTF-8". That's what the JVM was setting. In the unit tests for our OCR wrapper, I added a fixture to recreate that locale setting.
Since we're still on 3.05.01 here, and I see that version 4 asserts that the locale must be "C", I'm going to put a fix in our code that uses the C locale when calling Tesseract and restores the locale afterwards.
Since setlocale is global, if uselocale is available, our code will use that as it is thread-specific. I also found this code to set the locale on a thread basis on Windows: https://stackoverflow.com/a/17173977
Thanks for the insights!
How to test whether Tesseract code works with your locale:
The following patch disables the assertions which check for the right locale and enables the current locale for all Tesseract code:
diff --git a/src/api/baseapi.cpp b/src/api/baseapi.cpp
index 61b38f8e..72e892b8 100644
--- a/src/api/baseapi.cpp
+++ b/src/api/baseapi.cpp
@@ -209,6 +209,9 @@ TessBaseAPI::TessBaseAPI()
rect_height_(0),
image_width_(0),
image_height_(0) {
+#if 1
+ setlocale(LC_ALL, "");
+#else
const char *locale;
locale = std::setlocale(LC_ALL, nullptr);
ASSERT_HOST(!strcmp(locale, "C") || !strcmp(locale, "C.UTF-8"));
@@ -216,6 +219,7 @@ TessBaseAPI::TessBaseAPI()
ASSERT_HOST(!strcmp(locale, "C") || !strcmp(locale, "C.UTF-8"));
locale = std::setlocale(LC_NUMERIC, nullptr);
ASSERT_HOST(!strcmp(locale, "C") || !strcmp(locale, "C.UTF-8"));
+#endif
}
TessBaseAPI::~TessBaseAPI() {
With this patch, not only tesseract but also all other command line tools and the tests use the current locale. Run make check and see that several tests will fail depending on your locale.
Failing test on macOS with LANG=de_DE.UTF-8:
$ unittest/apiexample_test
Running main() from ../../../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 4 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from EuroText
[ RUN ] EuroText.FastLatinOCR
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../../../../src/ccutil/unicharset.h, line 874
2019-05-16: Fixed in pull request #2437
Failing test on macOS with LANG=de_DE.UTF-8:
$ unittest/baseapi_test
Running main() from ../../../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 12 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 10 tests from TesseractTest
[ RUN ] TesseractTest.ArraySizeTest
[ OK ] TesseractTest.ArraySizeTest (0 ms)
[ RUN ] TesseractTest.BasicTesseractTest
[ OK ] TesseractTest.BasicTesseractTest (1251 ms)
[ RUN ] TesseractTest.IteratesParagraphsEvenIfNotDetected
[ OK ] TesseractTest.IteratesParagraphsEvenIfNotDetected (347 ms)
[ RUN ] TesseractTest.HOCRWorksWithoutSetInputName
[ OK ] TesseractTest.HOCRWorksWithoutSetInputName (403 ms)
[ RUN ] TesseractTest.HOCRContainsBaseline
[ OK ] TesseractTest.HOCRContainsBaseline (389 ms)
[ RUN ] TesseractTest.RickSnyderNotFuckSnyder
[ OK ] TesseractTest.RickSnyderNotFuckSnyder (346 ms)
[ RUN ] TesseractTest.AdaptToWordStrTest
Trying to adapt "136
" to "1 3 6"
Trying to adapt "256
" to "2 5 6"
Trying to adapt "410
" to "4 1 0"
Trying to adapt "432
" to "4 3 2"
Trying to adapt "540
" to "5 4 0"
Trying to adapt "692
" to "6 9 2"
Trying to adapt "779
" to "7 7 9"
Trying to adapt "793
" to "7 9 3"
Trying to adapt "808
" to "8 0 8"
Trying to adapt "815
" to "8 1 5"
Trying to adapt "12
" to "1 2"
Trying to adapt "12
" to "1 2"
[ OK ] TesseractTest.AdaptToWordStrTest (788 ms)
[ RUN ] TesseractTest.BasicLSTMTest
[ OK ] TesseractTest.BasicLSTMTest (4525 ms)
[ RUN ] TesseractTest.LSTMGeometryTest
[ OK ] TesseractTest.LSTMGeometryTest (615 ms)
[ RUN ] TesseractTest.InitConfigOnlyTest
Error: unichar ? in normproto file is not in unichar set.
Error: unichar 0.232621 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar 0.231864 in normproto file is not in unichar set.
[...]
Error: unichar ? in normproto file is not in unichar set.
Error: unichar 0.233915 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar 0.221755 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar ? in normproto file is not in unichar set.
baseapi_test(21845,0x1134c45c0) malloc: *** error for object 0x927f96c28005e0: pointer being freed was not allocated
baseapi_test(21845,0x1134c45c0) malloc: *** set a breakpoint in malloc_error_break to debug
[INFO] Lang eng took 327ms in regular init[INFO] Lang chi_tra took 1422ms in regular initAbort trap: 6
2019-05-18: Fixed in commit 36ed6da3499c93c2d04de29ee2f02f6d9975a1fe. 2019-05-18: malloc/free issue fixed in commit 09edd1a6048029f1578d5addaaaa065c1594a7d4.
@GitHubGS, this issue should be fixed now in branch 4.1 and in Git master. Can we close it?
Environment
Current Behavior:
My situation is a little complicated. I built Tesseract into a library for another application to call. Loading chi_sim in "api->init" fails, but ONLY in the IDE (PyCharm) environment. After debugging, I traced it to the function "load_via_fgets" in the file "tesseract/ccutil/unicharset.cpp": from row 825, sscanf returns 1/1/1/1/1/1/1 rather than 17/16/10/8/4/3/2, so the function returns 'false' to "bool UNICHARSET::load_from_file(tesseract::TFile *file, bool skip_fragments)" in row 781. ATTENTION: this does not happen on the terminal command line, only in the IDE. I also found the same problem reported for tess4j, link:. Looking forward to hearing from you, thanks so much.
Expected Behavior:
Suggested Fix: