tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

SIGSEGV - `new tesseract::TessBaseAPI()` segfaults on Android #2151

Closed rhardih closed 5 years ago

rhardih commented 5 years ago

I'm in the process of migrating an Android app from version 3.05.01 to version 4.0.0, and I'm already seeing a segfault when trying to create a new base API instance.

I'm building with the Android NDK r18b, and these are my build settings:

cmake \
  -G "Android Gradle - Ninja" \
  -D ANDROID_ABI=armeabi-v7a \
  -D ANDROID_NATIVE_API_LEVEL=23 \
  -D BUILD_TESTS=OFF \
  -D BUILD_TRAINING_TOOLS=OFF \
  -D CMAKE_BUILD_TYPE=Debug \
  -D CMAKE_INSTALL_PREFIX:PATH=/tesseract-build \
  -D CMAKE_MAKE_PROGRAM=/android-sdk/cmake/3.6.4111459/bin/ninja \
  -D CMAKE_TOOLCHAIN_FILE=/android-sdk/ndk-bundle/build/cmake/android.toolchain.cmake \
  -D CPPAN_BUILD=OFF \
  ..

Full setup can be seen in this dockerfile: tesseract-4.0.0.Dockerfile.

The build runs fine, and the resultant .so file seems ok as well:

$ file extracted/tesseract-4.0.0-armv7-a-build/lib/libtesseract.so
extracted/tesseract-4.0.0-armv7-a-build/lib/libtesseract.so: ELF 32-bit LSB shared object, ARM, EABI5 version 1 (SYSV), dynamically linked, BuildID[sha1]=6a0dcecff780e71292e3952aaf3647753d450768, with debug_info, not stripped

I have a small Qt unit test that does nothing other than try to create a new instance of the base API. Source can be seen here: tst_tesseract_4_0_0.cpp.

It's pretty simple and just links in tesseract and leptonica, and bundles libtiff.

It is based off of the same test for v3.05.01 which runs just fine: tst_tesseract_3_05_01.cpp.


The segfault appears to be caused by the locale assertion introduced in https://github.com/tesseract-ocr/tesseract/commit/3292484f67af8bdda23aa5e510918d0115785291, because the value of locale in my case is C.UTF-8.

The failure produces no errors or warnings about the locale value; I only stumbled upon the cause because I happened to have source maps set up for debugging.
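For anyone wanting to confirm the same cause, here is a minimal sketch of a check that could run just before `new tesseract::TessBaseAPI()`. It uses only the standard `std::setlocale` call. The helper names (`category_is_c`, `locale_is_plain_c`, `force_plain_c_locale`) are hypothetical, and the choice of categories is an assumption about what the assertion inspects, based on the fact that `C.UTF-8` trips it while presumably `C` does not.

```cpp
#include <clocale>
#include <cstring>

// Hypothetical helper: true when the given category currently reports exactly "C".
// Passing nullptr to std::setlocale only queries the locale; it changes nothing.
static bool category_is_c(int category) {
  const char* name = std::setlocale(category, nullptr);
  return name != nullptr && std::strcmp(name, "C") == 0;
}

// Assumption: the assertion wants a plain "C" locale, so "C.UTF-8" fails this check.
bool locale_is_plain_c() {
  return category_is_c(LC_ALL) &&
         category_is_c(LC_CTYPE) &&
         category_is_c(LC_NUMERIC);
}

// Possible workaround: request the "C" locale, then report whether it took effect.
// Whether the platform's libc really reports "C" afterwards is not guaranteed.
bool force_plain_c_locale() {
  std::setlocale(LC_ALL, "C");
  return locale_is_plain_c();
}
```

If `locale_is_plain_c()` returns false before the constructor call, that would be consistent with the locale being the trigger. Calling `force_plain_c_locale()` first may avoid the crash, but I haven't confirmed that it is a reasonable fix on Android.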


Environment

Current Behavior:

SIGSEGV when calling new tesseract::TessBaseAPI().

Expected Behavior:

No segfaults.

Suggested Fix:

Provide some sort of warning about the locale maybe.

CC @stweil

rhardih commented 5 years ago

Also, I'm not really sure what I'm supposed to do as a reasonable fix in my case.

Is "C" vs. "C.UTF-8" actually grounds for failing here?

zdenop commented 5 years ago

IMO this is a duplicate of https://github.com/tesseract-ocr/tesseract/issues/1670

rhardih commented 5 years ago

I think you're right. I've subscribed to the other issue.