nguyenq / tess4j

Java JNA wrapper for Tesseract OCR API
Apache License 2.0

JVM crash due to C [libtesseract.so.4+0x251a7d] tesseract::HistogramRect(Pix*, int, int, int, int, int, int*)+0xfd #204

Closed wkoszycki closed 3 years ago

wkoszycki commented 3 years ago

While processing TIFF files, the following fatal error occurs for some of the files:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f8b08341a7d, pid=15619, tid=15755
#
# JRE version: OpenJDK Runtime Environment (11.0.9+11) (build 11.0.9+11-post-Debian-1deb10u1)
# Java VM: OpenJDK 64-Bit Server VM (11.0.9+11-post-Debian-1deb10u1, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C  [libtesseract.so.4+0x251a7d]  tesseract::HistogramRect(Pix*, int, int, int, int, int, int*)+0xfd

Versions:

tess4j 4.3.1
Linux version 4.19.0-10-amd64 (debian-kernel@lists.debian.org) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.132-1 (2020-07-24)
tesseract 4.0.0
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX
 Found SSE

I also set export LC_ALL=C and call tesseract.setOcrEngineMode(1);

nguyenq commented 3 years ago

The exception originated in the native code. The HistogramRect method is defined in the Otsu thresholding module, otsuthr. You may want to trace through it to determine why some of your images could not be consumed by the tesseract engine.
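For readers unfamiliar with this kind of routine, here is a minimal Java sketch of a histogram computed over a rectangle of an 8-bit grayscale image. It is not Tesseract's implementation; RectHistogram and its parameters are hypothetical names chosen only to echo the left/top/width/height arguments in the crashing frame's signature. The sketch clips the rectangle to the image first, because native code that indexes pixels outside the image has no array bounds check to fall back on. Whether that is what happens in this crash is not established by the thread.

```java
/** Illustrative only: a bounds-checked histogram over a rectangle of an 8-bit grayscale image. */
final class RectHistogram {

    /**
     * Counts pixel values (0 to 255) inside the given rectangle.
     * gray[y][x] holds one 8-bit sample per pixel; all names here are hypothetical.
     */
    static int[] histogram(int[][] gray, int left, int top, int width, int height) {
        int imgHeight = gray.length;
        int imgWidth = imgHeight == 0 ? 0 : gray[0].length;

        // Clip the rectangle to the image so no index falls outside the pixel data.
        int x0 = Math.max(0, left);
        int y0 = Math.max(0, top);
        int x1 = Math.min(imgWidth, left + width);
        int y1 = Math.min(imgHeight, top + height);

        int[] hist = new int[256];
        for (int y = y0; y < y1; y++) {
            for (int x = x0; x < x1; x++) {
                // Mask to 8 bits so an unexpected sample value cannot index past the table.
                hist[gray[y][x] & 0xFF]++;
            }
        }
        return hist;
    }
}
```

When tracing the native code, the equivalent things to check would be the rectangle passed in and the image's dimensions and bit depth at the moment of the crash.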

wkoszycki commented 3 years ago

@nguyenq thanks, I will try to reproduce with pure tesseract and get back to you.

wkoszycki commented 3 years ago

@nguyenq I ran it from the command line with every available psm option:

tesseract -l pol --oem 1 --psm <0-13> input.tif output.txt

No error occurred.
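A sweep like the one above can be scripted so that every mode is run the same way and the exit codes are recorded. The following is a rough sketch using only the JDK; PsmSweep, the "input.tif" file name, and the "output-psm" prefix are placeholders, and it assumes the tesseract binary is on the PATH. It prints the commands by default and only executes them when started with --run.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Sketch: one tesseract CLI invocation per page segmentation mode (0 to 13). */
final class PsmSweep {

    /** Builds the argument list for a single run, in the same order as the command above. */
    static List<String> buildCommand(String lang, int oem, int psm, String input, String outputBase) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add("-l");
        cmd.add(lang);
        cmd.add("--oem");
        cmd.add(Integer.toString(oem));
        cmd.add("--psm");
        cmd.add(Integer.toString(psm));
        cmd.add(input);
        cmd.add(outputBase);
        return cmd;
    }

    /** Runs one command and returns its exit code; a non-zero value generally indicates failure. */
    static int run(List<String> cmd) throws IOException, InterruptedException {
        return new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Dry run by default: only print the commands. Pass --run to execute them.
        boolean execute = args.length > 0 && args[0].equals("--run");
        for (int psm = 0; psm <= 13; psm++) {
            // "input.tif" and the "output-psm" prefix are placeholder names.
            List<String> cmd = buildCommand("pol", 1, psm, "input.tif", "output-psm" + psm);
            System.out.println(String.join(" ", cmd));
            if (execute) {
                System.out.println("psm " + psm + " exit code: " + run(cmd));
            }
        }
    }
}
```

Note that the tesseract CLI treats the last argument as an output base and appends .txt itself, so passing output.txt produces output.txt.txt. A clean run from the command line also does not guarantee that the same code path is exercised as in the Tess4J call.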

To replicate the issue and see all the options used during tess4j execution, I set logging.level.net.sourceforge.tess4j=DEBUG, but there were no additional logs. Is there a way to see exactly what is being executed underneath?

nguyenq commented 3 years ago

The source code for both Tess4J and Tesseract is available for your investigation. If you can set up your IDE for native code debugging, you'd be able to step from Tess4J's Java code into Tesseract's C++ code and observe what is going on under the hood.
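Short of a native debugging session, a lighter first step might be to compare the files that crash with the ones that do not. The sketch below lists the size and pixel depth of every page in a TIFF using only the JDK. It assumes Java 9 or later, where ImageIO ships a TIFF reader (the JRE in this report is 11). TiffInspector and describe are hypothetical names, and the output reflects what Java's ImageIO sees, which may differ from what Leptonica decodes.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.ImageTypeSpecifier;
import javax.imageio.stream.ImageInputStream;

/** Sketch: report per-page dimensions and pixel depth of an image file. */
final class TiffInspector {

    /** Returns one description line per page found in the file. */
    static List<String> describe(File file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (ImageInputStream in = ImageIO.createImageInputStream(file)) {
            if (in == null) {
                throw new IOException("Cannot open " + file);
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("No ImageIO reader for " + file);
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                // allowSearch = true lets the reader scan the stream to count pages.
                int pages = reader.getNumImages(true);
                for (int i = 0; i < pages; i++) {
                    // The raw type can be null if the reader cannot express it.
                    ImageTypeSpecifier raw = reader.getRawImageType(i);
                    String bits = raw == null
                            ? "unknown"
                            : Integer.toString(raw.getColorModel().getPixelSize());
                    lines.add("page " + i + ": " + reader.getWidth(i) + "x" + reader.getHeight(i)
                            + ", bits per pixel: " + bits);
                }
            } finally {
                reader.dispose();
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Each argument is a path to a file to inspect.
        for (String path : args) {
            System.out.println(path);
            for (String line : describe(new File(path))) {
                System.out.println("  " + line);
            }
        }
    }
}
```

Running it over a crashing file and a working file side by side may show a difference, such as an unusual bit depth or a very small page, that narrows down where to look in the native code.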

nguyenq commented 3 years ago

Not reproducible.