openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab
https://gitlab.gnome.org/World/OpenPaperwork/pyocr

Cryptic TesseractError (-9) when processing image #69

Closed jmgilman closed 7 years ago

jmgilman commented 7 years ago

I am using the latest version of pyocr and attempting to parse text from a file that was run through unpaper:

with open('test.unpaper.pnm', 'r') as f:
    text = ocr.image_to_string(f, lang='eng')

This causes the following stack trace:

File "/usr/local/lib/python3.5/dist-packages/pyocr/tesseract.py", line 358, in image_to_string
    raise TesseractError(status, errors)
pyocr.error.TesseractError: (-9, b'Tesseract Open Source OCR Engine v3.04.01 with Leptonica\n')

However, I can run the following command:

tesseract test.unpaper.pnm output

And it works without errors. After searching, I cannot find any reference to the -9 return value, and it seems like the error output is being truncated (it's just the top stdout when you first run Tesseract).

Suggestions?

jflesch commented 7 years ago

Weird. :/

Based on what you wrote, the only problem I could imagine causing that would be a permission problem on /tmp (or /tmp full).

Do you feel like patching Pyocr to see what's going on?

If so:

git clone https://github.com/openpaperwork/pyocr.git
cd pyocr

patch -p1 << EOF
diff --git a/src/pyocr/tesseract.py b/src/pyocr/tesseract.py
index 99b0121..39b26e7 100755
--- a/src/pyocr/tesseract.py
+++ b/src/pyocr/tesseract.py
@@ -255,6 +255,7 @@ def run_tesseract(input_filename, output_filename_base, lang=None,
     if configs is not None:
         command += configs

+    print ("TESSERACT COMMAND: {}".format(command))
     proc = subprocess.Popen(command,
                             startupinfo=g_subprocess_startup_info,
                             creationflags=g_creation_flags,
EOF

python3 ./setup.py install

This will let you see exactly which command pyocr runs.

jmgilman commented 7 years ago

So I ended up extensively reverse engineering this last night from the REPL and came across some really inconclusive findings. From what I can tell, I can run all of the code needed to emulate both image_to_string and run_tesseract, and it works fine.

However, the moment I bring a temporary directory into the picture, the code fails. To make matters even more confusing, it's actually the context manager around the temporary directory that's causing the error: I can call tempfile directly to get a directory and then emulate the code successfully. An example:

import os
import tempfile

import pyocr.builders

from PIL import Image
from pyocr.tesseract import image_to_string
from pyocr.tesseract import run_tesseract

path = tempfile.mkdtemp(prefix='tess_')
f = Image.open('/app/data/paperless/tmp/paperlessu22lit2g/convert-0000.unpaper.pnm')
f.save(os.path.join(path, "input.bmp"))

lang = 'eng'
builder = pyocr.builders.TextBuilder()

# This works
(status, errors) = run_tesseract("input.bmp", "output", cwd=path,
                                 lang=lang,
                                 flags=builder.tesseract_flags,
                                 configs=builder.tesseract_configs)

# This does not
image_to_string(f, 'eng')

Also, I apologize for failing to mention a critical detail: this code is being run in a read-only Docker container, meaning that everything except /app/data and /tmp is read-only.

However, even with that being the case, I more or less proved above that there are no issues with /tmp until a context manager is used. Is there a way to force pyocr to use a different temporary directory for further testing?

jflesch commented 7 years ago

Pyocr uses tempfile.NamedTemporaryFile(), which chooses its directory the same way tempfile.mkstemp() does. According to the Python documentation, you can use the environment variables TMPDIR, TEMP or TMP to define the directory to use.
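As a quick way to check which directory Python would actually pick, here is a minimal sketch using only the standard library. The helper name tempdir_for is made up for illustration and is not part of Pyocr. It temporarily points TMPDIR at a directory and asks tempfile what it would use; the cached value has to be cleared because tempfile only reads the environment once per process.

```python
import os
import tempfile


def tempdir_for(directory):
    """Return the directory tempfile would use if TMPDIR pointed at `directory`.

    Hypothetical helper for illustration; not part of Pyocr.
    """
    saved_env = os.environ.get("TMPDIR")
    saved_cache = tempfile.tempdir
    os.environ["TMPDIR"] = directory
    tempfile.tempdir = None  # drop the cached value so TMPDIR is read again
    try:
        # gettempdir() falls back to other candidates if `directory`
        # does not exist or is not writable.
        return tempfile.gettempdir()
    finally:
        # Restore the previous environment and cache.
        tempfile.tempdir = saved_cache
        if saved_env is None:
            del os.environ["TMPDIR"]
        else:
            os.environ["TMPDIR"] = saved_env
```

If the returned path is not the one you passed in, the directory was probably missing or not writable. In practice it is usually simpler to set TMPDIR in the environment before starting Python, so that it is in place before anything calls tempfile.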

There is another alternative: You could use pyocr.libtesseract instead of pyocr.tesseract. pyocr.libtesseract is a binding for libtesseract. It doesn't fork()/exec() and doesn't use temporary files at all. The drawback is that if libtesseract (or Pyocr) crashes, it crashes with a SIGSEGV instead of a Python exception.

If you want to try it, instead of:

import pyocr
ocr_tool = pyocr.get_available_tools()[0]

Try:

import pyocr.libtesseract
ocr_tool = pyocr.libtesseract

jflesch commented 7 years ago

BTW, have you tried manually running the very same command as Pyocr, to try to get the exact and complete error message?

jflesch commented 7 years ago

Oh I missed this one:

The code -9 comes from subprocess.Popen().wait(), and documentation states that:

A negative value -N indicates that the child was terminated by signal N (POSIX only).

And signal 9 is SIGKILL.

So basically, your Tesseract process has been killed by another process using SIGKILL. Which also explains the truncated output you got.
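To illustrate the return code convention, here is a small standard-library sketch (POSIX only). It does not involve Tesseract: it starts a sleeping Python child, kills it, and shows that wait() reports -9. The helper names describe_exit and run_and_kill are made up for this example.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a Popen return code into a readable description."""
    if returncode < 0:
        # A negative value -N means the child was terminated by signal N.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status {}".format(returncode)


def run_and_kill():
    """Start a child that sleeps, kill it, and return its return code."""
    proc = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    proc.kill()  # sends SIGKILL on POSIX
    return proc.wait()
```

Calling describe_exit(run_and_kill()) should report SIGKILL, which is the same situation as the (-9, ...) pair in the TesseractError above.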

jmgilman commented 7 years ago

Thanks for the information.

This is a very peculiar problem. It seems almost like AppArmor doesn't like the way in which the code is executing. Using:

import pyocr.libtesseract
ocr_tool = pyocr.libtesseract
ocr_tool.image_to_string(f, 'eng')

Simply returns:

Killed

I had also set the environment variable to move the temp directory to a confirmed read/write directory. I'll try running the patch to get the exact command being executed and confirm whether I can run it myself outside of Python.

jflesch commented 7 years ago

Killed

Wow, yeah, this is unexpected. Pyocr.libtesseract is pretty much a traditional Python/C binding. The only thing kind-of-unusual is that it uses ctypes (--> pure Python) and loads libtesseract dynamically.

jmgilman commented 7 years ago

So here was the result:

TESSERACT COMMAND: ['tesseract', 'input.bmp', 'output', '-l', 'eng', '-psm', '3']

Running that exact command from the command line works without issues, so it's definitely something with the way subprocess is running it.

jflesch commented 7 years ago

That's weird, file paths should have been something like /tmp/tess_xxxx.bmp and /tmp/tess_output. Which version of Pyocr do you use?

jmgilman commented 7 years ago

The above result was from installing and running the cloned repository, so whatever is on master.

Line 352 here saves the file to input.bmp: https://github.com/openpaperwork/pyocr/blob/master/src/pyocr/tesseract.py

jmgilman commented 7 years ago

After more testing, the problem turned out to be memory limits. Somehow, calling Tesseract directly stayed under the container's memory limit, but running it as a subprocess from Python exceeded it.
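For anyone wanting to check how much memory a command needs, here is a rough standard-library sketch (Unix only). The helper name peak_child_rss is hypothetical. It relies on resource.getrusage(), whose ru_maxrss field is reported in kilobytes on Linux (other platforms may use different units), and it reflects the largest child this process has waited for so far, not necessarily the last one.

```python
import resource
import subprocess
import sys


def peak_child_rss(command):
    """Run `command` and return the peak resident set size of child processes.

    Hypothetical helper; the value is in kilobytes on Linux and covers
    the largest child waited for by this process so far.
    """
    subprocess.run(command, check=False)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
```

For example, peak_child_rss(['tesseract', 'input.bmp', 'output', '-l', 'eng', '-psm', '3']) would measure the command printed earlier in this thread. Keep in mind that a container memory limit applies to all processes in the container together, so the memory held by the Python process itself counts against it as well.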

Thanks for the help.

jflesch commented 7 years ago

You're welcome :)

jflesch commented 7 years ago

BTW, based on what you told me, I think the root of the problem is that, at some point, the image is decompressed into a bitmap in memory. This cannot be avoided.
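As a back-of-the-envelope illustration of how large an uncompressed bitmap gets, here is a small sketch. The function name and the page dimensions are assumptions made for the example (2480 x 3508 pixels is a commonly quoted size for an A4 page at 300 DPI); the actual size of the scan in this issue is not known.

```python
def uncompressed_size_bytes(width_px, height_px, bytes_per_pixel=3):
    """Estimate the in-memory size of an uncompressed bitmap.

    bytes_per_pixel=3 assumes 24-bit RGB; use 1 for 8-bit grayscale.
    """
    return width_px * height_px * bytes_per_pixel


# Assumed example: an A4 page at 300 DPI in 24-bit RGB.
size = uncompressed_size_bytes(2480, 3508)
print("{:.1f} MiB".format(size / 2**20))
```

That is roughly 25 MiB for a single copy of one page. Several copies may exist at the same time (for instance the image held by Python, the saved BMP being read, and working buffers inside Tesseract), so the total can be a multiple of this figure.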

ddddavidmartin commented 7 years ago

Just out of curiosity: what memory limit did you have set before and what is the new one @jmgilman? Thanks!

jmgilman commented 7 years ago

It was 256MB and I moved it up to 1024MB.