Closed jmgilman closed 7 years ago
Weird. :/
Based on what you wrote, the only problem I could imagine causing that would be a permission problem on /tmp (or /tmp full).
Do you feel like patching Pyocr to see what's going on?
If so:
git clone https://github.com/openpaperwork/pyocr.git
cd pyocr
patch -p1 << EOF
diff --git a/src/pyocr/tesseract.py b/src/pyocr/tesseract.py
index 99b0121..39b26e7 100755
--- a/src/pyocr/tesseract.py
+++ b/src/pyocr/tesseract.py
@@ -255,6 +255,7 @@ def run_tesseract(input_filename, output_filename_base, lang=None,
     if configs is not None:
         command += configs
 
+    print ("TESSERACT COMMAND: {}".format(command))
     proc = subprocess.Popen(command,
                             startupinfo=g_subprocess_startup_info,
                             creationflags=g_creation_flags,
EOF
python3 ./setup.py install
This will show you exactly which command pyocr runs.
So I ended up extensively reverse engineering this last night from the REPL and came across some really inconclusive findings. From what I could conclude, I can run all of the code needed to essentially emulate both image_to_string and run_tesseract, and it works fine.
However, the second I bring a temporary directory into the mix, the code fails. To make matters even more confusing, it's actually the context manager around the temporary directory that's causing the error, as I can call tempfile directly to get a directory and then emulate the code with success. An example:
import os
import tempfile
import pyocr.builders
from PIL import Image
from pyocr.tesseract import image_to_string
from pyocr.tesseract import run_tesseract
path = tempfile.mkdtemp(prefix='tess_')
f = Image.open('/app/data/paperless/tmp/paperlessu22lit2g/convert-0000.unpaper.pnm')
f.save(os.path.join(path, "input.bmp"))
lang = 'eng'
builder = pyocr.builders.TextBuilder()
# This works
(status, errors) = run_tesseract("input.bmp", "output", cwd=path,
                                 lang=lang,
                                 flags=builder.tesseract_flags,
                                 configs=builder.tesseract_configs)
# This does not
image_to_string(f, 'eng')
Also, I apologize for failing to mention a critical detail: this code is being run in a read-only Docker container, meaning everything except the files in /app/data and /tmp is set to read-only.
However, even with that being the case, I more or less proved above that there are no issues with /tmp until a context manager is used. Is there a way to force pyocr to use a different temporary directory for further testing?
Pyocr uses tempfile.NamedTemporaryFile() which, like tempfile.mkstemp(), creates its file in the directory returned by tempfile.gettempdir(). According to the Python documentation, you can use the environment variables TMPDIR, TEMP or TMP to define the directory to use.
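To check which directory would actually be picked, here is a minimal sketch. The helper name resolve_temp_dir is made up for illustration and is not part of pyocr or tempfile. It points TMPDIR at a candidate directory and clears the module-level cache tempfile.tempdir, because gettempdir() only consults the environment the first time it is called in a process:

```python
import os
import tempfile


def resolve_temp_dir(override):
    """Return the directory tempfile would use with TMPDIR set to `override`.

    Hypothetical helper: the environment and tempfile's cache are restored
    afterwards, so the calling process is left unchanged.
    """
    saved_env = os.environ.get("TMPDIR")
    saved_cache = tempfile.tempdir
    try:
        os.environ["TMPDIR"] = override
        tempfile.tempdir = None  # drop the cached value so TMPDIR is re-read
        return tempfile.gettempdir()
    finally:
        if saved_env is None:
            os.environ.pop("TMPDIR", None)
        else:
            os.environ["TMPDIR"] = saved_env
        tempfile.tempdir = saved_cache


if __name__ == "__main__":
    # /app/data is the writable directory mentioned earlier in this thread.
    print(resolve_temp_dir("/app/data"))
```

If the candidate directory does not exist or is not writable, tempfile silently falls back to the next candidate (TEMP, TMP, then platform defaults), so a result different from the override is itself a useful hint.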
There is another alternative: you could use pyocr.libtesseract instead of pyocr.tesseract.
pyocr.libtesseract is a binding for libtesseract. It doesn't fork()/exec() and doesn't use temporary files at all. The drawback is that if libtesseract (or Pyocr) crashes, it crashes with a SIGSEGV instead of a Python exception.
If you want to try it, instead of:
import pyocr
ocr_tool = pyocr.get_available_tools()[0]
Try:
import pyocr.libtesseract
ocr_tool = pyocr.libtesseract
BTW, have you tried manually running the very same command as Pyocr, to try to get the exact and complete error message?
Oh I missed this one:
The code -9 comes from subprocess.Popen().wait(), and the documentation states that:
A negative value -N indicates that the child was terminated by signal N (POSIX only).
And signal 9 is SIGKILL.
So basically, your Tesseract process has been killed by another process using SIGKILL. Which also explains the truncated output you got.
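The mapping from a negative return code to a signal can be reproduced without Tesseract. The following is a minimal sketch for POSIX systems only; describe_returncode is a made-up helper name, and the sleeping child process merely stands in for Tesseract:

```python
import signal
import subprocess
import sys


def describe_returncode(returncode):
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # Negative values mean "terminated by signal -returncode" (POSIX only).
        return "terminated by " + signal.Signals(-returncode).name
    return "exited with status {}".format(returncode)


# Stand-in child process that would otherwise sleep for a minute.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
proc.kill()  # on POSIX, Popen.kill() sends SIGKILL
proc.wait()
print(proc.returncode, describe_returncode(proc.returncode))
```

Logging the decoded signal name next to the raw return code makes this kind of failure much easier to recognise than a bare -9.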
Thanks for the information.
This is a very peculiar problem. It seems almost like AppArmor doesn't like the way in which the code is executing. Using:
import pyocr.libtesseract
ocr_tool = pyocr.libtesseract
ocr_tool.image_to_string(f, 'eng')
Simply returns:
Killed
I also had set the environment variable to move the temp directory location to a confirmed read/write directory. I'll try running the patch to get the exact command being executed and confirm if I can run it myself outside of Python.
Killed
Wow, yeah, this is unexpected. pyocr.libtesseract is pretty much a traditional Python/C binding. The only kind-of-unusual thing is that it uses ctypes (i.e. pure Python) and loads libtesseract dynamically.
So here was the result:
TESSERACT COMMAND: ['tesseract', 'input.bmp', 'output', '-l', 'eng', '-psm', '3']
Running that exact command from the command line works without issues, so it's definitely something with the way subprocess is running it.
That's weird, file paths should have been something like /tmp/tess_xxxx.bmp and /tmp/tess_output.
Which version of Pyocr do you use?
The above result was from installing and running the cloned repository, so whatever is on master.
Line 352 here saves the file to input.bmp: https://github.com/openpaperwork/pyocr/blob/master/src/pyocr/tesseract.py
After more testing, the problem turned out to be due to memory limits. Somehow calling the process directly didn't exceed the container's memory limit, but spawning it as a subprocess from Python did.
Thanks for the help.
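One plausible reading, not confirmed in this thread, is that a container memory limit counts every process in the container together, so the Python process holding the image and the Tesseract child add up. A rough way to see how much a child process needs is to read its peak resident set size after it exits. This is a POSIX-only sketch; peak_child_rss is a made-up helper name, and the command passed to it is whatever you want to measure:

```python
import resource
import subprocess
import sys


def peak_child_rss(command):
    """Run `command` to completion and return the peak RSS of waited-for children.

    The value is the high-water mark over all children this process has
    reaped so far. Units are platform dependent: kilobytes on Linux,
    bytes on macOS.
    """
    subprocess.run(command, check=False)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return usage.ru_maxrss


if __name__ == "__main__":
    # Stand-in command; substitute the Tesseract command line printed by the patch.
    print(peak_child_rss([sys.executable, "-c", "pass"]))
```

Comparing that figure, plus the memory used by the Python process itself, against the container limit gives a first indication of whether the limit is too tight.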
You're welcome :)
BTW, based on what you told me, I think the root of the problem is that, at some point, the image is uncompressed as a bitmap in memory. It cannot be avoided.
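To get a feel for the size of such an uncompressed bitmap, the arithmetic is just width times height times bytes per pixel. The page dimensions below are an illustrative assumption (an A4 page scanned at 300 dpi in 24-bit RGB), not figures from this thread:

```python
def bitmap_bytes(width_px, height_px, bytes_per_pixel):
    """Size in bytes of an uncompressed bitmap with the given dimensions."""
    return width_px * height_px * bytes_per_pixel


# Assumed example: A4 at 300 dpi is roughly 2480 x 3508 pixels, 3 bytes per pixel.
a4_rgb = bitmap_bytes(2480, 3508, 3)
print("{:.1f} MiB for a single copy of the page".format(a4_rgb / 2**20))
```

Several such copies can exist at once (the original image, the converted input file, Tesseract's own working buffers), which is how a modest scan can add up against a small memory limit.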
Just out of curiosity: what memory limit did you have set before and what is the new one @jmgilman? Thanks!
It was 256MB and I moved it up to 1024MB.
Using the latest version of pyocr and attempting to parse text on a file run through unpaper:
is causing the following stack trace:
However, I can run the following command:
And it works without errors. After searching, I cannot find any reference to the -9 return value, and it seems like the error output is being truncated (it's just the top stdout when you first run Tesseract).
Suggestions?