tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Simple file causes error code 1 #3675

Open philayres opened 2 years ago

philayres commented 2 years ago

Environment

tesseract 5.0.0
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.1 : libopenjp2 2.4.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.5.2 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.0

This is running on a CentOS 7 machine with a GNOME desktop.

Linux hostname 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

FYI, tesseract was installed with Anaconda today.

Current Behavior:

I have a simple, mostly blank image file on which I run

tesseract -l eng --psm 2 '/tmp/ocrmypdf.io.4npvxhru/000002_rasterize.png' stdout

It immediately returns with return code 1

Doing the same with a similar image 000001_rasterize.png returns

Orientation: 0
WritingDirection: 0
TextlineOrder: 2
Deskew angle: 0.0000

Return code is 0 as expected.
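For anyone reproducing this from a script rather than a shell, the return code and any stderr output can be captured together. This is a minimal sketch, not something from the original report: `run_and_report` is a hypothetical helper name, and it only assumes that a `tesseract` binary is on the PATH when it is called with the command shown above.

```python
import subprocess


def run_and_report(cmd):
    """Run a command and return (return code, stdout, stderr) as text.

    Hypothetical helper for illustration; it does not interpret the
    return code, it only makes it visible alongside the output.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


# Example call (assumes tesseract is installed and the image exists):
# code, out, err = run_and_report(
#     ["tesseract", "-l", "eng", "--psm", "2", "000002_rasterize.png", "stdout"]
# )
# print(code, repr(out), repr(err))
```

Printing stderr next to the return code helps show whether tesseract emitted any message at all for the failing image.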

As you may be able to tell, these images came out of an ocrmypdf pipeline, which crashes on the bad image:

ocrmypdf -v 1 --deskew the_scientific_method-print10.pdf the_scientific_method-printed-ocr.pdf

Without --deskew, this runs through fine, but the tesseract command being run is different (something like this):

tesseract -l eng -c 'textonly_pdf=1' '/tmp/ocrmypdf.io.4npvxhru/000002_rasterize.png' new-file pdf txt

This returns code 0 and a blank .txt file as expected.


The files are downloadable from Google Drive:

000002_rasterize.png

000001_rasterize.png

the_scientific_method-print10.pdf

Expected Behavior:

The failing image should return no text, not an error code.

stweil commented 2 years ago

000002_rasterize.png shows no text which could indicate orientation or skew angle. So the returned error code simply indicates that the requested operation could not be done. With --psm 0 it also returns an error code, but prints an additional error message.

I am not sure that the current behaviour should be changed. Maybe ocrmypdf should be changed to accept an error code for empty pages.

philayres commented 2 years ago

It seems strange to me that a non-zero error code would be returned. The requested action completed successfully, but the end result was no text, which is a valid result.

I would be less inclined to argue this, but the -c 'textonly_pdf=1' option doesn't return an error code, so the results are inconsistent.

I haven't dug into the code, but does error code 1 consistently mean "no text found"? Or could other errors or results also return error code 1? With no prior knowledge of what is going to be found in an image, there needs to be a way to know whether a real error occurred or the document was simply empty and no text was returned.

That is my argument anyway. I'm guessing the ocrmypdf developers were not expecting a non-zero error code, or something has changed, since this seems like an obvious test case that would have failed on their side. That leads me to suggest this is a bug in tesseract.
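Until the meaning of return code 1 is settled, a caller could work around the ambiguity by falling back to a plain OCR run when layout analysis fails. The sketch below is only one possible approach and rests on assumptions: `classify_page` and `default_runner` are hypothetical names, and it assumes that a run with the default page segmentation mode returns 0 with empty output on a blank page, as the textonly_pdf run in this report did. It does not claim that return code 1 always means "no text found".

```python
import subprocess


def default_runner(cmd):
    """Run a command and return (return code, stdout). Hypothetical helper."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout


def classify_page(image, run=default_runner):
    """Classify a page as 'layout', 'blank', 'text', or 'error'.

    Assumption (not confirmed in this thread): a plain OCR run exits
    with 0 and prints nothing when the page has no text.
    """
    # First attempt: the layout analysis command from the report.
    code, out = run(["tesseract", "-l", "eng", "--psm", "2", image, "stdout"])
    if code == 0:
        return "layout"

    # Layout analysis failed; check whether plain OCR still works.
    code, out = run(["tesseract", "-l", "eng", image, "stdout"])
    if code != 0:
        return "error"  # both runs failed, so treat it as a real error
    if not out.strip():
        return "blank"  # OCR succeeded but found no text
    return "text"       # OCR found text even though layout analysis failed
```

The `run` parameter exists so the logic can be exercised without tesseract installed; in normal use it can be left at its default.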