Open philayres opened 2 years ago
000002_rasterize.png shows no text which could indicate orientation or skew angle. So the returned error code simply indicates that the requested operation could not be done. With --psm 0 it also returns an error code, but prints an additional error message.
I am not sure that the current behaviour should be changed. Maybe ocrmypdf should be changed to accept an error code for empty pages.
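To illustrate the suggestion above, here is a minimal Python sketch of how a calling program could tolerate a non-zero exit code from the orientation step. It is not ocrmypdf's actual code: the function names `interpret_osd_result` and `detect_orientation` are hypothetical, and treating every non-zero exit as "no orientation available" is an assumption made for the example.

```python
import subprocess
from typing import Optional


def interpret_osd_result(returncode: int, stdout: str) -> Optional[str]:
    """Return the orientation output, or None if nothing usable came back.

    Assumption: a non-zero exit code is treated as "no orientation
    available" instead of a fatal error. This cannot tell an empty page
    apart from a genuine failure.
    """
    if returncode != 0 or not stdout.strip():
        return None
    return stdout


def detect_orientation(image_path: str) -> Optional[str]:
    """Hypothetical helper: run tesseract with --psm 0 and tolerate failure."""
    proc = subprocess.run(
        ["tesseract", image_path, "stdout", "--psm", "0"],
        capture_output=True,
        text=True,
    )
    return interpret_osd_result(proc.returncode, proc.stdout)
```

With this approach a blank page such as 000002_rasterize.png would simply skip deskewing, at the cost of also hiding real failures in that step.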
It seems strange to me that a non-zero error code would be returned. The requested action completed successfully, but the end result was no text, which is a valid result.
I would be less inclined to argue this, but the -c 'textonly_pdf=1' option doesn't return an error code, so the results are inconsistent.
I haven't dug into the code, but does error code 1 consistently mean "no text found"? Or could other errors or results also return error code 1? With no prior knowledge of what is going to be found in an image, there needs to be a way to know whether a real error occurred, or whether the document was just empty and no text was returned.
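To make the ambiguity concrete, here is a small Python sketch of what a caller can and cannot conclude from the exit code alone. The names `OcrOutcome` and `classify` are hypothetical, and the mapping is an assumption for illustration, not documented tesseract behaviour.

```python
from enum import Enum


class OcrOutcome(Enum):
    TEXT = "text"            # exit code 0 and some text was produced
    EMPTY = "empty"          # exit code 0 and the output was blank
    AMBIGUOUS = "ambiguous"  # non-zero exit: empty page or real error


def classify(returncode: int, text: str) -> OcrOutcome:
    """Classify a finished OCR run from its exit code and output text."""
    if returncode != 0:
        # The caller cannot tell "nothing to find" from "something broke".
        return OcrOutcome.AMBIGUOUS
    return OcrOutcome.TEXT if text.strip() else OcrOutcome.EMPTY
```

If an empty page returned 0, the ambiguous case would only ever mean a real error.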
That is my argument anyway. I'm guessing the ocrmypdf developers were not expecting a non-zero error code, or something has changed, since this seems like an obvious test case that would have failed on their side. That leads me to suggest this is a bug in tesseract.
Environment
This is running on a CentOS 7 machine running a GNOME desktop.
FYI, tesseract was installed with Anaconda today.
Current Behavior:
I have a simple, mostly blank image file on which I run
It immediately returns with return code 1
Doing the same with a similar image, 000001_rasterize.png, returns normally. The return code is 0 as expected.
As you may be able to tell, these images came out of an ocrmypdf pipeline, which crashes on the bad image:
ocrmypdf -v 1 --deskew the_scientific_method-print10.pdf the_scientific_method-printed-ocr.pdf
Without --deskew, this runs through fine, but the tesseract command being run is different (it does something like this...)
This returns code 0 and a blank .txt file as expected.
The files are downloadable from Google Drive:
000002_rasterize.png
000001_rasterize.png
the_scientific_method-print10.pdf
Expected Behavior:
The failing image should return no text rather than an error code.