tesseract.detect_orientation() dies with empty pages

ghost commented 8 years ago

Hi!

I'm encountering this error with some of my PDFs:

consumer_1 |    **** Warning: considering '0000000000 XXXXX n' as a free entry.
consumer_1 |    **** Warning: considering '0000000000 XXXXX n' as a free entry.
consumer_1 |    **** Warning: considering '0000000000 XXXXX n' as a free entry.
consumer_1 |    **** Warning: considering '0000000000 XXXXX n' as a free entry.
consumer_1 |    **** Warning: considering '0000000000 XXXXX n' as a free entry.
consumer_1 |    **** Warning: considering '0000000000 XXXXX n' as a free entry.
consumer_1 |    **** Warning: considering '0000000000 XXXXX n' as a free entry.
consumer_1 |
consumer_1 |    **** This file had errors that were repaired or ignored.
consumer_1 |    **** The file was produced by:
consumer_1 |    **** >>>> Mac OS X 10.8.2 Quartz PDFContext <<<<
consumer_1 |    **** Please notify the author of the software that produced this
consumer_1 |    **** file that it does not conform to Adobe's published PDF
consumer_1 |    **** specification.
consumer_1 |
consumer_1 | multiprocessing.pool.RemoteTraceback:
consumer_1 | """
consumer_1 | multiprocessing.pool.RemoteTraceback:
consumer_1 | """
consumer_1 | Traceback (most recent call last):
consumer_1 |   File "/usr/local/lib/python3.5/site-packages/pyocr/tesseract.py", line 171, in detect_orientation
consumer_1 |     angle = int(output['Orientation in degrees'])
consumer_1 | KeyError: 'Orientation in degrees'
consumer_1 |
consumer_1 | During handling of the above exception, another exception occurred:
consumer_1 |
consumer_1 | Traceback (most recent call last):
consumer_1 |   File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 119, in worker
consumer_1 |     result = (True, func(*args, **kwds))
consumer_1 |   File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
consumer_1 |     return list(map(*args))
consumer_1 |   File "/usr/src/paperless/src/documents/consumer.py", line 32, in image_to_string
consumer_1 |     orientation = self.OCR.detect_orientation(f, lang=lang)
consumer_1 |   File "/usr/local/lib/python3.5/site-packages/pyocr/tesseract.py", line 180, in detect_orientation
consumer_1 |     % original_output)
consumer_1 | pyocr.tesseract.TesseractError: (-1, 'No script found in image (Too few characters. Skipping this page)')
consumer_1 | """

jflesch commented 8 years ago

1) PyOCR doesn't support PDF as input (libpoppler?). You must have a conversion process first. Are you sure the output of this process is OK?

2) The Tesseract output is given in the exception: 'No script found in image (Too few characters. Skipping this page)'. Did this exception happen on an empty page?

ghost commented 8 years ago

1) I'm using https://github.com/danielquinn/paperless

2) I just checked the PDF and there is a page with just a small barcode and no text.

Does this mean that empty pages have to be excised from PDFs? That's a problem with legal documents.

Many thanks for the great project!

jflesch commented 8 years ago

1) Then please open a ticket at danielquinn/paperless first. They will open a ticket here if required.

2) OK

> Does this mean that empty pages have to be excised from PDFs? That's a problem with legal documents.

No, it means the exception has to be caught and handled correctly by the calling program. I can make it a more specific exception if that would help, @danielquinn.
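
A minimal sketch of what handling this in the calling program could look like. The `detect_orientation(image, lang=...)` call and the `TesseractError` name come from the traceback above; the stub classes and the `detect_orientation_or_none` helper are hypothetical and exist only so the example runs without PyOCR or Tesseract installed. Real code would presumably catch `pyocr.tesseract.TesseractError` instead of the stub.

```python
class StubTesseractError(Exception):
    """Stand-in for pyocr.tesseract.TesseractError (assumption, not the real class)."""


class StubTool:
    """Hypothetical stand-in for the PyOCR tool; fails on pages listed as blank."""

    def __init__(self, blank_pages):
        self.blank_pages = set(blank_pages)

    def detect_orientation(self, image, lang=None):
        if image in self.blank_pages:
            # Mirrors the error tuple seen in the traceback.
            raise StubTesseractError(
                -1, "No script found in image (Too few characters. Skipping this page)"
            )
        return {"angle": 0}


def detect_orientation_or_none(tool, image, lang=None, error_type=StubTesseractError):
    """Return the detected orientation, or None when detection fails on this page."""
    try:
        return tool.detect_orientation(image, lang=lang)
    except error_type:
        # Too little text to detect a script: let the caller carry on without rotating.
        return None


tool = StubTool(blank_pages={"page-2"})
for page in ("page-1", "page-2"):
    orientation = detect_orientation_or_none(tool, page, lang="eng")
    if orientation is None:
        print(page, "orientation unknown, leaving page as it is")
    else:
        print(page, "orientation", orientation)
```

With this pattern a page that has too little text for detection (such as one holding only a barcode) is processed without rotation rather than aborting the whole document.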

ghost commented 8 years ago

Okay, thanks - I'll close this.

jflesch commented 8 years ago

Hm, actually, I'll keep this ticket open for now, because there are two things I must do:

1) Add a specific exception for this case (inheriting from TesseractError so as not to break current usage)
2) Update the doc to specify that this exception can be raised
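
For illustration, the subclass idea could look roughly like this. `NoScriptFoundError` is a made-up name, and the `TesseractError` below is a local stand-in with assumed `(status, message)` arguments modelled on the tuple in the traceback, not PyOCR's actual class. The point is only that existing `except TesseractError` handlers would keep working.

```python
class TesseractError(Exception):
    """Local stand-in for PyOCR's TesseractError (assumed: status code plus message)."""

    def __init__(self, status, message):
        super().__init__(status, message)
        self.status = status
        self.message = message


class NoScriptFoundError(TesseractError):
    """Hypothetical subclass for pages where no script could be detected."""


def legacy_caller(error):
    """Existing code that only knows about TesseractError still catches the subclass."""
    try:
        raise error
    except TesseractError as exc:
        return exc.status


def new_caller(error):
    """Newer code can treat the empty-page case separately from other failures."""
    try:
        raise error
    except NoScriptFoundError:
        return "skip orientation"
    except TesseractError:
        return "report error"
```

The more specific `except` clause has to come first in `new_caller`, since the base-class clause would otherwise catch the subclass as well.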

jflesch commented 8 years ago

Doc updated. I actually think the specific exception is not required.

openpaperwork / pyocr, issue #33