jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

Consumer errors on certain PDFs #180

Closed: kylebavis closed this issue 3 years ago

kylebavis commented 3 years ago

I have a specific PDF failing to import when I drop it into the consumer folder. If I run the file through the ocrmypdf utility manually before trying to import it, using just the default settings, i.e. ocrmypdf input.pdf output.pdf, the consumer processes the new file without issue. The original PDF doesn't have an embedded text layer.

I see this error in the logs:

ERROR 2020-12-23 17:59:28,505 loggers Error while consuming document example.pdf: 'NoneType' object has no attribute 'userunit'
17:59:28 [Q] ERROR Failed [example.pdf] - 'NoneType' object has no attribute 'userunit' : Traceback (most recent call last):
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 216, in parse
    ocrmypdf.ocr(**ocr_args)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/api.py", line 316, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 366, in run_pipeline
    validate_pdfinfo_options(context)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 177, in validate_pdfinfo_options
    if pdfinfo.has_userunit and options.output_type.startswith('pdfa'):
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/pdfinfo/info.py", line 840, in has_userunit
    return any(page.userunit != 1.0 for page in self.pages)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/pdfinfo/info.py", line 840, in <genexpr>
    return any(page.userunit != 1.0 for page in self.pages)
AttributeError: 'NoneType' object has no attribute 'userunit'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/paperless/src/documents/consumer.py", line 131, in try_consume_file
    document_parser.parse(self.path, mime_type)
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 237, in parse
    raise ParseError(e)
documents.parsers.ParseError: 'NoneType' object has no attribute 'userunit'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
    res = f(*task["args"], **task["kwargs"])
  File "/usr/src/paperless/src/documents/tasks.py", line 73, in consume_file
    override_tag_ids=override_tag_ids)
  File "/usr/src/paperless/src/documents/consumer.py", line 148, in try_consume_file
    raise ConsumerError(e)
documents.consumer.ConsumerError: 'NoneType' object has no attribute 'userunit'
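
For readers following the first traceback: the innermost frame iterates over self.pages and reads page.userunit, so the error means at least one entry in that list was None. Below is a minimal sketch of that failure pattern. PageInfo and describe are hypothetical stand-ins written for illustration, not OCRmyPDF's real classes, and the sketch says nothing about why OCRmyPDF ended up with a None page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageInfo:
    """Hypothetical stand-in for a per-page info object."""
    userunit: float = 1.0


def has_userunit(pages: List[Optional[PageInfo]]) -> bool:
    # Same expression as the failing frame in the traceback.
    return any(page.userunit != 1.0 for page in pages)


def describe(pages: List[Optional[PageInfo]]) -> str:
    """Return the result as text, or the error message if the check raises."""
    try:
        return str(has_userunit(pages))
    except AttributeError as exc:
        return f"AttributeError: {exc}"


if __name__ == "__main__":
    print(describe([PageInfo(), PageInfo()]))  # every page present
    print(describe([PageInfo(), None]))        # one page entry is None
```

The second call fails in the same way as the log above, because the generator expression reaches the None entry and tries to read an attribute on it.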

jonaswinkler commented 3 years ago

Could you also please open the Logs on paperless and set the filter to DEBUG?

There should be some line "Calling OCRmyPDF with ...", that will be useful. Also, anything between "Consuming document" and the error.

Since this does not happen when you're manually running that file through ocrmypdf, it must be some argument that causes this issue with OCRmyPDF. There's also probably nothing I can do about this, except forwarding this issue to the OCRmyPDF repo.

Also: Anything noteworthy about the document? Do other documents from the same source get added without issues?

kylebavis commented 3 years ago

Debug output:

12/23/20, 1:06 PM ERROR Error while consuming document example.pdf: 'NoneType' object has no attribute 'userunit'

12/23/20, 1:06 PM DEBUG Deleting directory /tmp/paperless/paperless-09zw1r49

12/23/20, 1:06 PM DEBUG Calling OCRmyPDF with {'input_file': '/usr/src/paperless/src/../consume/example.pdf', 'output_file': '/tmp/paperless/paperless-09zw1r49/archive.pdf', 'use_threads': True, 'jobs': 2, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'clean': True, 'redo_ocr': True}

12/23/20, 1:06 PM DEBUG No text was found in the document and skip is specified. Upgrading OCR mode to redo.

12/23/20, 1:06 PM DEBUG Parsing example.pdf...

12/23/20, 1:06 PM DEBUG Parser: RasterisedDocumentParser based on mime type application/pdf

12/23/20, 1:06 PM INFO Consuming example.pdf

I am not especially knowledgeable about the PDF format, but I did poke through the document properties in Acrobat and nothing stood out to me. It's a policy document I downloaded from my insurance company, so presumably it's . Interestingly, some docs from that same source import just fine. About 50% of them fail.

I just re-tested the OCR step using the same arguments shown in the debug output and got the same error message as soon as I added --redo-ocr, so that option seems to be the trigger. As you said, not a paperless-ng issue.
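
For anyone who wants to repeat that test, here is a rough sketch that turns the dictionary from the "Calling OCRmyPDF with" debug line into an ocrmypdf command line. The helper name to_cli and the mapping tables are my own, and only the options I am sure have a CLI equivalent are mapped; the remaining keys (use_threads, progress_bar) are reported back as skipped rather than guessed at.

```python
from typing import Any, Dict, List, Tuple

# Keyword argument -> CLI flag. Only options seen in the debug line are mapped.
VALUE_FLAGS = {
    "language": "--language",
    "jobs": "--jobs",
    "output_type": "--output-type",
}
SWITCH_FLAGS = {
    "clean": "--clean",
    "redo_ocr": "--redo-ocr",
}


def to_cli(ocr_args: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Build an ocrmypdf argv list; return it with the keys that were not mapped."""
    argv = ["ocrmypdf"]
    skipped = []
    for key, value in ocr_args.items():
        if key in ("input_file", "output_file"):
            continue  # positional arguments, appended at the end
        if key in VALUE_FLAGS:
            argv += [VALUE_FLAGS[key], str(value)]
        elif key in SWITCH_FLAGS:
            if value:  # boolean switches are only emitted when enabled
                argv.append(SWITCH_FLAGS[key])
        else:
            skipped.append(key)
    argv += [str(ocr_args["input_file"]), str(ocr_args["output_file"])]
    return argv, skipped


if __name__ == "__main__":
    args = {
        "input_file": "example.pdf",
        "output_file": "archive.pdf",
        "use_threads": True,
        "jobs": 2,
        "language": "eng",
        "output_type": "pdfa",
        "progress_bar": False,
        "clean": True,
        "redo_ocr": True,
    }
    argv, skipped = to_cli(args)
    print(" ".join(argv))
    print("not mapped:", skipped)
```

Dropping "redo_ocr" from the dictionary gives the variant of the command without that option, which makes it easy to compare the two runs.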

kylebavis commented 3 years ago

As an aside, ocrmypdf works on this document when I pass --force-ocr instead of --redo-ocr. I didn't test this inside of paperless, but I assume the result would be the same if I set PAPERLESS_OCR_MODE to force. It would be nice to have the option to fall back to force if redo fails, but I'm probably just going to clean up any failing documents manually now that I know an easy workaround.
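
Roughly what I have in mind for the fallback, as a user-side sketch only. This is not how paperless-ng behaves; ocr_with_fallback and the run callable are hypothetical names, and run stands for whatever actually performs OCR in a given mode.

```python
from typing import Callable, Optional, Sequence


def ocr_with_fallback(
    run: Callable[[str], None],
    modes: Sequence[str] = ("redo", "force"),
) -> str:
    """Try each OCR mode in order and return the first one that succeeds."""
    last_error: Optional[Exception] = None
    for mode in modes:
        try:
            run(mode)
            return mode
        except Exception as exc:  # remember the failure and try the next mode
            last_error = exc
    raise RuntimeError(f"all OCR modes failed: {list(modes)}") from last_error


if __name__ == "__main__":
    def fake_run(mode: str) -> None:
        # Simulates the behaviour reported here: redo fails, force works.
        if mode == "redo":
            raise AttributeError("'NoneType' object has no attribute 'userunit'")

    print(ocr_with_fallback(fake_run))
```

The order of the modes matters: force is only attempted after redo has failed, so documents that work with redo are never processed in force mode.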

jonaswinkler commented 3 years ago

I'm not going to add a fallback to --force-ocr, since that mode rasterizes PDF documents (creates images for all pages).

Since this is from an insurance company, it might be something related to security, some means they used to make modifications to the file harder, I don't know. I'll post that over at OCRmyPDF.

jonaswinkler commented 3 years ago

How many pages does the document have? (Will probably be helpful if you feel like sharing)

jonaswinkler commented 3 years ago

It would also be helpful for the author if you could run OCRmyPDF with -v1, check that for sensitive information, and post that over there.

kylebavis commented 3 years ago

The file I used for testing has 30 pages. I just loaded in a batch of failing files (from the same source site) that all threw the same error message; that batch ranged in size from 4-8 pages. I'll update your issue with the output from -v1 shortly.

jonaswinkler commented 3 years ago

I'm just moving the discussion of many related issues into one new issue.