jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

[BUG] paperless-scheduler spams logs with PDF syntax errors for non-PDF files #1102

Closed: mtlynch closed this issue 3 years ago

mtlynch commented 3 years ago

Describe the bug

First, great project! Thanks so much for making it, and I hope you consider allowing users to provide regular donations.

It looks like Paperless tries to parse every file as a PDF. That's fine, but it generates a lot of log spew for non-PDF files.

Suggestion: Don't log warnings about failures to parse a PDF unless the file has a .pdf extension.
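
A minimal sketch of what this suggestion could look like, assuming a wrapper around whatever function does the PDF text extraction. The names `failure_log_level`, `try_extract_text`, and the logger name are hypothetical and are not taken from the paperless-ng code base; the idea is only to pick the log level from the file extension, so a failed parse is a WARNING for .pdf files and a DEBUG message for everything else.

```python
import logging
from pathlib import Path
from typing import Callable, Optional

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("example.pdf_text")


def failure_log_level(path: Path) -> int:
    """Return WARNING for files named like PDFs, DEBUG for everything else."""
    return logging.WARNING if path.suffix.lower() == ".pdf" else logging.DEBUG


def try_extract_text(path: Path, extractor: Callable[[Path], str]) -> Optional[str]:
    """Run `extractor` on `path` and return its text, or None if it fails.

    `extractor` stands in for the real PDF text extraction call (an assumption).
    A failure is logged with a traceback, but only at WARNING level when the
    file has a .pdf extension.
    """
    try:
        return extractor(path)
    except Exception:
        logger.log(
            failure_log_level(path),
            "Could not extract text from %s as a PDF",
            path.name,
            exc_info=True,
        )
        return None
```

With the default WARNING threshold, a failed parse of "scan.jpg" would no longer show up in the journal, while a broken "scan.pdf" still would. Checking the file's content type instead of its extension would be a stricter alternative, but that goes beyond what is suggested here.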

To Reproduce

  1. Put a JPG file (with .jpg extension) in the consumption folder
  2. Run sudo journalctl -u paperless-scheduler and check the output

Expected behavior

Paperless doesn't log errors about failing to parse a non-PDF file as a PDF.

Webserver logs

Jun 05 14:37:04 png python3[72005]: [2021-06-05 20:37:04,132] [INFO] [paperless.consumer] Consuming 2021_06_05 12_58 PM Office Lens 1.jpg
Jun 05 14:37:04 png python3[72005]: [2021-06-05 20:37:04,239] [WARNING] [paperless.parsing.tesseract] Error while getting text from PDF document with pdfminer.s>
Jun 05 14:37:04 png python3[72005]: Traceback (most recent call last):
Jun 05 14:37:04 png python3[72005]:   File "/opt/paperless-ng/src/paperless_tesseract/parsers.py", line 120, in extract_text
Jun 05 14:37:04 png python3[72005]:     stripped = post_process_text(pdfminer_extract_text(pdf_file))
Jun 05 14:37:04 png python3[72005]:   File "/opt/paperless-ng/.venv/lib/python3.8/site-packages/pdfminer/high_level.py", line 114, in extract_text
Jun 05 14:37:04 png python3[72005]:     for page in PDFPage.get_pages(
Jun 05 14:37:04 png python3[72005]:   File "/opt/paperless-ng/.venv/lib/python3.8/site-packages/pdfminer/pdfpage.py", line 128, in get_pages
Jun 05 14:37:04 png python3[72005]:     doc = PDFDocument(parser, password=password, caching=caching)
Jun 05 14:37:04 png python3[72005]:   File "/opt/paperless-ng/.venv/lib/python3.8/site-packages/pdfminer/pdfdocument.py", line 596, in __init__
Jun 05 14:37:04 png python3[72005]:     raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
Jun 05 14:37:04 png python3[72005]: pdfminer.pdfparser.PDFSyntaxError: No /Root object! - Is this really a PDF?


rYR79435 commented 3 years ago

I can confirm this bug. It also happens for JPG files with a .jpeg extension.

Host OS of the machine running paperless: Ubuntu 20.04.2
Version: 1.4.4
Installation method: Docker

jonaswinkler commented 3 years ago

Should be fixed in 1.4.5.