It looks like Paperless tries to parse every file as a PDF. That's fine, but it generates a lot of log spew for non-PDF files.
Suggestion: Don't log warnings about failures to parse a PDF unless the file has a .pdf extension.
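To illustrate the suggestion, here is a rough sketch of what I have in mind. It is not the actual Paperless code: `looks_like_pdf` and `extract_text_or_none` are hypothetical names, and the extractor is passed in as a plain callable so the example does not depend on pdfminer. The idea is only that a parse failure is logged at WARNING level for files ending in .pdf and at DEBUG level for everything else.

```python
import logging
from pathlib import Path
from typing import Callable, Optional

# Logger name taken from the log output in this report.
logger = logging.getLogger("paperless.parsing.tesseract")


def looks_like_pdf(path: str) -> bool:
    """Return True if the file name ends in .pdf (case-insensitive)."""
    return Path(path).suffix.lower() == ".pdf"


def extract_text_or_none(
    path: str, extractor: Callable[[str], str]
) -> Optional[str]:
    """Run `extractor` on `path`; on failure, warn only for .pdf files.

    Both this function and `extractor` are hypothetical stand-ins for
    whatever the parser really calls.
    """
    try:
        return extractor(path)
    except Exception:
        # Non-PDF inputs are expected to fail here, so record them at
        # DEBUG level instead of WARNING to keep the journal quiet.
        level = logging.WARNING if looks_like_pdf(path) else logging.DEBUG
        logger.log(level, "Error while getting text from %s", path, exc_info=True)
        return None
```

With something like this, consuming a .jpg would still attempt the PDF text extraction, but the traceback would only show up when debug logging is enabled.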
To Reproduce
Put a JPG file (with .jpg extension) in the consumption folder
sudo journalctl -u paperless-scheduler
Expected behavior
Paperless doesn't generate errors about trying to parse the file as a PDF
Webserver logs
Jun 05 14:37:04 png python3[72005]: [2021-06-05 20:37:04,132] [INFO] [paperless.consumer] Consuming 2021_06_05 12_58 PM Office Lens 1.jpg
Jun 05 14:37:04 png python3[72005]: [2021-06-05 20:37:04,239] [WARNING] [paperless.parsing.tesseract] Error while getting text from PDF document with pdfminer.s>
Jun 05 14:37:04 png python3[72005]: Traceback (most recent call last):
Jun 05 14:37:04 png python3[72005]:   File "/opt/paperless-ng/src/paperless_tesseract/parsers.py", line 120, in extract_text
Jun 05 14:37:04 png python3[72005]:     stripped = post_process_text(pdfminer_extract_text(pdf_file))
Jun 05 14:37:04 png python3[72005]:   File "/opt/paperless-ng/.venv/lib/python3.8/site-packages/pdfminer/high_level.py", line 114, in extract_text
Jun 05 14:37:04 png python3[72005]:     for page in PDFPage.get_pages(
Jun 05 14:37:04 png python3[72005]:   File "/opt/paperless-ng/.venv/lib/python3.8/site-packages/pdfminer/pdfpage.py", line 128, in get_pages
Jun 05 14:37:04 png python3[72005]:     doc = PDFDocument(parser, password=password, caching=caching)
Jun 05 14:37:04 png python3[72005]:   File "/opt/paperless-ng/.venv/lib/python3.8/site-packages/pdfminer/pdfdocument.py", line 596, in __init__
Jun 05 14:37:04 png python3[72005]:     raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
Jun 05 14:37:04 png python3[72005]: pdfminer.pdfparser.PDFSyntaxError: No /Root object! - Is this really a PDF?
Relevant information
Host OS of the machine running paperless: Ubuntu 20.04
Browser: N/A
Version: 1.4.4
Installation method: bare metal (Ansible)
Any configuration changes you made in docker-compose.yml, docker-compose.env or paperless.conf: N/A
I can confirm this bug. It also happens for JPG files with a .jpeg extension.
Host OS of the machine running paperless: Ubuntu 20.04.2
Version: 1.4.4
Installation method: Docker
Describe the bug
First, great project! Thanks so much for making it, and I hope you consider allowing users to provide regular donations.