Open talan-z opened 2 years ago
pikepdf._qpdf.PdfError: /tmp/ocrmypdf.io.7_8hxrkb/origin.pdf: unable to find trailer dictionary while recovering damaged file
Looks like a broken PDF file to me.
That's correct, but it is not the issue I'm highlighting here. The parsing thread crashes as a result, and the file is simply "lost". That should not happen, and I consider it a bug.
Hello! First of all, paperless-ng is awesome! There seems to be a bug/issue in ocrmypdf that stops the parsing thread. As a result, the document is not added to paperless.
Ideally, the OCR should not fail. I understand that this is ocrmypdf's issue and not paperless's fault. However, as a result, paperless exits the parsing thread completely and does not even add the document to the database. This definitely should not happen.
Here's the error log:
`[2021-08-16 23:03:00,888] [INFO] [paperless.consumer] Consuming FeesReceipt70292.pdf
[2021-08-16 23:03:00,890] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-08-16 23:03:00,914] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-08-16 23:03:00,922] [DEBUG] [paperless.consumer] Parsing FeesReceipt70292.pdf...
[2021-08-16 23:03:01,144] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-mail-se_l776e
[2021-08-16 23:03:01,380] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-mail-se_l776e', 'output_file': '/tmp/paperless/paperless-0c2htufj/archive.pdf', 'use_threads': True, 'jobs': 2, 'language': 'eng+deu', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-0c2htufj/sidecar.txt'}
[2021-08-16 23:03:01,408] [WARNING] [paperless.parsing.tesseract] Encountered an error while running OCR: . Attempting force OCR to get the text.
[2021-08-16 23:03:01,409] [DEBUG] [paperless.parsing.tesseract] Fallback: Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-mail-af62hz1v', 'output_file': '/tmp/paperless/paperless-xm7g_mo6/archive-fallback.pdf', 'use_threads': True, 'jobs': 2, 'language': 'eng+deu', 'output_type': 'pdfa', 'progress_bar': False, 'force_ocr': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-xm7g_mo6/sidecar-fallback.txt'}
[2021-08-16 23:03:01,799] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-xm7g_mo6
[2021-08-16 23:03:01,807] [ERROR] [paperless.consumer] Error while consuming document FeesReceipt66482.pdf: InputFileError:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 163, in get_pdfinfo
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/pdfinfo/info.py", line 857, in init
File "/usr/local/lib/python3.7/site-packages/pikepdf/_methods.py", line 955, in open
pikepdf._qpdf.PdfError: /tmp/ocrmypdf.io.sm5oc8iv/origin.pdf: unable to find trailer dictionary while recovering damaged file
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 241, in parse
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/api.py", line 340, in ocr
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 365, in run_pipeline
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 168, in get_pdfinfo
ocrmypdf.exceptions.InputFileError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 163, in get_pdfinfo
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/pdfinfo/info.py", line 857, in init
File "/usr/local/lib/python3.7/site-packages/pikepdf/_methods.py", line 955, in open
pikepdf._qpdf.PdfError: /tmp/ocrmypdf.io.7_8hxrkb/origin.pdf: unable to find trailer dictionary while recovering damaged file
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 276, in parse
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/api.py", line 340, in ocr
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 365, in run_pipeline
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 168, in get_pdfinfo
ocrmypdf.exceptions.InputFileError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/src/paperless/src/documents/consumer.py", line 248, in try_consume_file
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 286, in parse
documents.parsers.ParseError: InputFileError:
[2021-08-16 23:03:01,942] [WARNING] [paperless.parsing.tesseract] Encountered an error while running OCR: . Attempting force OCR to get the text.
[2021-08-16 23:03:01,943] [DEBUG] [paperless.parsing.tesseract] Fallback: Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-mail-se_l776e', 'output_file': '/tmp/paperless/paperless-0c2htufj/archive-fallback.pdf', 'use_threads': True, 'jobs': 2, 'language': 'eng+deu', 'output_type': 'pdfa', 'progress_bar': False, 'force_ocr': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-0c2htufj/sidecar-fallback.txt'}
[2021-08-16 23:03:02,337] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-0c2htufj
[2021-08-16 23:03:02,346] [ERROR] [paperless.consumer] Error while consuming document FeesReceipt70292.pdf: InputFileError:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 163, in get_pdfinfo
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/pdfinfo/info.py", line 857, in init
File "/usr/local/lib/python3.7/site-packages/pikepdf/_methods.py", line 955, in open
pikepdf._qpdf.PdfError: /tmp/ocrmypdf.io.2nud4r6r/origin.pdf: unable to find trailer dictionary while recovering damaged file
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 241, in parse
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/api.py", line 340, in ocr
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 365, in run_pipeline
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 168, in get_pdfinfo
ocrmypdf.exceptions.InputFileError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 163, in get_pdfinfo
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/pdfinfo/info.py", line 857, in init
File "/usr/local/lib/python3.7/site-packages/pikepdf/_methods.py", line 955, in open
pikepdf._qpdf.PdfError: /tmp/ocrmypdf.io.g_erw2ru/origin.pdf: unable to find trailer dictionary while recovering damaged file
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 276, in parse
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/api.py", line 340, in ocr
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 365, in run_pipeline
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 168, in get_pdfinfo
ocrmypdf.exceptions.InputFileError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/src/paperless/src/documents/consumer.py", line 248, in try_consume_file
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 286, in parse
documents.parsers.ParseError: InputFileError:`
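For illustration, the behavior requested in this issue (keep the document even when parsing fails) could look roughly like the sketch below. This is not paperless-ng's actual consumer code: `StoredDocument`, `consume`, and `broken_parser` are hypothetical names, and the `ParseError` class is only a stand-in for the `documents.parsers.ParseError` that appears in the log.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class ParseError(Exception):
    """Stand-in for the documents.parsers.ParseError seen in the log."""


@dataclass
class StoredDocument:
    """Hypothetical record of what ends up in the database."""
    filename: str
    content: str                       # extracted text; empty if parsing failed
    parse_error: Optional[str] = None  # reason parsing failed, if it did


def consume(filename: str, parse: Callable[[str], str]) -> StoredDocument:
    """Hypothetical consumer step: store the document even if parsing fails."""
    try:
        text = parse(filename)
    except ParseError as exc:
        # Do not drop the file: keep it without text and record the reason.
        return StoredDocument(filename, content="", parse_error=str(exc))
    return StoredDocument(filename, content=text)


def broken_parser(filename: str) -> str:
    """Simulates a parser that fails on a damaged PDF."""
    raise ParseError("InputFileError: damaged PDF")


doc = consume("FeesReceipt70292.pdf", broken_parser)
print(doc.filename, repr(doc.content), doc.parse_error)
```

With this approach the damaged file would still show up in the document list, just without searchable text, and the recorded error would make it easy to find and repair later.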