jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

[BUG] Parsing crashes after forcing OCR and "loses" document #1235

Open talan-z opened 2 years ago

talan-z commented 2 years ago

Hello! First of all, paperless-ng is awesome! There seems to be a bug/issue in ocrmypdf that stops the parsing thread. As a result, the document is not added to paperless.

Ideally the OCR would not fail, but I understand that the failure comes from ocrmypdf and is not paperless' fault. The problem is what happens next: paperless exits the parsing thread completely and never adds the document to the database. This definitely should not happen.
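One way to get the behaviour asked for here is to catch the parser error at the consumer level and move the original file aside instead of dropping it. The sketch below is only an illustration under stated assumptions, not paperless-ng's actual consumer code: `ParseError` is a local stand-in for `documents.parsers.ParseError`, and `consume_safely`, `broken_parser`, and the `failed` folder are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Optional


class ParseError(Exception):
    """Local stand-in for documents.parsers.ParseError (assumption)."""


def consume_safely(
    path: Path, parse: Callable[[Path], str], failed_dir: Path
) -> Optional[str]:
    """Return the parsed text, or move the file to failed_dir if parsing fails."""
    try:
        return parse(path)
    except ParseError:
        # Keep the original so the user can inspect or re-submit it.
        failed_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(failed_dir / path.name))
        return None


def broken_parser(path: Path) -> str:
    # Hypothetical parser that always fails, mimicking the failure in this issue.
    raise ParseError("InputFileError: ")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "receipt.pdf"
        source.write_bytes(b"not a real pdf")
        result = consume_safely(source, broken_parser, Path(tmp) / "failed")
        print(result, (Path(tmp) / "failed" / "receipt.pdf").exists())
```

With a wrapper like this, a document that cannot be parsed ends up in a known location rather than disappearing together with its temporary directory.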

Here's the error log:

`[2021-08-16 23:03:00,888] [INFO] [paperless.consumer] Consuming FeesReceipt70292.pdf
[2021-08-16 23:03:00,890] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-08-16 23:03:00,914] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-08-16 23:03:00,922] [DEBUG] [paperless.consumer] Parsing FeesReceipt70292.pdf...
[2021-08-16 23:03:01,144] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-mail-se_l776e
[2021-08-16 23:03:01,380] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-mail-se_l776e', 'output_file': '/tmp/paperless/paperless-0c2htufj/archive.pdf', 'use_threads': True, 'jobs': 2, 'language': 'eng+deu', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-0c2htufj/sidecar.txt'}
[2021-08-16 23:03:01,408] [WARNING] [paperless.parsing.tesseract] Encountered an error while running OCR: . Attempting force OCR to get the text.
[2021-08-16 23:03:01,409] [DEBUG] [paperless.parsing.tesseract] Fallback: Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-mail-af62hz1v', 'output_file': '/tmp/paperless/paperless-xm7g_mo6/archive-fallback.pdf', 'use_threads': True, 'jobs': 2, 'language': 'eng+deu', 'output_type': 'pdfa', 'progress_bar': False, 'force_ocr': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-xm7g_mo6/sidecar-fallback.txt'}
[2021-08-16 23:03:01,799] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-xm7g_mo6
[2021-08-16 23:03:01,807] [ERROR] [paperless.consumer] Error while consuming document FeesReceipt66482.pdf: InputFileError:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 163, in get_pdfinfo
    executor=executor,
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/pdfinfo/info.py", line 857, in __init__
    with pikepdf.open(infile) as pdf:
  File "/usr/local/lib/python3.7/site-packages/pikepdf/_methods.py", line 955, in open
    access_mode=access_mode,
pikepdf._qpdf.PdfError: /tmp/ocrmypdf.io.sm5oc8iv/origin.pdf: unable to find trailer dictionary while recovering damaged file
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 241, in parse
    ocrmypdf.ocr(**args)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/api.py", line 340, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 365, in run_pipeline
    check_pages=options.pages,
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 168, in get_pdfinfo
    raise InputFileError() from e
ocrmypdf.exceptions.InputFileError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 163, in get_pdfinfo
    executor=executor,
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/pdfinfo/info.py", line 857, in __init__
    with pikepdf.open(infile) as pdf:
  File "/usr/local/lib/python3.7/site-packages/pikepdf/_methods.py", line 955, in open
    access_mode=access_mode,
pikepdf._qpdf.PdfError: /tmp/ocrmypdf.io.7_8hxrkb/origin.pdf: unable to find trailer dictionary while recovering damaged file
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 276, in parse
    ocrmypdf.ocr(**args)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/api.py", line 340, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 365, in run_pipeline
    check_pages=options.pages,
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 168, in get_pdfinfo
    raise InputFileError() from e
ocrmypdf.exceptions.InputFileError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/src/paperless/src/documents/consumer.py", line 248, in try_consume_file
    document_parser.parse(self.path, mime_type, self.filename)
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 286, in parse
    raise ParseError(f"{e.__class__.__name__}: {str(e)}")
documents.parsers.ParseError: InputFileError:
[2021-08-16 23:03:01,942] [WARNING] [paperless.parsing.tesseract] Encountered an error while running OCR: . Attempting force OCR to get the text.
[2021-08-16 23:03:01,943] [DEBUG] [paperless.parsing.tesseract] Fallback: Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-mail-se_l776e', 'output_file': '/tmp/paperless/paperless-0c2htufj/archive-fallback.pdf', 'use_threads': True, 'jobs': 2, 'language': 'eng+deu', 'output_type': 'pdfa', 'progress_bar': False, 'force_ocr': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-0c2htufj/sidecar-fallback.txt'}
[2021-08-16 23:03:02,337] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-0c2htufj
[2021-08-16 23:03:02,346] [ERROR] [paperless.consumer] Error while consuming document FeesReceipt70292.pdf: InputFileError:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 163, in get_pdfinfo
    executor=executor,
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/pdfinfo/info.py", line 857, in __init__
    with pikepdf.open(infile) as pdf:
  File "/usr/local/lib/python3.7/site-packages/pikepdf/_methods.py", line 955, in open
    access_mode=access_mode,
pikepdf._qpdf.PdfError: /tmp/ocrmypdf.io.2nud4r6r/origin.pdf: unable to find trailer dictionary while recovering damaged file
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 241, in parse
    ocrmypdf.ocr(**args)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/api.py", line 340, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 365, in run_pipeline
    check_pages=options.pages,
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 168, in get_pdfinfo
    raise InputFileError() from e
ocrmypdf.exceptions.InputFileError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 163, in get_pdfinfo
    executor=executor,
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/pdfinfo/info.py", line 857, in __init__
    with pikepdf.open(infile) as pdf:
  File "/usr/local/lib/python3.7/site-packages/pikepdf/_methods.py", line 955, in open
    access_mode=access_mode,
pikepdf._qpdf.PdfError: /tmp/ocrmypdf.io.g_erw2ru/origin.pdf: unable to find trailer dictionary while recovering damaged file
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 276, in parse
    ocrmypdf.ocr(**args)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/api.py", line 340, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 365, in run_pipeline
    check_pages=options.pages,
  File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 168, in get_pdfinfo
    raise InputFileError() from e
ocrmypdf.exceptions.InputFileError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/src/paperless/src/documents/consumer.py", line 248, in try_consume_file
    document_parser.parse(self.path, mime_type, self.filename)
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 286, in parse
    raise ParseError(f"{e.__class__.__name__}: {str(e)}")
documents.parsers.ParseError: InputFileError:`
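The WARNING and ERROR lines in this log end in an empty message (`InputFileError:`) because the exception is raised without arguments, as the `raise InputFileError() from e` frame shows; the useful text sits on the chained `PdfError`. Here is a minimal sketch of how a caller could walk the exception chain to surface that text. The two exception classes are local stand-ins for the pikepdf and ocrmypdf ones, and `describe_with_causes` and `failing_ocr` are hypothetical helpers, not part of paperless-ng or OCRmyPDF.

```python
from typing import List


class PdfError(Exception):
    """Stand-in for pikepdf's PdfError (assumption)."""


class InputFileError(Exception):
    """Stand-in for ocrmypdf.exceptions.InputFileError (assumption)."""


def describe_with_causes(exc: BaseException) -> str:
    """Join an exception and every chained cause into one readable message."""
    parts: List[str] = []
    seen = set()
    current = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))  # guard against cyclic chains
        name = type(current).__name__
        text = str(current)
        parts.append(f"{name}: {text}" if text else name)
        # Prefer the explicit cause ("raise ... from e"), else the implicit context.
        current = current.__cause__ or current.__context__
    return " <- ".join(parts)


def failing_ocr() -> None:
    # Reproduces the shape of the chain in the log: an empty error wrapping a detailed one.
    try:
        raise PdfError("unable to find trailer dictionary while recovering damaged file")
    except PdfError as e:
        raise InputFileError() from e


if __name__ == "__main__":
    try:
        failing_ocr()
    except InputFileError as err:
        print(describe_with_causes(err))
```

Logging a joined message of this kind would put the qpdf diagnosis on the same line as the consumer error instead of leaving it buried in the traceback.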

jonaswinkler commented 2 years ago

> pikepdf._qpdf.PdfError: /tmp/ocrmypdf.io.7_8hxrkb/origin.pdf: unable to find trailer dictionary while recovering damaged file

Looks like a broken PDF file to me.
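A rough pre-check for this kind of damage can be done without any PDF library. The sketch below is a heuristic only and not what qpdf or pikepdf do internally: it looks for the `%PDF-` header near the start of the file and for the `startxref` and `%%EOF` markers near the end, which a truncated file will usually lack. `looks_like_intact_pdf` and the 1024-byte window are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def looks_like_intact_pdf(path: Path, window: int = 1024) -> bool:
    """Heuristic: header near the start, startxref and %%EOF near the end."""
    data = path.read_bytes()
    head = data[:window]
    tail = data[-window:]
    return b"%PDF-" in head and b"startxref" in tail and b"%%EOF" in tail


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        truncated = Path(tmp) / "truncated.pdf"
        # A file cut off before its trailer section, as a stand-in for the broken receipt.
        truncated.write_bytes(b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\n")
        print(looks_like_intact_pdf(truncated))
```

A file that passes this check can still be damaged in other ways, so it only serves as a cheap first filter before handing the document to a real parser.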

talan-z commented 2 years ago

> pikepdf._qpdf.PdfError: /tmp/ocrmypdf.io.7_8hxrkb/origin.pdf: unable to find trailer dictionary while recovering damaged file
>
> Looks like a broken PDF file to me.

That's correct, but it is not the issue I am highlighting here. The parsing thread crashes as a result, and the file is simply "lost". That should not be the case and is a bug from my point of view.