[BUG] [ERROR] [paperless.consumer] Error while consuming document xxx.pdf: ValueError: overflow/underflow converting 2834645669291339000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 to 64-bit integer

watou commented 3 years ago

Describe the bug

Unable to consume an existing PDF due to content of the PDF file. No PDF viewer has ever indicated a problem with the file.

To Reproduce

Attempt to reproduce a PDF with similar contents that expose the failure.
Examine log to see the traceback.

Expected behavior

PDF file is consumed without complaint.

Webserver logs

Traceback (most recent call last):
  File "/app/paperless/src/paperless_tesseract/parsers.py", line 241, in parse
    ocrmypdf.ocr(**args)
  File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/api.py", line 340, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
  File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/_sync.py", line 359, in run_pipeline
    pdfinfo = get_pdfinfo(
  File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/_pipeline.py", line 157, in get_pdfinfo
    return PdfInfo(
  File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/pdfinfo/info.py", line 860, in __init__
    self._pages = _pdf_pageinfo_concurrent(
  File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/pdfinfo/info.py", line 644, in _pdf_pageinfo_concurrent
    executor(
  File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/_concurrent.py", line 82, in __call__
    self._execute(
  File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/builtin_plugins/concurrency.py", line 132, in _execute
    for result in results:
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/pdfinfo/info.py", line 601, in _pdf_pageinfo_sync
    page = PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)
  File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/pdfinfo/info.py", line 675, in __init__
    self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)
  File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/pdfinfo/info.py", line 721, in _gather_pageinfo
    for ci in _process_content_streams(
  File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/pdfinfo/info.py", line 521, in _process_content_streams
    contentsinfo = _interpret_contents(container, initial_shorthand)
  File "/usr/local/lib/python3.8/dist-packages/ocrmypdf/pdfinfo/info.py", line 161, in _interpret_contents
    pikepdf.parse_content_stream(contentstream, operator_whitelist)
  File "/usr/local/lib/python3.8/dist-packages/pikepdf/models/__init__.py", line 92, in parse_content_stream
    page._parse_page_contents_grouped(operators),

ValueError: overflow/underflow converting 2834645669291339000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 to 64-bit integer

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/app/paperless/src/documents/consumer.py", line 248, in try_consume_file
    document_parser.parse(self.path, mime_type, self.filename)
  File "/app/paperless/src/paperless_tesseract/parsers.py", line 290, in parse
    raise ParseError(f"{e.__class__.__name__}: {str(e)}")
documents.parsers.ParseError: ValueError: overflow/underflow converting 2834645669291339000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 to 64-bit integer

Relevant information

Synology DSM 6 / Docker / linuxserver/paperless_ng:latest
Firefox
Version 1.4.5
Installation method: docker
No configuration changes made in docker-compose.yml, docker-compose.env or paperless.conf.

iwconfig commented 3 years ago

I think you should open an issue over at jbarlow83/OCRmyPDF or maybe pikepdf/pikepdf, because this is most likely an issue on their end. You'll have a better chance of getting it solved there.

iwconfig commented 3 years ago

I'm not sure, but here is a comment that I believe relates to your issue: https://github.com/jbarlow83/OCRmyPDF/blob/73b8b88724aba1c71df04310e84b8f645b85d287/src/ocrmypdf/pdfinfo/info.py#L121-L145

Also, you wrote:

"No PDF viewer has ever indicated a problem with the file."

The comment mentions:

"According to the PDF specification, the maximum stack depth is 32. Other viewers tolerate some amount beyond this."

watou commented 3 years ago

@iwconfig thanks very much for the pointers; much appreciated. I won't personally be able to follow up those leads but hopefully this issue is hit in future searches if it comes up again. All the best!
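
For anyone who lands here from a search: a minimal sketch of how one might look for the kind of value named in the error message, assuming the overflow comes from an integer literal inside a page content stream. It only scans bytes that are already uncompressed (for example after rewriting the file with `qpdf --qdf`); `find_oversized_integers` and the regular expression are hypothetical helpers written for this illustration and are not part of paperless-ng, OCRmyPDF, or pikepdf.

```python
import re
from typing import List

# Range of a signed 64-bit integer, the type named in the error message.
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1

# An integer literal: optional sign followed by digits, not glued to a
# name, another number, or a decimal point (so 12.5 and /F1 are skipped).
_INT_TOKEN = re.compile(rb"(?<![\w.+\-/])[+-]?\d+(?![\w.])")


def find_oversized_integers(stream: bytes) -> List[int]:
    """Return integer literals in `stream` that do not fit in 64 bits.

    `stream` is assumed to be an already-decoded (uncompressed) content
    stream; this is a rough text scan, not a PDF parser.
    """
    oversized = []
    for match in _INT_TOKEN.finditer(stream):
        value = int(match.group().decode("ascii"))
        if not INT64_MIN <= value <= INT64_MAX:
            oversized.append(value)
    return oversized
```

Calling it on the decoded bytes of each page's content stream and printing any non-empty result would show whether a literal like the one in the traceback is present. Because it is a plain text scan, it can also match digits inside strings or inline image data, so any hit is a hint to inspect by hand rather than proof.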
