paperless-ngx / paperless-ngx

A community-supported supercharged version of paperless: scan, index and archive all your physical documents
https://docs.paperless-ngx.com
GNU General Public License v3.0

[BUG] Paper processing error: ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([]),) #2394

Closed dli7319 closed 1 year ago

dli7319 commented 1 year ago

Description

I am encountering errors when I try to upload some of my PDF files. The issue occurs on my self-hosted instance and on the demo instance.

Examples of two PDF files with issues:

  1. https://davidl.me/resources/papers/Li_Progressive_Multi_scale_Light_Field_Networks_3DV2022.pdf
  2. https://storage.googleapis.com/pub-tools-public-publication-data/pdf/3fca60c45241f0ed03e5e6eea0a49b932c0b0c10.pdf

Steps to reproduce

  1. Go to the dashboard
  2. Drag one of those PDFs to the upload area
  3. Wait for processing to fail

Webserver logs

[2023-01-09 20:51:47,714] [INFO] [paperless.consumer] Consuming 3fca60c45241f0ed03e5e6eea0a49b932c0b0c10.pdf

[2023-01-09 20:51:47,721] [DEBUG] [paperless.consumer] Detected mime type: application/pdf

[2023-01-09 20:51:47,728] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser

[2023-01-09 20:51:47,737] [DEBUG] [paperless.consumer] Parsing 3fca60c45241f0ed03e5e6eea0a49b932c0b0c10.pdf...

[2023-01-09 20:51:51,814] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-upload-oqpx_zxi

[2023-01-09 20:51:52,171] [DEBUG] [paperless.parsing.tesseract] Detected language en

[2023-01-09 20:51:52,242] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/tmp/paperless/paperless-upload-oqpx_zxi'), 'output_file': PosixPath('/tmp/paperless/paperless-flepir3y/archive.pdf'), 'use_threads': True, 'jobs': 6, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': PosixPath('/tmp/paperless/paperless-flepir3y/sidecar.txt')}

[2023-01-09 20:51:52,559] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-flepir3y

[2023-01-09 20:51:52,578] [ERROR] [paperless.consumer] Error while consuming document 3fca60c45241f0ed03e5e6eea0a49b932c0b0c10.pdf: ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([0, Decimal('1.0000001'), Decimal('407.24936'), Decimal('267.78995')]),)

Traceback (most recent call last):

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 321, in parse

    ocrmypdf.ocr(**args)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/api.py", line 332, in ocr

    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 378, in run_pipeline

    pdfinfo = get_pdfinfo(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_pipeline.py", line 165, in get_pdfinfo

    return PdfInfo(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 934, in __init__

    self._pages = _pdf_pageinfo_concurrent(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 711, in _pdf_pageinfo_concurrent

    executor(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_concurrent.py", line 87, in __call__

    self._execute(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 141, in _execute

    result = future.result()

  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 439, in result

    return self.__get_result()

  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result

    raise self._exception

  File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 58, in run

    result = self.fn(*self.args, **self.kwargs)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 668, in _pdf_pageinfo_sync

    page = PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 748, in __init__

    self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 794, in _gather_pageinfo

    for info in _process_content_streams(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 596, in _process_content_streams

    yield from _find_form_xobject_images(pdf, container, contentsinfo)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 543, in _find_form_xobject_images

    yield from _process_content_streams(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 588, in _process_content_streams

    contentsinfo = _interpret_contents(container, initial_shorthand)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 236, in _interpret_contents

    ctm = PdfMatrix(operands) @ ctm

  File "/usr/local/lib/python3.9/site-packages/pikepdf/models/matrix.py", line 56, in __init__

    raise ValueError('invalid arguments: ' + repr(args))

ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([0, Decimal('1.0000001'), Decimal('407.24936'), Decimal('267.78995')]),)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/usr/src/paperless/src/documents/consumer.py", line 337, in try_consume_file

    document_parser.parse(self.path, mime_type, self.filename)

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 379, in parse

    raise ParseError(f"{e.__class__.__name__}: {str(e)}") from e

documents.parsers.ParseError: ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([0, Decimal('1.0000001'), Decimal('407.24936'), Decimal('267.78995')]),)

[2023-01-09 20:52:36,443] [INFO] [paperless.consumer] Consuming Li_Progressive_Multi_scale_Light_Field_Networks_3DV2022.pdf

[2023-01-09 20:52:36,455] [DEBUG] [paperless.consumer] Detected mime type: application/pdf

[2023-01-09 20:52:36,464] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser

[2023-01-09 20:52:36,483] [DEBUG] [paperless.consumer] Parsing Li_Progressive_Multi_scale_Light_Field_Networks_3DV2022.pdf...

[2023-01-09 20:52:37,877] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-upload-l4alzhqz

[2023-01-09 20:52:38,213] [DEBUG] [paperless.parsing.tesseract] Detected language en

[2023-01-09 20:52:38,346] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/tmp/paperless/paperless-upload-l4alzhqz'), 'output_file': PosixPath('/tmp/paperless/paperless-wsw694ju/archive.pdf'), 'use_threads': True, 'jobs': 6, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': PosixPath('/tmp/paperless/paperless-wsw694ju/sidecar.txt')}

[2023-01-09 20:52:38,750] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-wsw694ju

[2023-01-09 20:52:38,770] [ERROR] [paperless.consumer] Error while consuming document Li_Progressive_Multi_scale_Light_Field_Networks_3DV2022.pdf: ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([]),)

Traceback (most recent call last):

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 321, in parse

    ocrmypdf.ocr(**args)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/api.py", line 332, in ocr

    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 378, in run_pipeline

    pdfinfo = get_pdfinfo(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_pipeline.py", line 165, in get_pdfinfo

    return PdfInfo(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 934, in __init__

    self._pages = _pdf_pageinfo_concurrent(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 711, in _pdf_pageinfo_concurrent

    executor(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_concurrent.py", line 87, in __call__

    self._execute(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 141, in _execute

    result = future.result()

  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 439, in result

    return self.__get_result()

  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result

    raise self._exception

  File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 58, in run

    result = self.fn(*self.args, **self.kwargs)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 668, in _pdf_pageinfo_sync

    page = PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 748, in __init__

    self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 794, in _gather_pageinfo

    for info in _process_content_streams(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 596, in _process_content_streams

    yield from _find_form_xobject_images(pdf, container, contentsinfo)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 543, in _find_form_xobject_images

    yield from _process_content_streams(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 588, in _process_content_streams

    contentsinfo = _interpret_contents(container, initial_shorthand)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 236, in _interpret_contents

    ctm = PdfMatrix(operands) @ ctm

  File "/usr/local/lib/python3.9/site-packages/pikepdf/models/matrix.py", line 56, in __init__

    raise ValueError('invalid arguments: ' + repr(args))

ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([]),)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/usr/src/paperless/src/documents/consumer.py", line 337, in try_consume_file

    document_parser.parse(self.path, mime_type, self.filename)

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 379, in parse

    raise ParseError(f"{e.__class__.__name__}: {str(e)}") from e

documents.parsers.ParseError: ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([]),)

[2023-01-09 20:53:32,071] [INFO] [paperless.consumer] Consuming 3fca60c45241f0ed03e5e6eea0a49b932c0b0c10.pdf

[2023-01-09 20:53:32,082] [DEBUG] [paperless.consumer] Detected mime type: application/pdf

[2023-01-09 20:53:32,089] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser

[2023-01-09 20:53:32,104] [DEBUG] [paperless.consumer] Parsing 3fca60c45241f0ed03e5e6eea0a49b932c0b0c10.pdf...

[2023-01-09 20:53:36,156] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-upload-rs4s6qnq

[2023-01-09 20:53:36,485] [DEBUG] [paperless.parsing.tesseract] Detected language en

[2023-01-09 20:53:36,550] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': PosixPath('/tmp/paperless/paperless-upload-rs4s6qnq'), 'output_file': PosixPath('/tmp/paperless/paperless-l8gl2ul_/archive.pdf'), 'use_threads': True, 'jobs': 6, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': PosixPath('/tmp/paperless/paperless-l8gl2ul_/sidecar.txt')}

[2023-01-09 20:53:36,871] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-l8gl2ul_

[2023-01-09 20:53:36,889] [ERROR] [paperless.consumer] Error while consuming document 3fca60c45241f0ed03e5e6eea0a49b932c0b0c10.pdf: ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([0, Decimal('1.0000001'), Decimal('407.24936'), Decimal('267.78995')]),)

Traceback (most recent call last):

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 321, in parse

    ocrmypdf.ocr(**args)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/api.py", line 332, in ocr

    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 378, in run_pipeline

    pdfinfo = get_pdfinfo(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_pipeline.py", line 165, in get_pdfinfo

    return PdfInfo(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 934, in __init__

    self._pages = _pdf_pageinfo_concurrent(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 711, in _pdf_pageinfo_concurrent

    executor(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_concurrent.py", line 87, in __call__

    self._execute(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 141, in _execute

    result = future.result()

  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 439, in result

    return self.__get_result()

  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result

    raise self._exception

  File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 58, in run

    result = self.fn(*self.args, **self.kwargs)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 668, in _pdf_pageinfo_sync

    page = PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 748, in __init__

    self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 794, in _gather_pageinfo

    for info in _process_content_streams(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 596, in _process_content_streams

    yield from _find_form_xobject_images(pdf, container, contentsinfo)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 543, in _find_form_xobject_images

    yield from _process_content_streams(

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 588, in _process_content_streams

    contentsinfo = _interpret_contents(container, initial_shorthand)

  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 236, in _interpret_contents

    ctm = PdfMatrix(operands) @ ctm

  File "/usr/local/lib/python3.9/site-packages/pikepdf/models/matrix.py", line 56, in __init__

    raise ValueError('invalid arguments: ' + repr(args))

ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([0, Decimal('1.0000001'), Decimal('407.24936'), Decimal('267.78995')]),)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/usr/src/paperless/src/documents/consumer.py", line 337, in try_consume_file

    document_parser.parse(self.path, mime_type, self.filename)

  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 379, in parse

    raise ParseError(f"{e.__class__.__name__}: {str(e)}") from e

documents.parsers.ParseError: ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([0, Decimal('1.0000001'), Decimal('407.24936'), Decimal('267.78995')]),)

Browser logs

No response

Paperless-ngx version

1.11.3

Host OS

Ubuntu 22.04 Server

Installation method

Docker - official image

Browser

No response

Configuration changes

No response

Other

No response

dli7319 commented 1 year ago

Possibly the same issue as jonaswinkler/paperless-ng/issues/1151

stumpylog commented 1 year ago

This appears to be an error in either pikepdf or OCRmyPDF, but probably the former. There's nothing we're able to do about it, so I'd suggest opening an issue upstream.
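
For anyone filing the upstream report: both tracebacks end in `PdfMatrix(operands)`, and the operand lists in the two errors have four and zero entries, whereas a PDF transformation matrix (the `cm` operator) takes six numbers. The sketch below is not pikepdf or OCRmyPDF code. `cm_operands_to_matrix` is a hypothetical helper that only illustrates the kind of length check that would let a parser skip such malformed operands instead of raising, assuming the content streams really do contain a short `cm` operand list.

```python
from decimal import Decimal
from typing import Optional, Sequence, Tuple

Row = Tuple[float, float, float]
Matrix = Tuple[Row, Row, Row]


def cm_operands_to_matrix(operands: Sequence) -> Optional[Matrix]:
    """Build the 3x3 matrix for a PDF `cm` operator (hypothetical helper).

    A well-formed `cm` has six numeric operands a b c d e f, which map to
    [[a, b, 0], [c, d, 0], [e, f, 1]]. Anything else yields None so the
    caller can ignore the operator rather than fail the whole document.
    """
    if len(operands) != 6:
        return None
    try:
        a, b, c, d, e, f = (float(x) for x in operands)
    except (TypeError, ValueError):
        # Non-numeric operand: treat as malformed as well.
        return None
    return ((a, b, 0.0), (c, d, 0.0), (e, f, 1.0))


# Operand lists as they appear in the two error messages above.
four_operands = [
    0,
    Decimal("1.0000001"),
    Decimal("407.24936"),
    Decimal("267.78995"),
]
no_operands = []

for ops in (four_operands, no_operands, [1, 0, 0, 1, 10, 20]):
    print(len(ops), cm_operands_to_matrix(ops))
```

Both operand lists from the logs are rejected by the length check, while a six-number list produces a matrix. Whether skipping the operator is the right behaviour is a decision for the upstream maintainers.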

You can also try the usual fixes:
