ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[Bug]: ValueError: ObjectList must have 6 elements #1303

Status: Closed (macdeport closed this issue 5 months ago)

macdeport commented 5 months ago

Describe the bug

I installed OCRmyPDF through MacPorts (which, by the way, is not offered in the drop-down menu below). Running ocrmypdf bid\$.pdf bid_.pdf crashes on this particular file, bid$.pdf.

Steps to reproduce

1. Run ocrmypdf -v 2 bid\$.pdf bid_.pdf
2. OCRmyPDF crashes on this particular file, bid$.pdf.

Files

bid-240430.json OCRmyPDF-bug-scan-240430-1432

How did you download and install the software?

PyPI (pip, poetry, pipx, etc.)

OCRmyPDF version

ocrmypdf 16.2.0

Relevant log output

Python 3.11.9 / ocrmypdf version=16.2.0

Running: ['tesseract', '--version']
Found tesseract 5.3.3
Running: ['tesseract', '--version']
Running: ['gs', '--version']
Found gs 10.3.0
Running: ['gs', '--version']
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages in "/opt/local/share/tessdata/" (4):
deu
eng
fra
osd

pikepdf mmap enabled
os.symlink(37rv-3g-12-central-diag--diagnostics-crep-dpe-190619.pdf, /var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.yxntmva9/origin)
os.symlink(/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.yxntmva9/origin, /var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.yxntmva9/origin.pdf)
pikepdf mmap enabled
Scanning contents     ━━━━━━╸                                  18%  5/28 0:00:04
Traceback (most recent call last):
  File "/Users/alain/Documents/Logiciels/Developpement/py-km-pathfinder-selection/pathfinder-selection-ocred-pdf-compress.py", line 794, in <module>
    ocrmypdf.ocr(fn_in,fn_out,
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/api.py", line 380, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/_pipelines/ocr.py", line 224, in run_pipeline
    return _run_pipeline(options, plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/_pipelines/ocr.py", line 175, in _run_pipeline
    pdfinfo = get_pdfinfo(
              ^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/_pipeline.py", line 186, in get_pdfinfo
    return PdfInfo(
           ^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/pdfinfo/info.py", line 1118, in __init__
    self._pages = _pdf_pageinfo_concurrent(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/pdfinfo/info.py", line 777, in _pdf_pageinfo_concurrent
    executor(
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__
    self._execute(
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 144, in _execute
    result = future.result()
             ^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/pdfinfo/info.py", line 726, in _pdf_pageinfo_sync
    return PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/pdfinfo/info.py", line 841, in __init__
    self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/pdfinfo/info.py", line 892, in _gather_pageinfo
    for info in _process_content_streams(
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/pdfinfo/info.py", line 630, in _process_content_streams
    contentsinfo = _interpret_contents(container, initial_shorthand)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/ocrmypdf/pdfinfo/info.py", line 242, in _interpret_contents
    ctm = Matrix(operands) @ ctm
          ^^^^^^^^^^^^^^^^
ValueError: ObjectList must have 6 elements

jbarlow83 commented 5 months ago

Thanks for attaching the file. While it's sometimes possible to identify an issue by looking at QPDF JSON, in this particular case the issue involves data in the original PDF. The original PDF is also probably malformed: it looks like there is a content stream that does not supply the required number of elements for a matrix, so at least some portion of it isn't going to render correctly.

You could try using Ghostscript to rewrite the PDF. It may be able to correct the issue or discard the malformed content: gs -q -sDEVICE=pdfwrite -o out.pdf in.pdf
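
For anyone calling OCRmyPDF from a script, the Ghostscript rewrite suggested here could be wrapped roughly as follows. This is only a sketch: the helper names `build_gs_rewrite_command` and `rewrite_pdf` are hypothetical, the `gs` binary is assumed to be on PATH, and a clean exit from Ghostscript does not guarantee that the rewritten file is free of the malformed content.

```python
import shutil
import subprocess
from pathlib import Path


def build_gs_rewrite_command(src, dst, gs_binary="gs"):
    """Return the argv list for: gs -q -sDEVICE=pdfwrite -o dst src."""
    return [gs_binary, "-q", "-sDEVICE=pdfwrite", "-o", str(dst), str(src)]


def rewrite_pdf(src, dst, gs_binary="gs"):
    """Run Ghostscript to rewrite src into dst.

    Returns True if Ghostscript exited with status 0 and dst exists.
    (Hypothetical helper; not part of OCRmyPDF.)
    """
    if shutil.which(gs_binary) is None:
        raise FileNotFoundError(f"{gs_binary!r} was not found on PATH")
    cmd = build_gs_rewrite_command(src, dst, gs_binary)
    completed = subprocess.run(cmd, capture_output=True, text=True)
    return completed.returncode == 0 and Path(dst).exists()
```

If `rewrite_pdf("in.pdf", "out.pdf")` returns True, `out.pdf` can then be passed to OCRmyPDF in place of the original file.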

macdeport commented 5 months ago

test.zip You are welcome. Thank you.

jbarlow83 commented 5 months ago

The PDF has many errors and there's no way to recover it. At the point of failure, there is supposed to be a 6-element coordinate matrix that sets up what to draw next, but only 3 elements are present. There's just no way to know what is supposed to happen.

I added a more descriptive error message.
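
To illustrate the failure described above, here is a minimal, library-free sketch of checking the operands of a PDF `cm` operator before concatenating them onto the current transformation matrix. It is not OCRmyPDF's or pikepdf's actual code: `check_cm_operands` and `concat` are hypothetical names, and matrices are plain 6-tuples (a, b, c, d, e, f).

```python
def check_cm_operands(operands):
    """Validate the operands of a `cm` operator: exactly six numbers.

    Hypothetical helper; raises ValueError with a descriptive message.
    """
    if len(operands) != 6:
        raise ValueError(
            "cm operator expects 6 operands (a b c d e f), "
            f"got {len(operands)}: {list(operands)!r}"
        )
    return tuple(float(x) for x in operands)


def concat(m, n):
    """Return the product m x n of two PDF matrices given as 6-tuples.

    Each tuple (a, b, c, d, e, f) stands for the 3x3 matrix
    [[a, b, 0], [c, d, 0], [e, f, 1]].
    """
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )
```

With only three operands, as in this file, `check_cm_operands` raises a `ValueError` that names the operator and shows the operands it received. There is no sensible default for the missing values, so the page cannot be interpreted past that point.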