ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[BUG] crashes with `TypeError: 'NoneType' object is not subscriptable` #1075

Closed frrad closed 1 year ago

frrad commented 1 year ago

Describe the bug: ocrmypdf crashes with `TypeError: 'NoneType' object is not subscriptable`

To Reproduce

```
ocrmypdf 14.0.3.dev5+g9d5fa05a.d20230215
Running: ['tesseract', '--version']
Found tesseract 5.3.0-31-g9d71
Running: ['tesseract', '--version']
Running: ['gs', '--version']
Found gs 9.55.0
Running: ['gs', '--version']
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (7):
chi_sim
deu
eng
fra
osd
por
spa

reading file from standard input
os.symlink(/tmp/ocrmypdf.io.yddbmk4e/stdin, /tmp/ocrmypdf.io.yddbmk4e/origin.pdf)
An exception occurred while executing the pipeline
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/_sync.py", line 378, in run_pipeline
    pdfinfo = get_pdfinfo(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/_pipeline.py", line 165, in get_pdfinfo
    return PdfInfo(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 932, in __init__
    self._pages = _pdf_pageinfo_concurrent(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 709, in _pdf_pageinfo_concurrent
    executor(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/_concurrent.py", line 87, in __call__
    self._execute(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/builtin_plugins/concurrency.py", line 141, in _execute
    result = future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 666, in _pdf_pageinfo_sync
    page = PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 746, in __init__
    self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 792, in _gather_pageinfo
    for info in _process_content_streams(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 594, in _process_content_streams
    yield from _find_form_xobject_images(pdf, container, contentsinfo)
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 528, in _find_form_xobject_images
    if candidate['/Subtype'] != '/Form':
TypeError: 'NoneType' object is not subscriptable
```
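
The last frame suggests that `candidate` was `None` when it was subscripted, which is what happens when an XObject entry resolves to a null object. The sketch below only illustrates that failure mode and one defensive way around it. It is not the patch from the linked pull request: plain dictionaries stand in for PDF objects, and `iter_form_xobjects` and the sample `resources` mapping are hypothetical names made up for the example.

```python
from typing import Any, Iterator, Mapping, Optional, Tuple


def iter_form_xobjects(
    xobjects: Mapping[str, Optional[Mapping[str, Any]]],
) -> Iterator[Tuple[str, Mapping[str, Any]]]:
    """Yield (name, xobject) pairs for Form XObjects, skipping null entries."""
    for name, candidate in xobjects.items():
        # A null entry shows up as None; subscripting it would raise
        # "TypeError: 'NoneType' object is not subscriptable".
        if candidate is None:
            continue
        # .get() also tolerates entries that lack a /Subtype key.
        if candidate.get('/Subtype') != '/Form':
            continue
        yield name, candidate


# Hypothetical resource dictionary with one broken (null) entry.
resources = {
    '/Im0': {'/Subtype': '/Image'},
    '/Fm0': {'/Subtype': '/Form'},
    '/Fm1': None,
}
form_names = [name for name, _ in iter_form_xobjects(resources)]
print(form_names)  # ['/Fm0']
```

Skipping the null entry keeps the scan going over the remaining XObjects instead of aborting the whole pipeline on one malformed reference.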

Example file file.zip

Expected behavior: ocrmypdf processes the file without crashing.

System

Fixed by https://github.com/ocrmypdf/OCRmyPDF/pull/1066

jbarlow83 commented 1 year ago

Fixed in 14.0.5