ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

KeyError: '/Contents' #154

Closed feinerer closed 7 years ago

feinerer commented 7 years ago
$ docker run --rm -v "$(pwd):/home/docker"   ocrmypdf --skip-text input.pdf output.pdf
   INFO - Tesseract v4.x.alpha found. OCRmyPDF support is experimental.
  ERROR - Traceback (most recent call last):
  File "/appenv/lib/python3.5/site-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/appenv/lib/python3.5/site-packages/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/appenv/lib/python3.5/site-packages/ocrmypdf/pipeline.py", line 183, in repair_pdf
    pdfinfo = pdf_get_all_pageinfo(output_file)
  File "/appenv/lib/python3.5/site-packages/ocrmypdf/pageinfo.py", line 524, in pdf_get_all_pageinfo
    return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
  File "/appenv/lib/python3.5/site-packages/ocrmypdf/pageinfo.py", line 524, in <listcomp>
    return [_pdf_get_pageinfo(infile, n) for n in range(pdf.numPages)]
  File "/appenv/lib/python3.5/site-packages/ocrmypdf/pageinfo.py", line 496, in _pdf_get_pageinfo
    pageinfo['has_text'] = _page_has_text(pdf, page)
  File "/appenv/lib/python3.5/site-packages/ocrmypdf/pageinfo.py", line 468, in _page_has_text
    text = page.extractText()
  File "/appenv/lib/python3.5/site-packages/PyPDF2/pdf.py", line 2593, in extractText
    content = self["/Contents"].getObject()
  File "/appenv/lib/python3.5/site-packages/PyPDF2/generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
KeyError: '/Contents'

This happens on a Debian Jessie system running the latest Docker container (see above command line).

Unfortunately I cannot include the corresponding PDF as it contains private information.

If you need further information, please give me instructions in order to help you debug this issue. Thank you!

jbarlow83 commented 7 years ago

I'll add a check for this case in the next release.

The PDF is missing a data field (the page's /Contents entry) that is strictly optional but almost never omitted, and the third-party PyPDF2 library does not handle its absence.
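A minimal sketch of what such a check could look like, not the actual change that will land in OCRmyPDF. It assumes the page object behaves like a dict (as the traceback suggests) and has the `extractText()` method seen above. The names `page_has_text_safe` and `FakePage` are hypothetical and exist only for illustration.

```python
def page_has_text_safe(page):
    """Return True if the page yields any non-whitespace text.

    Hypothetical guard: a page without a /Contents entry has no content
    stream, so it cannot contain text and extractText() is never called.
    """
    if '/Contents' not in page:
        return False
    return bool(page.extractText().strip())


class FakePage(dict):
    """Illustrative stand-in for a PyPDF2 page object (an assumption)."""

    def extractText(self):
        # Mimics the failure in the traceback: raises KeyError
        # when the /Contents key is absent.
        return str(self['/Contents'])


if __name__ == '__main__':
    print(page_has_text_safe(FakePage()))                         # no /Contents
    print(page_has_text_safe(FakePage({'/Contents': 'Hello'})))   # has text
```

With this guard, a page lacking /Contents is simply reported as having no text instead of raising KeyError.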

Try re-frying the PDF with Ghostscript, as this will likely insert the expected object. Note that this constructs a visually identical PDF but re-encodes JPEGs in the process.

gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=out.pdf in.pdf

feinerer commented 7 years ago

Confirmed: using Ghostscript to rewrite the PDF suffices so that PyPDF2 can handle it.

A direct check in OCRmyPDF would be appreciated, to avoid the manual Ghostscript call.