ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

A set of failing PDFs #325

Closed (gwern closed this issue 5 years ago)

gwern commented 5 years ago

I recently used ocrmypdf to mass-OCR my PDFs and a bunch of DjVu files I converted to PDF (which strips the original Tesseract OCR so I needed some way to restore it). Worked very nicely, and I like the better compression over the default ddjvu output.

Some files failed. I noticed the mention of a test corpus, so I thought you might like a list of failing files (these failed multiple times, so should be reliable test cases) and the errors.

The errors:

myocr-gwernnet-errors.txt

The files:

jbarlow83 commented 5 years ago

I will take a look.

Do you know which version of ocrmypdf you used? The stack traces appear to be from an older version.

gwern commented 5 years ago

Whatever Ubuntu 18.04.1 ships, which appears to be '6.1.2-1ubuntu1.1', or '6.1.2' according to --version.

jbarlow83 commented 5 years ago

Please try the latest released version. There is an installation procedure in the documentation specifically for Ubuntu 18.04. I suspect that will fix many of these errors.

gwern commented 5 years ago

Upgrading to 7.3.1 does fix many of the errors. What's still left:

ocrmypdf-gwernnet-errors2.log

jbarlow83 commented 5 years ago

The problem is quite definitely how these files are formatted. In any case, the next release should be more tolerant of PDFs with these types of errors - it will issue warnings instead.

I went by the logs and concluded that the errors are the same for the most part.

gwern commented 5 years ago

That's good to hear. I hope they'll be good test cases for the next release, then.

ivsanro1 commented 5 years ago

I found another error. Unfortunately, I cannot upload the PDF file because it contains personal data, and I do not know how to reproduce the error with a handcrafted PDF file. It seems to be a problem with the internal structure of the PDF file. This is the stack trace of the error:

  File "/usr/local/lib/python3.5/dist-packages/ruffus/task.py", line 712, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/usr/local/lib/python3.5/dist-packages/ruffus/task.py", line 544, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/_pipeline.py", line 170, in repair_and_parse_pdf
    pdfinfo = PdfInfo(output_file, detailed_page_analysis=detailed_page_analysis, log=log)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 722, in __init__
    infile, detailed_page_analysis, log=log)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 604, in _pdf_get_all_pageinfo
    page = PageInfo(pdf, n, infile, page_xml, detailed_analysis)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 614, in __init__
    self._pageinfo = _pdf_get_pageinfo(pdf, pageno, infile, xmltext)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 571, in _pdf_get_pageinfo
    shorthand=userunit_shorthand)]
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 569, in <listcomp>
    contentsinfo = [ci for ci in
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 485, in _process_content_streams
    yield from _find_regular_images(container, contentsinfo)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 390, in _find_regular_images
    for pdfimage, xobj in _image_xobjects(container):
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 376, in _image_xobjects
    if candidate['/Subtype'] == '/Image':
TypeError: 'NoneType' object is not subscriptable

jbarlow83 commented 5 years ago

The wiki has instructions for encrypting a file for me only if you are comfortable with that. https://github.com/jbarlow83/OCRmyPDF/wiki

ivsanro1 commented 5 years ago

> The wiki has instructions for encrypting a file for me only if you are comfortable with that. https://github.com/jbarlow83/OCRmyPDF/wiki

I am afraid I cannot do that, sorry. The document itself pertains to a third-party organization, and the personal info is not mine. I can check why it fails with pdb if it helps.

Thanks
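
When the file itself cannot be shared, one option is to report only the local variables of the frame where the exception was raised. The following is a rough sketch of that idea, not anything from OCRmyPDF: `innermost_locals` and `failing_lookup` are hypothetical names, and plain dicts stand in for the PDF objects seen in the stack trace.

```python
def innermost_locals(exc):
    """Return (function name, line number, locals) for the frame that raised exc.

    Values are reduced to short repr() strings so they can be reviewed
    for private data before being pasted into a bug report.
    """
    tb = exc.__traceback__
    if tb is None:
        raise ValueError("exception has no traceback")
    # Walk to the last traceback entry, which is the frame that raised.
    while tb.tb_next is not None:
        tb = tb.tb_next
    frame = tb.tb_frame
    summary = {name: repr(value)[:80] for name, value in frame.f_locals.items()}
    return frame.f_code.co_name, tb.tb_lineno, summary


def failing_lookup(xobjects):
    # Hypothetical stand-in for the loop in the stack trace: it subscripts
    # every entry and therefore fails on a None entry.
    names = []
    for name, candidate in xobjects.items():
        if candidate['/Subtype'] == '/Image':
            names.append(name)
    return names


try:
    failing_lookup({'/Im0': {'/Subtype': '/Image'}, '/Im1': None})
except TypeError as exc:
    func, lineno, local_vars = innermost_locals(exc)
    print(func, lineno, local_vars)
```

For an interactive session, `pdb.post_mortem(exc.__traceback__)` from the standard library opens the debugger at the same frame.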

jbarlow83 commented 5 years ago

This is probably fixed, or at least the immediate cause of the stack trace is suppressed, in the next release.
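
For illustration, suppressing the immediate cause of the `TypeError` could look something like the sketch below. It is an assumption about the general approach, not the actual change made in OCRmyPDF: `image_xobjects` is a hypothetical function, plain dicts stand in for PDF dictionary objects, and the logger name is made up.

```python
import logging

log = logging.getLogger("pdfinfo_sketch")  # hypothetical logger name


def image_xobjects(xobjects):
    """Yield (name, xobject) pairs for image XObjects.

    Entries that are None (for example, a reference to a missing object)
    are skipped with a warning instead of raising TypeError.
    """
    for name, candidate in xobjects.items():
        if candidate is None:
            log.warning("XObject %s is missing or null; skipping", name)
            continue
        # .get() also tolerates entries that have no /Subtype key at all.
        if candidate.get('/Subtype') == '/Image':
            yield name, candidate


found = list(image_xobjects({
    '/Im0': {'/Subtype': '/Image'},
    '/Im1': None,                     # malformed entry, as in the stack trace
    '/Fm0': {'/Subtype': '/Form'},    # not an image, ignored
}))
```

Skipping the entry and logging a warning matches the behaviour described earlier in the thread, where malformed PDFs produce warnings rather than hard failures.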