Closed by gwern 5 years ago
I will take a look.
Do you know what version of ocrmypdf you used? The stack traces appear to be from an older version.
Whatever Ubuntu 18.04.1 ships, which appears to be '6.1.2-1ubuntu1.1', or '6.1.2' from --version.
Please try the latest released version. There is an installation procedure in the documentation specifically for Ubuntu 18.04. I suspect that will fix many of these errors.
Upgrading to 7.3.1 does fix many of the errors. What's still left:
The problem is quite definitely how these files are formatted. In any case, the next release should be more tolerant of PDFs with these types of errors - it will issue warnings instead.
I went by the logs and concluded the errors are the same for the most part.
That's good to hear. I hope they'll be good test cases for the next release, then.
I found another error. Unfortunately, I cannot upload the pdf file, because it has personal data, and I do not know how to reproduce the error by creating a handcrafted pdf file. It seems to be a problem of the internal structure of the pdf file. This is the stacktrace of the error:
  File "/usr/local/lib/python3.5/dist-packages/ruffus/task.py", line 712, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/usr/local/lib/python3.5/dist-packages/ruffus/task.py", line 544, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/_pipeline.py", line 170, in repair_and_parse_pdf
    pdfinfo = PdfInfo(output_file, detailed_page_analysis=detailed_page_analysis, log=log)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 722, in __init__
    infile, detailed_page_analysis, log=log)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 604, in _pdf_get_all_pageinfo
    page = PageInfo(pdf, n, infile, page_xml, detailed_analysis)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 614, in __init__
    self._pageinfo = _pdf_get_pageinfo(pdf, pageno, infile, xmltext)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 571, in _pdf_get_pageinfo
    shorthand=userunit_shorthand)]
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 569, in <listcomp>
    contentsinfo = [ci for ci in
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 485, in _process_content_streams
    yield from _find_regular_images(container, contentsinfo)
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 390, in _find_regular_images
    for pdfimage, xobj in _image_xobjects(container):
  File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pdfinfo/__init__.py", line 376, in _image_xobjects
    if candidate['/Subtype'] == '/Image':
TypeError: 'NoneType' object is not subscriptable
The wiki has instructions for encrypting a file for me only if you are comfortable with that. https://github.com/jbarlow83/OCRmyPDF/wiki
I am afraid I cannot do that, sorry. The document itself pertains to a third-party organization, and the personal info is not mine. I can check why it fails with pdb if it helps.
Thanks
Probably fixed this, or at least suppressed the immediate cause of the stack trace, in the next release.
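For anyone reading the trace later: the immediate failure is a `None` value being subscripted inside `_image_xobjects`, which suggests an XObject entry that did not resolve to a real object. The sketch below illustrates the general kind of guard that avoids this, using plain dicts to stand in for PDF objects. It is not the actual OCRmyPDF code or fix; `iter_image_xobjects`, the logger name, and the sample data are all hypothetical.

```python
import logging

log = logging.getLogger("pdf_sketch")  # hypothetical logger name


def iter_image_xobjects(xobjects):
    """Yield (name, xobj) for image XObjects, skipping malformed entries.

    `xobjects` is a plain dict standing in for a page's /XObject resource
    dictionary. A value of None models an entry that failed to resolve.
    """
    for name, candidate in xobjects.items():
        if candidate is None:
            # An unguarded candidate['/Subtype'] would raise TypeError here,
            # so warn and move on instead of aborting the whole file.
            log.warning("skipping unresolvable XObject %s", name)
            continue
        if candidate.get('/Subtype') == '/Image':
            yield name, candidate


# Assumed sample data: one image, one broken entry, one non-image form.
sample = {
    '/Im0': {'/Subtype': '/Image', '/Width': 100},
    '/Broken': None,
    '/Fm0': {'/Subtype': '/Form'},
}
images = list(iter_image_xobjects(sample))
print(images)
```

Warning and skipping, rather than raising, matches the behaviour described earlier in the thread for malformed PDFs, though whether the real release handles this particular case the same way is an assumption.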
I recently used ocrmypdf to mass-OCR my PDFs and a bunch of DjVu files I converted to PDF (which strips the original Tesseract OCR, so I needed some way to restore it). Worked very nicely, and I like the better compression over the default ddjvu output.
Some files failed. I noticed the mention of a test corpus, so I thought you might like a list of failing files (these failed multiple times, so should be reliable test cases) and the errors.
The errors:
myocr-gwernnet-errors.txt
The files: