PDF Files created with wkhtmltopdf cannot have their outline parsed

learningequality / ricecooker

Python library for creating Kolibri channels and uploading to Studio

MIT License

18 stars 53 forks source link

Description

Because of an extant bug in PyPDF2 https://github.com/mstamy2/PyPDF2/issues/193 trying to read the outline for a file generated in wkhtmltopdf results in an error. This means that large PDF files generated in this way cannot be split up into smaller PDFs.

Potential resolutions

The issue contains a way to bypass this issue by patching the library - but it's not clear whether it actually fixes the problem, or just prevents a blocking error. I suspect that it just lets the PDF to continue to be parsed, but that we would not get the outline data - which we need in order to do the by chapter and subchapter splitting that we are using PyPDF2 for.
Doing a more systematic fix on this would potentially involve forking and diving deeper into PyPDF2, as there appears to be no current maintenance of the library.
There is another Python library https://github.com/pdfminer/pdfminer.six that is actively maintained - but it appears that it only extracts layout metadata. In order to use this, we would have to migrate the outline reading code to use pdfminer.six, but still continue to use PyPDF2 for the actual page splitting.
Another alternative is to use the pdftk binary https://www.pdflabs.com/tools/pdftk-server/ - while there is a Python wrapper library https://github.com/revolunet/pypdftk for this, it has a very limited API and does not expose the tools we need. However, pdftk is able to read the entire outline and dump all the relevant information to JSON, which we could then read back into Python. We could then also use pdftk to do the splitting as well.

In terms of ongoing maintainability, I think option 4 is probably the most promising bet, as we are leveraging a widely used open source binary to handle the complex business of PDF parsing. One possible upside is that whereas PyPDF2 failed to preserve accessibility focused aspects of PDFs during the splitting, pdftk might do a better job of this.

We would have to add good graceful failure and the ability to configure the path to the pdftk binary in order to leverage this.

Next steps

First we would need to sanity check that pdftk run from the command line can handle wkhtmltopdf generated PDFs and spit out their outlines and split them.

If this does indeed work, to move forward with this strategy, we would have to remove PyPDF2 and replace operations with subprocess calls to operations in pdftk: https://www.pdflabs.com/docs/pdftk-man-page/

The get_toc method would now leverage the pdftk dump_data command. Writing the JSON out to a temporary file in a subprocess, reading that JSON back into Python and then cleaning up the temporary file.
The split_chapters and split_subchapters methods would now use the pdftk cat command and pass the page ranges parsed from get_toc to generate the new file.

Because of deferring to pdftk for the PDF processing, most of the other logic in the PDFParser class could be removed.

I've done some digging here and want to update the issue.

Because of an extant bug in PyPDF2 mstamy2/PyPDF2#193 trying to read the outline for a file generated in wkhtmltopdf results in an error.

This isn't the issue - the issue is that wkhtmltopdf does not generate a meaningful metadata about bookmarks even when it successfully gleans that they exist during parsing. wkhtmltopdf seems to be able to generate the metadata manually, but only from a file generated and tweaked separately. Ultimately, the only metadata we get from a wkhtmltopdf file is the hierarchical structure of bookmarks/outline items (ie, which of the chapters are subchapters of others).

Also that PyPDF2 issue is actually describing an issue that happens during instantiation - it just so happens that instantiation manages to call the "private" method _buildOutline.

So - the actual issue is:

PDFs generated without proper metadata can cause PyPDF2 to throw an error from which we do not gracefully recover - nor do we provide a meaningful and informative error message about why we get the error.

Potential Solutions

1. Just catch the error that is thrown by PyPDF2 and raise a new one to inform the user of why it won't work.

In this case we need to let the user know:

You cannot use get_toc at all with this PDF. Using split_chapters or split_subchapters with this PDF will require that you pass your own jsondata value as a param as the file does not contain sufficient metadata for extraction.
If you do know the page ranges to which you want to convert the PDF, then you may use write_pagerange to do so manually.

Also - likely this should be better documented so that the error can point to a part of the documentation. For example, this is the data structure output from get_toc(subchapters=False) - we should have a version w/ subchapters=True and document this to the user.

2. Wrap pdftk - breaks the current PDFParser API

PDFParser does much of its heavy lifting using the PyPDF2 tools. Changing this won't be too hard. I've already done so in #324 to some degree and it won't be too much work to make write_pagerange use pdftk as well.

However, there are some things to consider:

pdftk isn't the easiest thing to install on all OSs. It isn't available in publish Debian/Ubuntu repositories through apt - and the pdftk-java wrapper of it is behind several patch versions - ones which resolve an issue with the version used in pdftk-java. So Linux users have to download the binary from Gitlab for 2.3.3 or later. Also isn't available on Homebrew for Mac
This will break the PDFParser API because it currently exposes a property pdf which has is an instance of PyPDF.PdfFileReader - so anybody who has written code using that property will see their chefs break. This may not be much of an issue, however, but it's worth considering how we can deprecate it gracefully if we go this route.

3. Fork PyPDF2 and comment the error out

This gives us another repo to manage (boo) for something really simple and another package to release and care about.

learningequality / ricecooker