learningequality / ricecooker

Python library for creating Kolibri channels and uploading to Studio
https://ricecooker.readthedocs.io/
MIT License
18 stars 53 forks source link

PDF Files created with wkhtmltopdf cannot have their outline parsed #312

Open rtibbles opened 3 years ago

rtibbles commented 3 years ago

Description

Because of an extant bug in PyPDF2 https://github.com/mstamy2/PyPDF2/issues/193 trying to read the outline for a file generated in wkhtmltopdf results in an error. This means that large PDF files generated in this way cannot be split up into smaller PDFs.

Potential resolutions

  1. The issue contains a way to bypass this issue by patching the library - but it's not clear whether it actually fixes the problem, or just prevents a blocking error. I suspect that it just lets the PDF to continue to be parsed, but that we would not get the outline data - which we need in order to do the by chapter and subchapter splitting that we are using PyPDF2 for.

  2. Doing a more systematic fix on this would potentially involve forking and diving deeper into PyPDF2, as there appears to be no current maintenance of the library.

  3. There is another Python library https://github.com/pdfminer/pdfminer.six that is actively maintained - but it appears that it only extracts layout metadata. In order to use this, we would have to migrate the outline reading code to use pdfminer.six, but still continue to use PyPDF2 for the actual page splitting.

  4. Another alternative is to use the pdftk binary https://www.pdflabs.com/tools/pdftk-server/ - while there is a Python wrapper library https://github.com/revolunet/pypdftk for this, it has a very limited API and does not expose the tools we need. However, pdftk is able to read the entire outline and dump all the relevant information to JSON, which we could then read back into Python. We could then also use pdftk to do the splitting as well.

In terms of ongoing maintainability, I think option 4 is probably the most promising bet, as we are leveraging a widely used open source binary to handle the complex business of PDF parsing. One possible upside is that whereas PyPDF2 failed to preserve accessibility focused aspects of PDFs during the splitting, pdftk might do a better job of this.

We would have to add good graceful failure and the ability to configure the path to the pdftk binary in order to leverage this.

Next steps

First we would need to sanity check that pdftk run from the command line can handle wkhtmltopdf generated PDFs and spit out their outlines and split them.

If this does indeed work, to move forward with this strategy, we would have to remove PyPDF2 and replace operations with subprocess calls to operations in pdftk: https://www.pdflabs.com/docs/pdftk-man-page/

Because of deferring to pdftk for the PDF processing, most of the other logic in the PDFParser class could be removed.

rtibbles commented 3 years ago

Preliminary exploration confirmed that pdftk is able to do the kind of segmenting operations we require on wkhtmltopdf generated PDFs, so using that seems like the best way forward.

nucleogenesis commented 3 years ago

I've done some digging here and want to update the issue.

Because of an extant bug in PyPDF2 mstamy2/PyPDF2#193 trying to read the outline for a file generated in wkhtmltopdf results in an error.

This isn't the issue - the issue is that wkhtmltopdf does not generate a meaningful metadata about bookmarks even when it successfully gleans that they exist during parsing. wkhtmltopdf seems to be able to generate the metadata manually, but only from a file generated and tweaked separately. Ultimately, the only metadata we get from a wkhtmltopdf file is the hierarchical structure of bookmarks/outline items (ie, which of the chapters are subchapters of others).

Also that PyPDF2 issue is actually describing an issue that happens during instantiation - it just so happens that instantiation manages to call the "private" method _buildOutline.


So - the actual issue is:

PDFs generated without proper metadata can cause PyPDF2 to throw an error from which we do not gracefully recover - nor do we provide a meaningful and informative error message about why we get the error.

Potential Solutions

1. Just catch the error that is thrown by PyPDF2 and raise a new one to inform the user of why it won't work.

In this case we need to let the user know:

Also - likely this should be better documented so that the error can point to a part of the documentation. For example, this is the data structure output from get_toc(subchapters=False) - we should have a version w/ subchapters=True and document this to the user.

2. Wrap pdftk - breaks the current PDFParser API

PDFParser does much of its heavy lifting using the PyPDF2 tools. Changing this won't be too hard. I've already done so in #324 to some degree and it won't be too much work to make write_pagerange use pdftk as well.

However, there are some things to consider:

3. Fork PyPDF2 and comment the error out