Open rtibbles opened 3 years ago
Preliminary exploration confirmed that pdftk is able to do the kind of segmenting operations we require on wkhtmltopdf generated PDFs, so using that seems like the best way forward.
I've done some digging here and want to update the issue.
Because of an extant bug in PyPDF2 mstamy2/PyPDF2#193 trying to read the outline for a file generated in wkhtmltopdf results in an error.
This isn't the issue - the issue is that wkhtmltopdf
does not generate a meaningful metadata about bookmarks even when it successfully gleans that they exist during parsing. wkhtmltopdf
seems to be able to generate the metadata manually, but only from a file generated and tweaked separately. Ultimately, the only metadata we get from a wkhtmltopdf
file is the hierarchical structure of bookmarks/outline items (ie, which of the chapters are subchapters of others).
Also that PyPDF2 issue is actually describing an issue that happens during instantiation - it just so happens that instantiation manages to call the "private" method _buildOutline
.
So - the actual issue is:
PDFs generated without proper metadata can cause PyPDF2 to throw an error from which we do not gracefully recover - nor do we provide a meaningful and informative error message about why we get the error.
Potential Solutions
1. Just catch the error that is thrown by PyPDF2 and raise a new one to inform the user of why it won't work.
In this case we need to let the user know:
get_toc
at all with this PDF. Using split_chapters
or split_subchapters
with this PDF will require that you pass your own jsondata
value as a param as the file does not contain sufficient metadata for extraction.write_pagerange
to do so manually.Also - likely this should be better documented so that the error can point to a part of the documentation. For example, this is the data structure output from get_toc(subchapters=False)
- we should have a version w/ subchapters=True
and document this to the user.
2. Wrap pdftk - breaks the current PDFParser API
PDFParser does much of its heavy lifting using the PyPDF2 tools. Changing this won't be too hard. I've already done so in #324 to some degree and it won't be too much work to make write_pagerange
use pdftk as well.
However, there are some things to consider:
pdftk
isn't the easiest thing to install on all OSs. It isn't available in publish Debian/Ubuntu repositories through apt - and the pdftk-java
wrapper of it is behind several patch versions - ones which resolve an issue with the version used in pdftk-java
. So Linux users have to download the binary from Gitlab for 2.3.3 or later. Also isn't available on Homebrew for Macpdf
which has is an instance of PyPDF.PdfFileReader
- so anybody who has written code using that property will see their chefs break. This may not be much of an issue, however, but it's worth considering how we can deprecate it gracefully if we go this route.3. Fork PyPDF2 and comment the error out
Description
Because of an extant bug in PyPDF2 https://github.com/mstamy2/PyPDF2/issues/193 trying to read the outline for a file generated in wkhtmltopdf results in an error. This means that large PDF files generated in this way cannot be split up into smaller PDFs.
Potential resolutions
The issue contains a way to bypass this issue by patching the library - but it's not clear whether it actually fixes the problem, or just prevents a blocking error. I suspect that it just lets the PDF to continue to be parsed, but that we would not get the outline data - which we need in order to do the by chapter and subchapter splitting that we are using PyPDF2 for.
Doing a more systematic fix on this would potentially involve forking and diving deeper into PyPDF2, as there appears to be no current maintenance of the library.
There is another Python library https://github.com/pdfminer/pdfminer.six that is actively maintained - but it appears that it only extracts layout metadata. In order to use this, we would have to migrate the outline reading code to use pdfminer.six, but still continue to use PyPDF2 for the actual page splitting.
Another alternative is to use the pdftk binary https://www.pdflabs.com/tools/pdftk-server/ - while there is a Python wrapper library https://github.com/revolunet/pypdftk for this, it has a very limited API and does not expose the tools we need. However, pdftk is able to read the entire outline and dump all the relevant information to JSON, which we could then read back into Python. We could then also use pdftk to do the splitting as well.
In terms of ongoing maintainability, I think option 4 is probably the most promising bet, as we are leveraging a widely used open source binary to handle the complex business of PDF parsing. One possible upside is that whereas PyPDF2 failed to preserve accessibility focused aspects of PDFs during the splitting, pdftk might do a better job of this.
We would have to add good graceful failure and the ability to configure the path to the pdftk binary in order to leverage this.
Next steps
First we would need to sanity check that pdftk run from the command line can handle wkhtmltopdf generated PDFs and spit out their outlines and split them.
If this does indeed work, to move forward with this strategy, we would have to remove PyPDF2 and replace operations with subprocess calls to operations in pdftk: https://www.pdflabs.com/docs/pdftk-man-page/
dump_data
command. Writing the JSON out to a temporary file in a subprocess, reading that JSON back into Python and then cleaning up the temporary file.cat
command and pass the page ranges parsed fromget_toc
to generate the new file.Because of deferring to pdftk for the PDF processing, most of the other logic in the
PDFParser
class could be removed.