outline (other metadata?) not preserved using cat

jwhendy commented 7 years ago

tl;dr: this got long as I investigated. In short, the cat command appears to strip the document's outline/bookmark index. PyPDF2 appears to support outlines, so this is both 1) a notice that this is happening, in case you weren't aware, and 2) a feature request.

In my opinion, if running cat just to spit out the file, stapler should maintain whatever features already existed. I could also see it being a handy option in general, but merging the TOC/index locations of multiple files might get messy?
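To make the multi-file case concrete, here is a minimal sketch of the simplest merge policy: keep every bookmark and shift its page index by the number of pages that come before its file. It uses plain (title, page_index) tuples instead of real PyPDF2 objects, and the name merge_outlines is hypothetical, not something stapler or PyPDF2 provides.

```python
def merge_outlines(documents):
    """Combine bookmarks from several whole documents into one list.

    documents: list of (page_count, bookmarks) pairs, where bookmarks is a
    list of (title, page_index) tuples with 0-based page indices.
    This is an illustrative stand-in, not a PyPDF2 or stapler API.
    """
    merged = []
    offset = 0  # number of pages already emitted by earlier documents
    for page_count, bookmarks in documents:
        for title, page_index in bookmarks:
            merged.append((title, page_index + offset))
        offset += page_count
    return merged


docs = [
    (3, [("A", 0), ("B", 2)]),  # first file: 3 pages, 2 bookmarks
    (2, [("C", 1)]),            # second file: 2 pages, 1 bookmark
]
merged = merge_outlines(docs)
```

This only covers concatenating complete files; it says nothing about what to do when only some pages of a file are selected.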

I just submitted #30, and during testing I ran stapler cat on the whole file and noticed that the output file size differed from the original. Opening both, I found that, in evince at least, the stapler-generated version had no outline/table of contents, while the original did. Here are both open (evince, Arch Linux), with the original on the left (Outline view in the side pane) and the stapler-generated file on the right, where this view is not available. Also of note: the "meta title" is removed from the stapler version.

[screenshot: 2017-05-01_182957]

I thought perhaps PyPDF doesn't provide this, but after looking around it appears it does, at least to some extent:

PdfFileReader

getOutlines(node=None, outlines=None) Retrieves the document outline present in the document.

Returns: a nested list of Destinations.

PdfFileWriter

addBookmark(title, pagenum, parent=None, color=None, bold=False, italic=False, fit='/Fit', *args) Add a bookmark to this PDF file.

This seemed straightforward: just translate getOutlines() to addBookmark()? Not so easy... I couldn't find a way to get the page number from the result (though I'm an absolute Python novice, so no surprise there). After some fiddling, I was able to use some example code to manually add a bookmark, and found at least two answers that tackle converting the location ID returned by getOutlines() into a page number.[1] [2]

Find attached:

test-original.pdf: file generated using Org-mode/LaTeX that I knew would feature a TOC/outline
test-stapler.pdf: file produced with stapler cat test.pdf test-cat.pdf
test-pypdf2.pdf: file produced from the following code

#!/usr/bin/env python2

from PyPDF2 import PdfFileWriter, PdfFileReader

# code for translating from bookmark to page number
# - http://stackoverflow.com/questions/1918420/split-a-pdf-based-on-outline
# - http://stackoverflow.com/questions/8329748/how-to-get-bookmarks-page-number
def _setup_page_id_to_num(pdf, pages=None, _result=None, _num_pages=None):
    if _result is None:
        _result = {}
    if pages is None:
        _num_pages = []
        pages = pdf.trailer["/Root"].getObject()["/Pages"].getObject()
    t = pages["/Type"]
    if t == "/Pages":
        for page in pages["/Kids"]:
            _result[page.idnum] = len(_num_pages)
            _setup_page_id_to_num(pdf, page.getObject(), _result, _num_pages)
    elif t == "/Page":
        _num_pages.append(1)
    return _result

orig = PdfFileReader(open("./test-original.pdf", "rb"))
outPdf = PdfFileWriter()
outStream = open("./test-pypdf2.pdf", "wb")  # open() instead of the Python 2-only file()

outPdf.addPage(orig.getPage(0))
outPdf.addPage(orig.getPage(1))
outPdf.addPage(orig.getPage(2))

id_to_nums = _setup_page_id_to_num(orig)
outline = orig.getOutlines()

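for entry in outline:
    # getOutlines() returns a nested list; sub-lists hold child bookmarks.
    # Only top-level entries are copied here.
    if isinstance(entry, list):
        continue
    title = entry["/Title"]
    # 0-based page index; the original code added 1 to get the physical page
    page = id_to_nums[entry.page.idnum]
    print(title)
    outPdf.addBookmark(title, page, parent=None)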

outPdf.write(outStream)
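
Since getOutlines() is documented as returning a nested list, child bookmarks need some kind of recursive walk. Here is a minimal sketch of that walk, using plain (title, page_index) tuples as stand-ins for Destination objects so it runs without PyPDF2. It assumes a sub-list holds the children of the entry just before it; flatten_outline is a hypothetical helper name.

```python
def flatten_outline(outline, depth=0):
    """Yield (title, page_index, depth) for every entry in a nested outline.

    outline: list whose items are either (title, page_index) tuples or
    sub-lists of the same shape (assumed to be children of the previous entry).
    """
    for entry in outline:
        if isinstance(entry, list):
            # Recurse into the children, one nesting level deeper
            for item in flatten_outline(entry, depth + 1):
                yield item
        else:
            title, page_index = entry
            yield (title, page_index, depth)


sample = [("Intro", 0), [("Background", 1)], ("Results", 2)]
flat = list(flatten_outline(sample))
```

With real PyPDF2 objects, the depth could be used to choose the parent argument passed to addBookmark, though that part is not shown here.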

attachments

footnotes

hellerbarde commented 7 years ago

Thank you for both of your very thorough issues! I wonder if we can somehow feed the outline back in to the FileWriter... The workaround you mentioned with the bookmarks sounds a little tedious and not very satisfying to use.

jwhendy commented 7 years ago

@hellerbarde Thanks for taking a look, and happy to submit for a piece of software I think is just so great to have around (even more so when I can show Windows users at work what's possible :) ).

One other issue with pulling the indices together: will you, the programmer, ever know which combination of the original indices I want if I'm catting multiple files? Or say there was a bookmark on pg 3 and someone cats 4-n; should you ditch the bookmark since they extracted pages after it, or include it since pg 4 is still part of that section?
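
The two choices in the pg 3 / 4-n example can be written down as a small sketch. It works on plain (title, page_index) tuples with 0-based indices, and both the function name remap_bookmarks and the carry_over flag are hypothetical, only there to show the two policies side by side.

```python
def remap_bookmarks(bookmarks, first, last, carry_over=False):
    """Re-index bookmarks for an extracted page range.

    bookmarks: list of (title, page_index) sorted by page_index (0-based).
    first, last: inclusive 0-based range of pages that is kept.
    carry_over: if True, the last bookmark before the range is re-attached
    to the first extracted page, unless that page already has a bookmark.
    """
    kept = []
    pending = None  # title of the most recent section that started before the range
    for title, page_index in bookmarks:
        if page_index < first:
            pending = title
        elif page_index <= last:
            kept.append((title, page_index - first))
    first_page_taken = any(page == 0 for _, page in kept)
    if carry_over and pending is not None and not first_page_taken:
        kept.insert(0, (pending, 0))
    return kept


# Bookmark "Methods" sits on physical page 3 (index 2); pages 4-7 are extracted.
marks = [("Intro", 0), ("Methods", 2), ("Results", 5)]
dropped = remap_bookmarks(marks, first=3, last=6)
carried = remap_bookmarks(marks, first=3, last=6, carry_over=True)
```

Neither policy is obviously right, which is the point of the question above; the sketch just shows that both are cheap to compute once page indices are known.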

No easy answers, and thanks for taking a look!

hellerbarde commented 7 years ago

I have looked into this now and it seems the underlying library doesn't offer much assistance here unfortunately. I'll have to see if I can do anything about it.

jwhendy commented 7 years ago

No worries at all, and you can close if you want. I'm guessing this is pretty fringe, as I would only have expected it to work on full docs. As mentioned in the above comment, if someone is merging a bunch of extractions from different files, I don't see a good way to guess what original bookmarks/sections should be included. Thanks for looking, and this is still an amazing tool :)

hellerbarde commented 7 years ago

OK. I'll leave this open as a reminder to look into it again for merging complete documents. But I don't think I'll dive into the nuts and bolts of figuring out how PDF does TOCs... :)

Thanks for the kind words and stuff.

PS: I'm working on a GUI for concatenating files. Psst, mum's the word!

Frenzie commented 7 years ago

Apologies for the off topic remark in advance but I figured this could be useful to some. For GUIs there's also a useful little program called pdfshuffler, but I'm not sure how it handles TOCs. I usually turn to pdftk cat for that kind of thing because it deals with it fairly well.

hellerbarde / stapler

outline (other metadata?) not preserved using cat #31