pmaupin / pdfrw

pdfrw is a pure Python library that reads and writes PDFs

Preserve table of contents when editing PDF #226

Open NiklasKappel opened 2 years ago

NiklasKappel commented 2 years ago

I use the following simple combination of PdfReader and PdfWriter to edit PDF files.

import pdfrw

def main():
    # PdfReader accepts a file name directly, so no file handle is left open.
    pdf = pdfrw.PdfReader("input.pdf")
    writer = pdfrw.PdfWriter()
    for page in pdf.pages:
        # Work on page.
        writer.addpage(page)
    writer.write("output.pdf")

if __name__ == "__main__":
    main()

Here, # Work on page. is a placeholder for code that sometimes leaves the page unchanged and sometimes merges it with a separately prepared page.

If input.pdf contains a table of contents (metadata that PDF viewers display in the sidebar and that can be used to navigate the document), the table of contents is missing from output.pdf.

Is it possible to preserve the table of contents in this example using pdfrw? For example, can the table of contents be extracted from input.pdf and inserted into output.pdf?

sl2c commented 10 months ago

Hi!

Your code creates a new, empty PDF document and adds pages to it, so document-level data such as the table of contents is not carried over. If you want to edit an existing document, a much simpler way is to omit the calls to addpage altogether, edit the pages of pdf in place, as your code already does, and at the end write the result out using:

pdfrw.PdfWriter('output.pdf', trailer=pdf).write()