Keep in-document hyperlinks after merged

no1xsyzy commented 6 years ago

Background: I used pandoc+texlive for my thesis and pdfmerge for another submission (I study at a cooperative college). After merging, the TOC links and reference links don't work anymore. They are still links, but clicking them does not navigate to the link target. I think this is a bug because the links should have been preserved: they are hyperlinks, and their targets are well defined.

metaist commented 6 years ago

Hi @no1xsyzy! Thanks for using pdfmerge. I ran a few tests to try and figure out what's going on.

Inputs

cover.pdf (a 1-page pdf with the "cover")
body.pdf (a 3-page pdf with a TOC, and two pages that the TOC links to).

Test 1: pdfmerge cover.pdf body.pdf -o test-1.pdf

Works as expected; all the links still work.

Test 2: pdfmerge cover.pdf "body.pdf[1]" "body.pdf[3]" "body.pdf[2]" -o test-2.pdf

Links no longer work.

Test 3: pdfmerge cover.pdf "body.pdf[1]" "body.pdf[2]" "body.pdf[3]" -o test-3.pdf

Links still don't work.

Test 4: pdfmerge cover.pdf "body.pdf[1..2]" "body.pdf[3]" -o test-4.pdf

First link works, second one doesn't.

I'm not exactly sure what is happening, but it seems that if the page with a link and its target aren't written to the output stream at the same time, the link gets broken.

pdfmerge is built on PyPDF2, so I'm going to see if there's any information about how this works and if there's anything I can do to prevent that from happening.
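As a toy model of this hypothesis (and only of the hypothesis, not of PyPDF2's actual internals), the sketch below treats each pdfmerge argument as one batch and assumes a link survives only when its target page is copied in the same batch. The names `BODY_LINKS` and `merge` are made up for illustration. Under that assumption, the model gives the same outcomes as Tests 1 to 4.

```python
# Toy model of the hypothesis only; this is not how PyPDF2 is implemented.
# body.pdf: page 1 is the TOC and links to pages 2 and 3.
BODY_LINKS = {1: [2, 3], 2: [], 3: []}


def merge(batches):
    """Copy pages batch by batch and report which links still resolve.

    Assumption: a link is kept only if its target page is copied in the
    same batch as the page that carries the link.
    Returns {source_page: [True/False for each link on that page]}.
    """
    result = {}
    for batch in batches:
        copied_together = set(batch)
        for page in batch:
            result[page] = [target in copied_together for target in BODY_LINKS[page]]
    return result


print(merge([[1, 2, 3]])[1])      # Test 1: body.pdf as one batch -> [True, True]
print(merge([[1], [3], [2]])[1])  # Test 2: every page on its own -> [False, False]
print(merge([[1], [2], [3]])[1])  # Test 3: same, in page order   -> [False, False]
print(merge([[1, 2], [3]])[1])    # Test 4: [1..2] then [3]       -> [True, False]
```

The model only restates the observation; it says nothing about where in the copy process the link target is actually lost.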

Is there any other information about what you were trying to do that I should know in diagnosing this error?

no1xsyzy commented 6 years ago

Actually what I did: pdfmerge cover.pdf body.pdf[2..-1] -o test.pdf

Example files: (I tried to make TOC but failed to put that to the second page, so here's a citation hyperlink) body.pdf cover.pdf
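For readers unfamiliar with the bracket syntax in the command above, here is a minimal sketch of how an argument such as `body.pdf[2..-1]` could be read. It is not pdfmerge's actual parser: `parse_page_spec` is a hypothetical helper, and the semantics (1-based pages, inclusive ranges, negative numbers counting from the end) are assumptions inferred from the commands in this thread. The page count of 3 is borrowed from the test file described earlier.

```python
import re

# Hypothetical helper for illustration; not pdfmerge's actual parser.
_SPEC = re.compile(r"(.+?)(?:\[(-?\d+)(?:\.\.(-?\d+))?\])?")


def parse_page_spec(arg, page_count):
    """Split 'file.pdf[a..b]' into (path, [page numbers]).

    Assumed semantics: pages are 1-based, ranges are inclusive, and a
    negative number counts from the end (-1 is the last page).
    """
    match = _SPEC.fullmatch(arg)
    if match is None:
        raise ValueError(f"cannot parse {arg!r}")
    path, first, last = match.groups()
    if first is None:  # no brackets: take every page
        return path, list(range(1, page_count + 1))

    def resolve(number):
        number = int(number)
        return page_count + number + 1 if number < 0 else number

    start = resolve(first)
    end = resolve(last) if last is not None else start
    return path, list(range(start, end + 1))


print(parse_page_spec("cover.pdf", 1))        # ('cover.pdf', [1])
print(parse_page_spec("body.pdf[2..-1]", 3))  # ('body.pdf', [2, 3])
```

Read this way, `body.pdf[2..-1]` selects every page except the first as a single contiguous range.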

metaist commented 6 years ago

This is very interesting because it disproved my hypothesis. I need to learn more about how links get put into the output stream, but for the record, this is where I'm adding pages to the output stream, which just calls addPage from PyPDF2.

Not sure at which point the links are getting dropped.

exptom commented 6 years ago

I would also be very interested in a fix for this. My use case is merging multiple complete pdf documents (not picking specific pages from any). My first pdf always has a TOC on the 2nd page but I then merge any additional number of pdfs on the end (these are appendices) and the TOC from the first pdf is broken.

I get this warning when I do the merge (I'm not sure if it's relevant or not):

PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]

metaist commented 6 years ago

Hi @exptom! Thanks for reaching out about your situation and the warning you're getting. This might all be related, so we'll keep it in this issue for now.

I just did a test with adding a pdf to the end of a pdf that starts with a TOC and the links still work. What are you using to generate the separate pdfs?

exptom commented 6 years ago

@metaist thanks for getting back to me. The initial pdf that includes the TOC is created using wkhtmltopdf (https://github.com/wkhtmltopdf/wkhtmltopdf) and the additional pdfs that are merged as appendices can come from anywhere (users upload them).

metaist commented 6 years ago

Oh, so they literally start out as HTML links and are converted to PDF links. Interesting. Will begin my deep dive into how PDF links actually work and are encoded. This may require an upstream patch to PyPDF2 once I figure out how their stuff works.

I'm also looking at other places where people have issues with PDF links (e.g., combine_pdf) to see if I can learn anything from their general experience.

Unfortunately, I do not have an easy short-term fix, but will keep this issue open and post here as I learn new things.

exptom commented 6 years ago

They aren't actually HTML links. What happens is that wkhtmltopdf converts the HTML page to a PDF document and scans the HTML, pulling out all the heading tags (<h1>, <h2>, etc.) and using them to generate a TOC.
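To make the heading scan concrete, here is a rough sketch of collecting heading tags from an HTML page so they could feed a generated TOC. It uses only Python's standard `html.parser` module and is not wkhtmltopdf's code; `HeadingCollector` is a made-up name, and the real tool's behavior may differ in many details.

```python
from html.parser import HTMLParser
import re


class HeadingCollector(HTMLParser):
    """Collect (level, text) for every <h1>..<h6> element in a page."""

    def __init__(self):
        super().__init__()
        self.headings = []  # filled with (level, text) tuples
        self._level = None  # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._text).strip()))
            self._level = None


collector = HeadingCollector()
collector.feed("<h1>Report</h1><p>Intro text</p><h2>Appendix A</h2>")
print(collector.headings)  # [(1, 'Report'), (2, 'Appendix A')]
```

The point is that the TOC entries come from the document's headings, so the links in the resulting PDF are created by the converter rather than carried over from `<a>` tags.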

metaist commented 1 year ago

I just released pdfmerge 1.0.0, which uses the newer version of pypdf, and I went back to check if this issue still exists. Unfortunately, it does. Does anybody have any ideas on how links in PDFs work?

metaist commented 1 month ago

It seems like pdftk can correctly merge documents. Perhaps I should make pdfmerge a wrapper around pdftk instead of pypdf.
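A wrapper along these lines might start by building a `pdftk ... cat ... output ...` command. The sketch below only constructs the argument list and does not run pdftk. `pdftk_cat_command` is a hypothetical name, page ranges are given directly in pdftk's own syntax (for example `2-end`), and translating pdfmerge's bracket syntax into that form is left out.

```python
import string


def pdftk_cat_command(inputs, output):
    """Build an argv list for `pdftk ... cat ... output ...`.

    inputs: list of (path, page_range) pairs, where page_range is a pdftk
    range such as "2-end", or "" to take the whole file.
    Hypothetical helper; it only builds the command and does not run it.
    """
    handles = string.ascii_uppercase
    if len(inputs) > len(handles):
        raise ValueError("this sketch supports at most 26 input files")
    declarations = []
    cat_args = []
    for handle, (path, page_range) in zip(handles, inputs):
        declarations.append(f"{handle}={path}")     # e.g. B=body.pdf
        cat_args.append(f"{handle}{page_range}")    # e.g. B2-end
    return ["pdftk", *declarations, "cat", *cat_args, "output", output]


cmd = pdftk_cat_command([("cover.pdf", ""), ("body.pdf", "2-end")], "test.pdf")
print(cmd)
# ['pdftk', 'A=cover.pdf', 'B=body.pdf', 'cat', 'A', 'B2-end', 'output', 'test.pdf']
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where pdftk is installed. Whether links survive for every page selection would still need to be checked against the test files in this issue.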

metaist / pdfmerge

Keep in-document hyperlinks after merged #22
