Open no1xsyzy opened 6 years ago
Hi @no1xsyzy! Thanks for using pdfmerge. I ran a few tests to try and figure out what's going on.
Inputs
cover.pdf (a 1-page pdf with the "cover")
body.pdf (a 3-page pdf with a TOC, and two pages that the TOC links to).
Test 1: pdfmerge cover.pdf body.pdf -o test-1.pdf
Test 2: pdfmerge cover.pdf "body.pdf[1]" "body.pdf[3]" "body.pdf[2]" -o test-2.pdf
Test 3: pdfmerge cover.pdf "body.pdf[1]" "body.pdf[2]" "body.pdf[3]" -o test-3.pdf
Test 4: pdfmerge cover.pdf "body.pdf[1..2]" "body.pdf[3]" -o test-4.pdf
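To make it explicit which pages each test pulls in and in what order, here is a minimal sketch of how the bracketed page specs in these commands could be read. This is not pdfmerge's actual parser: `parse_spec` and the `SPEC` pattern are hypothetical names, and the sketch only covers the two forms used in the tests (`file.pdf[N]` and `file.pdf[N..M]`).

```python
import re

# Hypothetical pattern covering only the forms used in the tests above:
#   file.pdf        -> all pages
#   file.pdf[3]     -> a single page
#   file.pdf[1..2]  -> an inclusive page range
SPEC = re.compile(r"^(?P<path>.+?)(?:\[(?P<first>\d+)(?:\.\.(?P<last>\d+))?\])?$")


def parse_spec(spec):
    """Split a spec into (path, pages); pages is None when no range is given."""
    match = SPEC.match(spec)
    if match is None:
        raise ValueError(f"cannot parse spec: {spec!r}")
    path = match.group("path")
    first = match.group("first")
    if first is None:
        return path, None  # no brackets: take the whole file
    last = match.group("last") or first
    return path, list(range(int(first), int(last) + 1))


# Test 2 writes the body pages in the order 1, 3, 2.
test_2 = ["cover.pdf", "body.pdf[1]", "body.pdf[3]", "body.pdf[2]"]
test_2_order = [parse_spec(arg) for arg in test_2]

# Test 4 writes pages 1 and 2 together, then page 3 on its own.
test_4 = ["cover.pdf", "body.pdf[1..2]", "body.pdf[3]"]
test_4_order = [parse_spec(arg) for arg in test_4]
```

Read this way, Tests 3 and 4 select the same pages in the same order and differ only in how the pages are grouped when they are written, which is the variable the tests are probing.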
I'm not exactly sure what is happening, but it seems that if the page with a link and its target aren't written to the output stream at the same time, the link gets broken.
pdfmerge is built on pyPDF2, so I'm going to see if there's any information about how this works and if there's anything I can do to prevent that from happening.
Is there any other information about what you were trying to do that I should know in diagnosing this error?
This is very interesting because it disproved my hypothesis. I need to learn more about how links get put into the output stream, but for the record, this is where I'm adding pages to the output stream which just calls addPage using pyPDF2.
Not sure at which point the links are getting dropped.
I would also be very interested in a fix for this. My use case is merging multiple complete pdf documents (not picking specific pages from any). My first pdf always has a TOC on the 2nd page but I then merge any additional number of pdfs on the end (these are appendices) and the TOC from the first pdf is broken.
I get this warning when I do the merge (I'm not sure whether it's relevant):
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
Hi @exptom! Thanks for reaching out about your situation and the warning you're getting. This might all be related, so we'll keep it in this issue for now.
I just did a test with adding a pdf to the end of a pdf that starts with a TOC and the links still work. What are you using to generate the separate pdfs?
@metaist thanks for getting back to me. The initial pdf that includes the TOC is created using wkhtmltopdf (https://github.com/wkhtmltopdf/wkhtmltopdf) and the additional pdfs that are merged as appendices can come from anywhere (users upload them).
Oh, so they literally start out as HTML links and are then converted to PDF links. Interesting. I will begin my deep dive into how PDF links actually work and are encoded. This may require an upstream patch to PyPDF2 once I figure out how their stuff works.
I'm also looking at other places where people have issues with PDF links (e.g., combine_pdf) to see if I can learn anything from their general experience.
Unfortunately, I do not have an easy short-term fix, but will keep this issue open and post here as I learn new things.
They aren't actually HTML links. What happens is that wkhtmltopdf converts the HTML page to a PDF document and scans the HTML, pulling out all the heading tags (<h1>, <h2>, etc.), and uses them to generate a TOC.
I just released pdfmerge 1.0.0, which uses the newer version of pypdf, and I went back to check if this issue still exists. Unfortunately, it does. Anybody have any ideas on how links in PDF work?
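For what it's worth, here is a rough mental model of internal PDF links, written as a toy in plain Python rather than against any real PDF library. In the PDF format a link is an annotation attached to a page, and its destination is either explicit (a reference to the target page object) or named (a name looked up in a document-level table in the catalog). Everything below (`Link`, `Page`, `Document`, `naive_merge`, `resolve`) is a hypothetical stand-in, and `naive_merge` is an assumption about what a page-by-page copy might do, not a description of what pypdf or pdfmerge actually does.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class Link:
    # An int stands in for an explicit destination (a reference to a page object);
    # a str stands in for a named destination resolved through the document.
    dest: Union[int, str]


@dataclass
class Page:
    page_id: int  # stands in for the page's indirect object reference
    links: list = field(default_factory=list)


@dataclass
class Document:
    pages: list = field(default_factory=list)
    named_dests: dict = field(default_factory=dict)  # name -> page_id


def resolve(doc, link):
    """Return the 0-based index of the page a link leads to, or None if it is dead."""
    target = link.dest
    if isinstance(target, str):
        target = doc.named_dests.get(target)
    for index, page in enumerate(doc.pages):
        if page.page_id == target:
            return index
    return None


def naive_merge(*docs):
    """Assumed behaviour: copy pages (with their links) but no document-level tables."""
    merged = Document()
    for doc in docs:
        merged.pages.extend(doc.pages)
    return merged


cover = Document(pages=[Page(100)])
toc = Page(1, links=[Link("chapter-1"), Link(3)])
body = Document(pages=[toc, Page(2), Page(3)], named_dests={"chapter-1": 2})
merged = naive_merge(cover, body)

named_before = resolve(body, toc.links[0])     # resolves via the name table
named_after = resolve(merged, toc.links[0])    # name table was not copied
explicit_after = resolve(merged, toc.links[1]) # target page is still present
```

In this toy, the named link is still present on the TOC page after the merge but no longer leads anywhere, because the name table it depends on was never copied. That would match the "still links, but clicking does nothing" symptom if the source PDFs use named destinations, which is something to check rather than a confirmed diagnosis.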
Background: I used pandoc + texlive for my thesis and pdfmerge for another submission (I study at a cooperative college). After merging, the TOC links and reference links no longer work. They are still links, but clicking one does not navigate to its target. I think this is a bug because the targets should still be there: they are hyperlinks, and their destinations are well defined.