pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

the internal hyperlinks are broken after the merge #3868

Closed pemmadi closed 2 weeks ago

pemmadi commented 2 weeks ago

Description of the bug

I am trying to merge multiple PDFs into a single PDF using PyMuPDF. The merge works, but the internal hyperlinks are broken in the merged file.

How to reproduce the bug

import fitz  # PyMuPDF

def merge_pdfs(pdf_list, output):
    merged_pdf = fitz.open()
    for pdf in pdf_list:
        with fitz.open(pdf) as mfile:
            merged_pdf.insert_pdf(mfile)
    merged_pdf.save(output)

pdf_files = ['file1.pdf', 'file2.pdf']
merge_pdfs(pdf_files, 'merged_output.pdf')

PyMuPDF version

1.24.0

Operating system

Windows

Python version

3.8

JorjMcKie commented 2 weeks ago

It is mandatory to provide reproducing data when submitting a bug.

pemmadi commented 2 weeks ago

file1.pdf file2.pdf merged_output.pdf

pemmadi commented 2 weeks ago

@JorjMcKie - file1.pdf and file2.pdf have internal links that work as expected, but after the merge they no longer work (see merged_output.pdf).

JorjMcKie commented 2 weeks ago

Please read the documentation here. You will see that "named" internal links are not supported / ignored. As you do not want to provide an example file 😒, you need to check yourself whether file2.pdf has named internal links.
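
One way to run that check is sketched below. It is an illustration rather than part of the original reply: the helper names count_kinds and named_link_report are made up here, and it assumes a PyMuPDF release that can be imported as pymupdf (older releases use import fitz instead). It relies only on page.get_links() and the pymupdf.LINK_NAMED constant, and reports how many named links each page contains.

```python
from collections import Counter


def count_kinds(links):
    """Count link dictionaries (as returned by page.get_links()) by their "kind"."""
    return Counter(link["kind"] for link in links)


def named_link_report(path):
    """Return {page number: number of named links} for pages that have any."""
    # Imported here so that count_kinds() stays usable without PyMuPDF installed.
    import pymupdf

    report = {}
    with pymupdf.open(path) as doc:
        for page in doc:
            named = count_kinds(page.get_links())[pymupdf.LINK_NAMED]
            if named:
                report[page.number] = named
    return report
```

Calling named_link_report("file2.pdf") and getting a non-empty dictionary would indicate that the file uses named internal links, which are the ones that get lost in the merge.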

pemmadi commented 2 weeks ago

@JorjMcKie - I am using XSL to read data from XML and create a ToC with internal links, then converting the result to PDF. Below is the code snippet that generates the links:

<xsl:template name="make-tableofcontents">
    <h2>
        <a name="toc">Table of Contents</a>
    </h2>
    <ul>
        <xsl:for-each select="n1:component/n1:structuredBody/n1:component/n1:section/n1:title">
                <li>
                    <a href="#{generate-id(.)}">
                        <xsl:value-of select="."/>
                    </a>
                </li>
        </xsl:for-each>
    </ul>
</xsl:template>

The generate-id() function in XSLT does not directly create a named destination link; it only generates a unique identifier for an XML node.

Can you help me fix this internal links issue? Is there any other way to create links without using named destinations?

JorjMcKie commented 2 weeks ago

You can convert named links to GoTo links using PyMuPDF. This script does work:

import pymupdf

doc1 = pymupdf.open("file1.pdf")
doc2 = pymupdf.open("file2.pdf")
for page in doc2:
    links = page.get_links()
    for link in links:  # replace NAMED by GOTO links
        if link["kind"] != pymupdf.LINK_NAMED:
            continue
        nlink = {
            "kind": pymupdf.LINK_GOTO,
            "from": link["from"],
            "to": link["to"],
            "page": link["page"],
            "zoom": link["zoom"],
        }
        page.delete_link(link)  # delete named link
        page.insert_link(nlink)  # insert its GOTO version
    page = doc2.reload_page(page)  # important: finalize page updates!
doc1.insert_pdf(doc2)
doc1.ez_save("merged.pdf")
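
The same conversion can be folded into the merge_pdfs function from the original report so that it handles any number of input files. This is a sketch under assumptions, not a tested drop-in: goto_from_named is a hypothetical helper, and it assumes that get_links() returns the resolved "page", "to" and "zoom" entries for named links, as the script above relies on. Named links whose target cannot be resolved are left untouched, so insert_pdf will still ignore them.

```python
def goto_from_named(link, goto_kind):
    """Build a GoTo link dict from a resolved named link dict.

    Returns None when the link carries no usable target page or point.
    """
    if link.get("page", -1) < 0 or "to" not in link:
        return None
    return {
        "kind": goto_kind,
        "from": link["from"],
        "page": link["page"],
        "to": link["to"],
        "zoom": link.get("zoom", 0),
    }


def merge_pdfs(pdf_list, output):
    """Merge the given PDFs, converting named links to GoTo links first."""
    # Imported here so that goto_from_named() stays usable without PyMuPDF.
    import pymupdf

    merged = pymupdf.open()
    for path in pdf_list:
        with pymupdf.open(path) as src:
            for page in src:
                for link in page.get_links():
                    if link["kind"] != pymupdf.LINK_NAMED:
                        continue
                    goto = goto_from_named(link, pymupdf.LINK_GOTO)
                    if goto is None:
                        continue  # unresolved target: leave the link as it is
                    page.delete_link(link)  # delete the named link
                    page.insert_link(goto)  # insert its GoTo version
                page = src.reload_page(page)  # finalize page updates
            merged.insert_pdf(src)
    merged.ez_save(output)
    merged.close()
```

With this version the call from the report, merge_pdfs(['file1.pdf', 'file2.pdf'], 'merged_output.pdf'), stays unchanged. The source files on disk are not modified, because the link changes are made only on the in-memory copies.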

pemmadi commented 2 weeks ago

@JorjMcKie - Thanks man, it's working, really appreciated.