pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Linearizing PDF does not speed up web viewing #3900

Open blenzi opened 3 days ago

blenzi commented 3 days ago

Description of the bug

Hi,

I am trying to save a linearized version of a PDF to speed up its web viewing. A file linearized with the Adobe API does load faster in a Firefox / pdf.js viewer, but the one linearized with PyMuPDF doesn't seem to help.

Original PDF: https://minio.lab.sspcloud.fr/blenzi/public/org64M.pdf
Linearized with pymupdf 1.24.10: https://minio.lab.sspcloud.fr/blenzi/public/linear.pdf
Linearized with Adobe API: https://minio.lab.sspcloud.fr/blenzi/public/LinearizePDF.pdf

Any ideas?

Thanks in advance

How to reproduce the bug

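import fitz  # PyMuPDF

# Copy the original into a new document and save it in linearized form.
doc = fitz.open("org64M.pdf")
with fitz.open() as doc2:
    doc2.insert_pdf(doc)
    doc2.save("linear.pdf", linear=True)

# Compare the fast-web-access flag of the original, the PyMuPDF output and the Adobe output.
with fitz.open("linear.pdf") as doc2, fitz.open("LinearizePDF.pdf") as doc3:
    print(doc.is_fast_webaccess, doc2.is_fast_webaccess, doc3.is_fast_webaccess)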

# Output: 0 1 1

PyMuPDF version

1.24.10

Operating system

Linux

Python version

3.12

JorjMcKie commented 8 hours ago

The linearized version is created by MuPDF code, so PyMuPDF cannot do anything here. I suggest you submit a bug report in their issue tracker, https://bugs.ghostscript.com/enter_bug.cgi. You should probably also discuss this with the MuPDF team on their Discord channel.

Other than that, I also suggest using a garbage collection option when saving the PDF, like garbage=3. Creating the linearized format drastically changes a PDF's internal structure, leaving behind objects that are no longer used. I would also make sure to compress eligible objects with deflate=True.
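
As a minimal sketch of that suggestion (not a confirmed fix for the viewing speed), the save options could be combined as below. The helper names build_save_options and save_linearized are hypothetical and only serve the illustration; garbage, deflate and linear are the Document.save() keywords already mentioned in this thread, and linear=True is assumed to be accepted by the PyMuPDF version from the report (1.24.10).

```python
def build_save_options(linear=True):
    """Collect the Document.save() keyword arguments discussed above."""
    return {
        "garbage": 3,      # drop unused objects, compact the xref, merge duplicates
        "deflate": True,   # compress eligible uncompressed streams
        "linear": linear,  # request linearized output, as in the report
    }


def save_linearized(src_path, dst_path):
    """Open src_path and save it to dst_path with the options above.

    Hypothetical helper; the paths are supplied by the caller.
    """
    import fitz  # PyMuPDF, imported here so the options helper works without it

    with fitz.open(src_path) as doc:
        doc.save(dst_path, **build_save_options())


if __name__ == "__main__":
    # Only prints the options; call save_linearized("org64M.pdf", "linear.pdf")
    # to actually write a file.
    print(build_save_options())
```

Whether the smaller, cleaned-up file loads faster in pdf.js would still need to be checked in the browser; is_fast_webaccess only reports that the file is marked as linearized.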