pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

ObjStm compression and PDF linearization doesn't work together #3603

Open SteveHawk opened 1 week ago

SteveHawk commented 1 week ago

Description of the bug

Since v1.24.1 introduced the use_objstms option in Document.save(), setting use_objstms=1 and linear=True together fails on some documents and produces a broken PDF file. On versions >= 1.24.3, some documents even cause the program to crash.

How to reproduce the bug

Here's a minimal reproducible program:

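import fitz  # PyMuPDF

def test(filename: str) -> None:
    # Save with object streams and linearization enabled together.
    with fitz.open(filename) as doc:
        doc.ez_save("output.pdf", use_objstms=1, linear=True)
    # Reopen the result and render every page; the MuPDF errors show up here.
    with fitz.open("output.pdf") as doc:
        for page in doc:
            page.get_pixmap(dpi=72)

test("2401.08541v1.pdf")
test("1706.03762v7.pdf")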

We ran into the problem when processing some internal documents, but managed to reproduce the issue on two random papers downloaded from arXiv. Here are the files:

1706.03762v7.pdf 2401.08541v1.pdf

When running the program, it prints error logs like the ones below during pixmap generation, possibly because the output file is broken.

MuPDF error: syntax error: cannot find XObject resource 'Im1'
MuPDF error: syntax error: cannot find XObject resource 'Im2'
MuPDF error: syntax error: cannot find XObject resource 'Im3'
MuPDF error: syntax error: cannot find XObject resource 'Fm1'
MuPDF error: syntax error: cannot find XObject resource 'Fm2'
MuPDF error: syntax error: cannot find XObject resource 'Fm3'
MuPDF error: syntax error: cannot find XObject resource 'Fm4'
MuPDF error: syntax error: cannot find XObject resource 'Fm5'

The resulting PDF file is either blank or contains only some lines with no text when opened in Ubuntu's Evince document viewer. Opening it in Chrome does show text, but the font is altered and the figures are gone.

Also, turning on garbage collection seems to affect the crash pattern: when using ez_save, the first file crashes the program; when using save with no gc, the second file crashes the program. Both crash with this log:

realloc(): invalid next size
fish: Job 1, 'python test.py' terminated by signal SIGABRT (Abort)
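Because the crash is a SIGABRT raised in native code, it takes down the whole interpreter, which makes it awkward to compare the ez_save and save variants across files in one run. A minimal sketch of one way around that, assuming only the Python standard library: run each save attempt in a child interpreter and report how it ended. The names SAVE_SNIPPET, run_isolated, describe and survey are hypothetical helpers for illustration, not part of PyMuPDF; the save calls inside the snippet mirror the ones in the report above.

```python
import signal
import subprocess
import sys

# Hypothetical child script: one save attempt per process, using the same
# options as the reproduction above. "ez" selects ez_save, anything else
# selects plain save (which does no garbage collection by default).
SAVE_SNIPPET = """
import sys
import fitz
src, dst, mode = sys.argv[1], sys.argv[2], sys.argv[3]
with fitz.open(src) as doc:
    if mode == "ez":
        doc.ez_save(dst, use_objstms=1, linear=True)
    else:
        doc.save(dst, use_objstms=1, linear=True)
"""


def run_isolated(code: str, *args: str) -> int:
    """Run `code` in a fresh Python process and return its exit status.

    The child's output is captured and discarded; only the status is kept.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code, *args],
        capture_output=True,
    )
    return proc.returncode


def describe(returncode: int) -> str:
    """Turn a subprocess return code into a short human-readable label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative value means the child was killed by a signal.
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"


def survey(files: list[str]) -> dict[tuple[str, str], str]:
    """Try both save variants on every file, each in its own process."""
    results = {}
    for src in files:
        for mode in ("ez", "plain"):
            rc = run_isolated(SAVE_SNIPPET, src, f"out-{mode}.pdf", mode)
            results[(src, mode)] = describe(rc)
    return results
```

Calling survey(["2401.08541v1.pdf", "1706.03762v7.pdf"]) would then give one label per file and save variant, so a crash in one combination does not hide the outcome of the others.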

PyMuPDF version

1.24.5

Operating system

Linux

Python version

3.11

JorjMcKie commented 1 week ago

Thank you for submitting this.

This happens inside the base library MuPDF. I am going to transfer the issue to the team for investigation.

JorjMcKie commented 1 week ago

MuPDF issue reference https://bugs.ghostscript.com/show_bug.cgi?id=707835

SteveHawk commented 1 week ago

@JorjMcKie Thanks!