pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

apply_redactions() does not work as expected #3863

Closed nsklei closed 1 month ago

nsklei commented 1 month ago

Description of the bug

When calling apply_redactions(images=pymupdf.PDF_REDACT_IMAGE_NONE), I get several "MuPDF error: syntax error: cannot find XObject resource" errors, and some pages end up completely empty, although every page originally contains images.

How to reproduce the bug

import pymupdf
from io import BytesIO
from pathlib import Path

# Raw strings keep "\t" in the Windows path from being read as a tab character.
file_path = r"path\to\Example_PDF.pdf"
output_path = r"path\to\Example_PDF_redacted.pdf"

new_doc = pymupdf.open(file_path)

for num, page in enumerate(new_doc):
    print(f"Page {num + 1} - {page.rect}:")

    for image in page.get_images(full=True):
        print(f"  - Image: {image}")

    redact_rect = page.rect

    if page.rotation in {90, 270}:
        redact_rect = pymupdf.Rect(0, 0, page.rect.height, page.rect.width)

    page.add_redact_annot(redact_rect)
    page.apply_redactions(images=pymupdf.PDF_REDACT_IMAGE_NONE)

byte_stream = BytesIO()
new_doc.save(byte_stream)
byte_stream.seek(0)

Path(output_path).write_bytes(byte_stream.getvalue())

The code above prints the following information:

Page 1 - Rect(0.0, 0.0, 598.3200073242188, 813.5999755859375):
  - Image: (22, 0, 554, 754, 8, 'ICCBased', '', 'Im0', 'DCTDecode', 0)
  - Image: (23, 43, 554, 754, 8, 'ICCBased', '', 'Im1', 'DCTDecode', 0)
Page 2 - Rect(0.0, 0.0, 598.3200073242188, 816.47998046875):
  - Image: (25, 0, 554, 756, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (26, 44, 554, 756, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
Page 3 - Rect(0.0, 0.0, 815.760009765625, 596.8800048828125):
  - Image: (28, 0, 553, 756, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (29, 45, 553, 756, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
Page 4 - Rect(0.0, 0.0, 815.760009765625, 597.5999755859375):
  - Image: (31, 0, 554, 756, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (32, 46, 554, 756, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
Page 5 - Rect(0.0, 0.0, 815.0399780273438, 597.5999755859375):
  - Image: (34, 0, 554, 755, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (35, 47, 554, 755, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
Page 6 - Rect(0.0, 0.0, 806.4000244140625, 598.3200073242188):
  - Image: (37, 0, 554, 747, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (38, 48, 554, 747, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
Page 7 - Rect(0.0, 0.0, 815.0399780273438, 597.5999755859375):
  - Image: (39, 0, 554, 755, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (40, 49, 554, 755, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
MuPDF error: syntax error: cannot find XObject resource 'Im1'

MuPDF error: syntax error: cannot find XObject resource 'Im2'

Page 8 - Rect(0.0, 0.0, 815.760009765625, 596.8800048828125):
  - Image: (41, 0, 553, 756, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (42, 50, 553, 756, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
MuPDF error: syntax error: cannot find XObject resource 'Im1'

MuPDF error: syntax error: cannot find XObject resource 'Im2'

As you can see, each page contains two images. The function should remove all content from the PDF file except the images, but after saving the byte_stream some pages are completely empty.

PyMuPDF version

1.24.10

Operating system

Windows

Python version

3.12

JorjMcKie commented 1 month ago

This post cannot be accepted as a bug report because no reproducer file is provided.

JorjMcKie commented 1 month ago

test2.pdf

MuPDF bug report: https://bugs.ghostscript.com/show_bug.cgi?id=708032.

JorjMcKie commented 1 month ago

@nsklei - You are aware that all pages only contain images - no text, no vector graphics. So your redactions effectively are no-ops!

nsklei commented 1 month ago

Thank you for reviewing my issue and creating a bug report. The behaviour described in your bug report is correct. I am aware that all pages contain only images and nothing else, so the redactions should indeed be no-ops in this case.

JorjMcKie commented 1 month ago

I found that removing page rotation avoids the problem:

for page in doc:
    page.add_redact_annot(page.rect * page.derotation_matrix)
    page.remove_rotation()
    page.apply_redactions(images=pymupdf.PDF_REDACT_IMAGE_NONE)

Works without problem.
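
For anyone who wants to reuse the workaround, here is a minimal sketch that wraps the same three calls in a helper. The name redact_all_but_images and its parameters are hypothetical, not part of PyMuPDF. The document and the image flag are passed in by the caller, so the helper itself does not import pymupdf.

```python
def redact_all_but_images(doc, keep_images_flag):
    """Apply a full-page redaction to every page while leaving images alone.

    Hypothetical helper: `doc` is assumed to be an opened pymupdf.Document and
    `keep_images_flag` is assumed to be pymupdf.PDF_REDACT_IMAGE_NONE.
    Returns the number of pages processed.
    """
    processed = 0
    for page in doc:
        # Map the rotated page rectangle back to unrotated page coordinates.
        page.add_redact_annot(page.rect * page.derotation_matrix)
        # Workaround from this thread: remove the rotation before applying.
        page.remove_rotation()
        page.apply_redactions(images=keep_images_flag)
        processed += 1
    return processed
```

It would be called as redact_all_but_images(doc, pymupdf.PDF_REDACT_IMAGE_NONE) on a document opened with pymupdf.open(file_path), followed by doc.save(output_path). Note that remove_rotation() changes the stored page rotation, so the saved pages are no longer flagged as rotated.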

nsklei commented 1 month ago

Thank you for providing a solution to my problem. I tested your suggestion and it works perfectly :)

JorjMcKie commented 1 month ago

Thanks for the feedback! I am going to re-open this until the fix itself is publicly available. This is our policy for dealing with issue resolutions.

sebras commented 1 month ago

@JorjMcKie This appears to have been fixed upstream, so can be marked "fix developed"?

julian-smith-artifex-com commented 1 month ago

Fixed in 1.24.11.
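
Since the fix is tied to a release, code that must run on both older and newer installations could decide at runtime whether the rotation workaround is still needed. The following is only a sketch: the helper names are hypothetical, and it assumes a purely numeric version string such as "1.24.10" (suffixes like "rc1" are not handled).

```python
FIXED_IN = (1, 24, 11)  # release named in this thread as containing the fix


def parse_version(version):
    """Turn a numeric version string like '1.24.10' into a tuple of ints."""
    return tuple(int(part) for part in version.split(".")[:3])


def needs_rotation_workaround(installed_version):
    """Return True if the installed version predates the fix (hypothetical helper)."""
    return parse_version(installed_version) < FIXED_IN
```

The installed version string can be obtained with importlib.metadata.version("PyMuPDF") from the standard library and passed to needs_rotation_workaround().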