apply_redactions() does not work as expected

nsklei commented 1 month ago

Description of the bug

When I call apply_redactions(images=pymupdf.PDF_REDACT_IMAGE_NONE), I get several "MuPDF error: syntax error: cannot find XObject resource" errors. In addition, some pages in the output are completely empty, although every page originally contains images.

How to reproduce the bug

import pymupdf
from io import BytesIO
from pathlib import Path

# Raw strings, so that "\t" in the Windows path is not read as a tab character
file_path = r"path\to\Example_PDF.pdf"
output_path = r"path\to\Example_PDF_redacted.pdf"

new_doc = pymupdf.open(file_path)

for num, page in enumerate(new_doc):
    print(f"Page {num + 1} - {page.rect}:")

    for image in page.get_images(full=True):
        print(f"  - Image: {image}")

    redact_rect = page.rect

    if page.rotation in {90, 270}:
        redact_rect = pymupdf.Rect(0, 0, page.rect.height, page.rect.width)

    page.add_redact_annot(redact_rect)
    page.apply_redactions(images=pymupdf.PDF_REDACT_IMAGE_NONE)

byte_stream = BytesIO()
new_doc.save(byte_stream)
byte_stream.seek(0)

Path(output_path).write_bytes(byte_stream.getvalue())

The code above prints the following information:

Page 1 - Rect(0.0, 0.0, 598.3200073242188, 813.5999755859375):
  - Image: (22, 0, 554, 754, 8, 'ICCBased', '', 'Im0', 'DCTDecode', 0)
  - Image: (23, 43, 554, 754, 8, 'ICCBased', '', 'Im1', 'DCTDecode', 0)
Page 2 - Rect(0.0, 0.0, 598.3200073242188, 816.47998046875):
  - Image: (25, 0, 554, 756, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (26, 44, 554, 756, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
Page 3 - Rect(0.0, 0.0, 815.760009765625, 596.8800048828125):
  - Image: (28, 0, 553, 756, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (29, 45, 553, 756, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
Page 4 - Rect(0.0, 0.0, 815.760009765625, 597.5999755859375):
  - Image: (31, 0, 554, 756, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (32, 46, 554, 756, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
Page 5 - Rect(0.0, 0.0, 815.0399780273438, 597.5999755859375):
  - Image: (34, 0, 554, 755, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (35, 47, 554, 755, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
Page 6 - Rect(0.0, 0.0, 806.4000244140625, 598.3200073242188):
  - Image: (37, 0, 554, 747, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (38, 48, 554, 747, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
Page 7 - Rect(0.0, 0.0, 815.0399780273438, 597.5999755859375):
  - Image: (39, 0, 554, 755, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (40, 49, 554, 755, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
MuPDF error: syntax error: cannot find XObject resource 'Im1'

MuPDF error: syntax error: cannot find XObject resource 'Im2'

Page 8 - Rect(0.0, 0.0, 815.760009765625, 596.8800048828125):
  - Image: (41, 0, 553, 756, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (42, 50, 553, 756, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
MuPDF error: syntax error: cannot find XObject resource 'Im1'

MuPDF error: syntax error: cannot find XObject resource 'Im2'

As you can see, each page contains two images. The call should remove all content from the PDF file except the images, but in the saved output some pages are completely empty.
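One detail in the output above may help narrow this down: on pages 7 and 8 the errors name 'Im1' and 'Im2', while get_images() lists those pages' images as 'Im001' and 'Im002'. Below is a minimal diagnostic sketch for checking whether a page's content stream invokes XObject names that are not among its listed images. The helper names (referenced_xobjects, missing_xobjects, report) are hypothetical and not part of PyMuPDF. The regex scan is a rough approximation: it ignores #xx name escapes, can be fooled by inline image data, and will also flag form XObjects, since those are invoked with the same Do operator. I am not claiming this mismatch is the cause, only that it seems worth checking before and after apply_redactions().

```python
import re

# Matches "/Name Do", the content-stream operator that paints an XObject.
_DO_OP = re.compile(rb"/([^\s/\[\]<>(){}%]+)\s+Do\b")


def referenced_xobjects(stream: bytes) -> set[str]:
    """Return the XObject names invoked with the Do operator in a content stream."""
    return {name.decode("latin-1") for name in _DO_OP.findall(stream)}


def missing_xobjects(stream: bytes, declared) -> set[str]:
    """Return names invoked in the stream that are absent from `declared`."""
    return referenced_xobjects(stream) - set(declared)


def report(path: str) -> None:
    """Print, per page, the invoked names that get_images() does not list."""
    import pymupdf  # imported here so the helpers above work without PyMuPDF

    with pymupdf.open(path) as doc:
        for page in doc:
            # Item 7 of each get_images(full=True) entry is the resource name.
            declared = {img[7] for img in page.get_images(full=True)}
            missing = missing_xobjects(page.read_contents(), declared)
            print(f"Page {page.number + 1}: {sorted(missing)}")
```

Calling report() on the original file and again on the redacted output would show whether the names diverge already in the input or only after redaction.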

PyMuPDF version

1.24.10

Operating system

Windows