pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Redaction Annotation Fill Not Matching Up With Redacted Section #3575

Closed lyon-tonic closed 1 week ago

lyon-tonic commented 2 weeks ago

Description of the bug

I am trying to redact words from a PDF, based on OCR-generated rectangles.

PyMuPDF has worked well for us, but I have run into an odd situation with one specific file (attached). Its pages have an unusual size (8.5 x 6.5 in), and some of them are rotated.

I would like the rectangle coordinates to be relative to the top left of the page, but even before getting to that, I noticed that the redaction rectangle is not in the same place as its fill.

If this is not a bug, I would like to understand why the two appear to be drawn in separate coordinate systems, and how to reconcile them.
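For orientation: PDF page coordinates are measured in points, with 72 points per inch. The short sketch below converts the page size mentioned above to points. The helper name `inches_to_points` is made up for illustration and is not part of PyMuPDF. Under that conversion, the value 234 used in the script below corresponds to half of the 6.5 in side.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit: 1 pt = 1/72 in


def inches_to_points(inches: float) -> float:
    """Convert a length in inches to PDF points (hypothetical helper)."""
    return inches * POINTS_PER_INCH


long_side = inches_to_points(8.5)   # the 8.5 in side of the page
short_side = inches_to_points(6.5)  # the 6.5 in side of the page
half_short_side = short_side / 2    # the value hard-coded in the script

print(long_side, short_side, half_short_side)
```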

How to reproduce the bug

This simple script reproduces the problem with the files below:

Input: input.pdf

Output: output.pdf

import fitz  # PyMuPDF

def process_pdf(input_pdf_path, output_pdf_path):
    # Open the input PDF file
    document = fitz.open(input_pdf_path)

    # Iterate through each page
    for page_num in range(len(document)):
        page = document.load_page(page_num)  # load page

        # 234 is half of the width of the page
        rect = fitz.Rect(0, 0, 234, 234)

        redact_annot = page.add_redact_annot(rect)
        redact_annot.update(fill_color=(0, 0, 0))  # set fill color to black
        page.apply_redactions()
        page.insert_textbox(rect, f"Page {page_num + 1}", fontsize=12, fontname="helv", color=(1, 0, 0))

    document.save(output_pdf_path)

if __name__ == "__main__":
    input_pdf_path = "input.pdf"  # Replace with the path to your input PDF
    output_pdf_path = "output.pdf"  # Replace with the path to your output PDF

    process_pdf(input_pdf_path, output_pdf_path)
    print(f"Processed PDF saved to {output_pdf_path}")

PyMuPDF version

1.24.5

Operating system

Windows

Python version

3.11

JorjMcKie commented 2 weeks ago

Inserting or adding content to rotated pages can be confusing. For most methods in PyMuPDF, you must pass coordinates (points, rectangles, ...) that have been transformed with the page's derotation matrix so that they end up in the right place. I think this script does what you want:

import pymupdf as fitz  # PyMuPDF

RED = fitz.pdfcolor["red"]

def process_pdf(input_pdf_path, output_pdf_path):
    # Open the input PDF file
    document = fitz.open(input_pdf_path)

    # Iterate through each page
    for page in document:
        # 234 is half of the width of the page
        rect = fitz.Rect(0, 0, 234, 234)
        rot_rect = rect * page.derotation_matrix
        redact_annot = page.add_redact_annot(
            rot_rect, text=f"{page.number=}", text_color=RED
        )
        page.apply_redactions()

    document.ez_save(output_pdf_path)

if __name__ == "__main__":
    input_pdf_path = "input.pdf"  # Replace with the path to your input PDF
    output_pdf_path = "output.pdf"  # Replace with the path to your output PDF

    process_pdf(input_pdf_path, output_pdf_path)
    print(f"Processed PDF saved to {output_pdf_path}")
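
To make the coordinate conversion less abstract, here is a first-principles sketch of the geometry in plain Python. It is not PyMuPDF's internal implementation: the function names are hypothetical, and it assumes a top-left origin with y growing downward, a clockwise rotation in multiples of 90 degrees, and an example unrotated page of 612 x 468 pt. In real code, `rect * page.derotation_matrix` as used in the script above is the way to go.

```python
def derotate_point(x, y, w, h, rotation):
    """Map a point from the visible (rotated) page to the unrotated page.

    w, h: width and height of the *unrotated* page.
    rotation: 0, 90, 180 or 270 degrees clockwise.
    Coordinates use a top-left origin with y growing downward.
    """
    if rotation == 0:
        return (x, y)
    if rotation == 90:
        return (y, h - x)
    if rotation == 180:
        return (w - x, h - y)
    if rotation == 270:
        return (w - y, x)
    raise ValueError("rotation must be 0, 90, 180 or 270")


def derotate_rect(rect, w, h, rotation):
    """Map a rectangle (x0, y0, x1, y1) and re-normalize its corners."""
    x0, y0, x1, y1 = rect
    ax, ay = derotate_point(x0, y0, w, h, rotation)
    bx, by = derotate_point(x1, y1, w, h, rotation)
    return (min(ax, bx), min(ay, by), max(ax, bx), max(ay, by))


# Assumed example: unrotated page of 612 x 468 pt, visible rect in the
# top-left corner of the page as the reader sees it.
visible_rect = (0, 0, 234, 234)
for rotation in (0, 90, 180, 270):
    print(rotation, derotate_rect(visible_rect, 612, 468, rotation))
```

The same visible rectangle lands in a different corner of the unrotated page for each rotation, which is why passing it unchanged puts the redaction in the wrong place on rotated pages.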

lyon-tonic commented 2 weeks ago

Thanks for responding!

This addresses part of the problem, but it still does not fix the redact_annot fill. The fill rectangle appears to be rendered separately from the redact_annot, and I'm not sure why.

With this script, the black fill rectangle does not show up:

import pymupdf as fitz  # PyMuPDF

RED = fitz.pdfcolor["red"]

def process_pdf(input_pdf_path, output_pdf_path):
    # Open the input PDF file
    document = fitz.open(input_pdf_path)

    # Iterate through each page
    for page in document:
        # 234 is half of the width of the page
        rect = fitz.Rect(0, 0, 234, 234)
        rot_rect = rect * page.derotation_matrix
        redact_annot = page.add_redact_annot(
            rot_rect, text=f"{page.number=}", text_color=RED
        )
        redact_annot.update(fill_color=(0, 0, 0))  # set fill color to black
        page.apply_redactions()

    document.ez_save(output_pdf_path)

if __name__ == "__main__":
    input_pdf_path = "input.pdf"  # Replace with the path to your input PDF
    output_pdf_path = "output.pdf"  # Replace with the path to your output PDF

    process_pdf(input_pdf_path, output_pdf_path)
    print(f"Processed PDF saved to {output_pdf_path}")

JorjMcKie commented 2 weeks ago

This file indeed does a few unexpected things! Here is a complete solution that removes the page rotations.

import pymupdf as fitz  # PyMuPDF

RED = fitz.pdfcolor["red"]
BLACK = fitz.pdfcolor["black"]

def process_pdf(input_pdf_path, output_pdf_path):
    rect = fitz.Rect(0, 0, 234, 234)

    # Open the input PDF file
    src = fitz.open(input_pdf_path)
    doc = fitz.open()  # output file

    # Iterate through each page
    for src_page in src:
        # the output PDF will contain pages with rotation 0
        src_rect = src_page.rect
        w, h = src_rect.br
        src_rot = src_page.rotation
        src_page.set_rotation(0)
        # make output page having the visible dimension of the input
        page = doc.new_page(width=w, height=h)
        page.show_pdf_page(  # insert source page
            page.rect,
            src,
            src_page.number,
            rotate=-src_rot,  # reversed original rotation
        )

        # now we can redact in a worry-free manner
        redact_annot = page.add_redact_annot(
            rect, text=f"{page.number=}", text_color=RED, fill=BLACK
        )
        page.apply_redactions()

    doc.ez_save(output_pdf_path)

if __name__ == "__main__":
    input_pdf_path = "input.pdf"  # Replace with the path to your input PDF
    output_pdf_path = "output.pdf"  # Replace with the path to your output PDF

    process_pdf(input_pdf_path, output_pdf_path)
    print(f"Processed PDF saved to {output_pdf_path}")
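
One detail of the script above is easy to miss: `src_page.rect` is read before the rotation is reset, so each output page is created with the dimensions the reader actually sees. The relationship between unrotated and visible size can be sketched in plain Python. `visible_size` is a hypothetical helper, not a PyMuPDF function, and the 612 x 468 pt page is an assumed example.

```python
def visible_size(w, h, rotation):
    """Return (width, height) as displayed for an unrotated w x h page.

    rotation is in degrees and must be a multiple of 90.
    """
    rotation %= 360  # normalize, e.g. -90 becomes 270
    if rotation not in (0, 90, 180, 270):
        raise ValueError("rotation must be a multiple of 90")
    # A quarter turn in either direction swaps width and height.
    return (h, w) if rotation in (90, 270) else (w, h)


for rotation in (0, 90, 180, 270):
    print(rotation, visible_size(612, 468, rotation))
```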

JorjMcKie commented 1 week ago

Closing the issue for lack of reaction.