pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Page.apply_redactions() removes more text than expected in the pdf document. #3433

Closed dameyerdave closed 1 week ago

dameyerdave commented 2 months ago

Description of the bug

As soon as I apply the redactions, all the text and graphics are lost from the PDF.

Source:

Receipt pdf

Annotated:

Receipt_annot

After apply_redactions():

Screenshot 2024-05-02 at 14 40 02

How to reproduce the bug

This is the code I wrote to get to this:

import fitz  # PyMuPDF

# some_text_array: the list of strings to redact (defined elsewhere)
doc = fitz.open("./Receipt.pdf")
for page in doc:
    for text in some_text_array:
        for area in page.search_for(text, quads=True):
            redaction = page.add_redact_annot(
                area,
                fill=(0, 0, 0),
            )
            redaction.update()

    # here it happens
    page.apply_redactions(0, 0, 0)

doc.save("./redacted.pdf")
doc.close()

PyMuPDF version

1.24.2

Operating system

MacOS

Python version

3.10

JorjMcKie commented 2 months ago

Please provide all mandatory information - in this case, the reproducing file is missing.

dameyerdave commented 2 months ago

I'm sorry for that. These are the files:

JorjMcKie commented 2 months ago

Thanks for the examples. Sorry I cannot find a problem. Made a redaction to remove "David Meyer" and it simply worked!

>>> for r in page.search_for("david meyer"):
...     page.add_redact_annot(r)
...
'Redact' annotation on page 0 of original.pdf
>>> page.apply_redactions(0, 0, 0)
True
>>> doc.ez_save("x-1.24.2.pdf")

image

In the meantime, I also redacted other parts of the page (the text "October 19, 2023"), and they also worked without complaints.

aleem75321 commented 2 months ago

Hi @JorjMcKie, I have faced the same issue when applying redactions: they remove images that should not be removed, or they change text. test.pdf test2.pdf

I have attached both pdf to reproduce the issue

test_Original_image test_after_redacttion

test2_Original_image test2_after_redacttion test2_Original_text_issue test2_after_redact_text_issue

Code:-

import fitz
from pathlib import Path

file_path=Path(r"test_pages/test.pdf")

doc=fitz.open(file_path)
page=doc[0]

blocks=page.get_text("rawdict",flags=fitz.TEXTFLAGS_TEXT,sort=True)["blocks"]  
# Set colour for output PDF
Red = fitz.pdfcolor["red"]

for b in  blocks:
    for l in b["lines"]:  
        for s in l["spans"]:
            for c in s["chars"]:

                if s["size"]>15 and s['color']==2236191: 
                    if c['c']== "ं":
                        try:
                            font = fitz.Font(fontname=s['font'],fontfile=f"{s['font']}.ttf")  # this must be known somehow - or simply try some font else
                        except Exception as e:
                            print(str(e))  
                        redact_box = fitz.Rect(c["bbox"]) 
                        origin_text = fitz.Point(c["origin"]) 
                        redact_box.y1 = redact_box.y1-s['size'] 
                        page.add_redact_annot(redact_box) 
                        # Apply the redaction right away, leaving images and line art untouched
                        page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE,graphics=fitz.PDF_REDACT_LINE_ART_NONE)
                        # Create Text writer to Write in Page with choose Color
                        tw = fitz.TextWriter(page.rect,color=Red)  
                        #re-insert same text - different color
                        tw.append((origin_text.x,origin_text.y), text=c['c'],fontsize=s['size'],font=font)
                        tw.write_text(page) 

# Save a backup file for future use
out_fpath="OUT/"+file_path.stem+".pdf"
doc.save(out_fpath,garbage=3, deflate=True)
doc.close()

PyMuPDF version 1.24.2

Operating system windows

Python version 3.11.4

JorjMcKie commented 2 months ago

@aleem75321 please submit this as a different issue - this is too confusing in this context. When you do, please save the PDF when you have inserted all redactions - before applying them. I need to confirm where your code has put them - without the need to understand your code. Then attach this PDF to confirm that bad things happen on applying redactions.
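
A minimal sketch of what this request could look like: one file saved with the redaction annotations still pending, and a second one saved after they are applied. The helper names, file paths, and search terms are hypothetical; the PyMuPDF calls are the ones already used in this thread (`search_for`, `add_redact_annot`, `apply_redactions`, `save`).

```python
def mark_redactions(page, needles):
    """Add a redaction annotation for every hit; return how many were added."""
    added = 0
    for needle in needles:
        for rect in page.search_for(needle):
            page.add_redact_annot(rect)
            added += 1
    return added


def save_marked_then_redacted(src, marked_path, redacted_path, needles):
    """Write one PDF with pending redactions and one with them applied."""
    import fitz  # PyMuPDF; imported here so mark_redactions() stays dependency-free

    doc = fitz.open(src)
    for page in doc:
        mark_redactions(page, needles)
    doc.save(marked_path)      # annotations are present, nothing removed yet
    for page in doc:
        page.apply_redactions()
    doc.save(redacted_path)    # same document after applying the redactions
    doc.close()
```

Calling `save_marked_then_redacted("test.pdf", "marked.pdf", "redacted.pdf", ["some text"])` would produce both files, so the first one can be attached to show where the redactions were placed.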

aleem75321 commented 2 months ago

I have submitted a separate issue; please see the link below.

Facing Issues after applying redactions they delete some Images or Icons #3439

dameyerdave commented 2 months ago

I reduced the application to the bare minimum and still encounter the same issue. I tried it on a Mac M3, on Ubuntu Linux (Intel), and in a Docker container with platform: linux/amd64, all without success.

import fitz

doc = fitz.open("./original.pdf")
for page in doc:
    for r in page.search_for("David Meyer"):
        page.add_redact_annot(r)

    page.apply_redactions(0, 0)
doc.ez_save("redacted.pdf")

With the following files:

original.pdf redacted.pdf

I don't know what to try now... If you have another good idea, please let me know...
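
One way to describe such a result without screenshots might be to compare the words on the page before and after the redaction. This is only a sketch: `unexpected_losses` is a hypothetical helper, and it assumes the word lists come from `page.get_text("words")`, where the word string is the fifth item of each tuple.

```python
from collections import Counter


def unexpected_losses(words_before, words_after, redacted_terms):
    """Return words that vanished although they are not part of a redacted term."""
    expected = set()
    for term in redacted_terms:
        expected.update(term.split())
    # Multiset difference: every word occurrence that is gone after redaction
    lost = Counter(words_before) - Counter(words_after)
    return sorted(w for w in lost.elements() if w not in expected)


# Hypothetical usage with a PyMuPDF page:
#   before = [w[4] for w in page.get_text("words")]
#   ... add_redact_annot(...) and apply_redactions(...) ...
#   after = [w[4] for w in page.get_text("words")]
#   print(unexpected_losses(before, after, ["David Meyer"]))
```

An empty list would mean that only the redacted words disappeared. The comparison here is exact and case-sensitive, whereas `search_for` ignores case, so the terms should be spelled as they appear on the page.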

JorjMcKie commented 2 months ago

@dameyerdave we (a colleague of mine and I) have now tried on all 3 platforms (Mac, Linux, Win) with fitz.version=('1.24.2', '1.24.1', '20240417000001') and are getting the correct, flawless result. 🤷‍♂️ That is, no black rectangle, and "David Meyer" removed completely.

JorjMcKie commented 2 months ago

My only advice is to re-install 1.24.2. There has been a redaction issue previously. I will try with 1 or 2 previous versions.

JorjMcKie commented 2 months ago

No such luck: at least on Windows, all versions back to 1.23.26 work correctly. So your best option is probably to re-install the latest version.
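
Because results differ between installations, it may help to confirm which version the Python environment actually picks up before and after re-installing. The sketch below uses only the standard library; the helper names are hypothetical, and it assumes the package is installed under the distribution name "PyMuPDF".

```python
from importlib import metadata


def parse_version(text):
    """Turn '1.23.26' into (1, 23, 26) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def installed_version(dist_name="PyMuPDF"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("PyMuPDF:", installed_version())
```

Comparing the parsed tuples rather than the raw strings avoids surprises such as "1.9" sorting after "1.10".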

luchux commented 1 month ago

We are facing exactly the same problem as everybody else reporting the bug in this thread. The version in our environment is PyMuPDF 1.24.0.

I tried dropping images=0 from the apply_redactions() call and also tried every possible value for that parameter. I also tried saving without the garbage and deflate options.
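
Trying every value of the `images` parameter can be scripted so that each run starts from a fresh copy of the file. This is a rough sketch, not the code used in the report above: the helper names, the output file pattern, and the choice of page 0 are assumptions. The three constants are PyMuPDF's documented image redaction modes.

```python
def redact_with_each_image_mode(open_document, needle, modes, out_pattern):
    """Redact `needle` on page 0 once per image mode; return the files written."""
    written = {}
    for name, value in modes.items():
        doc = open_document()          # fresh, unmodified copy for every run
        page = doc[0]
        for rect in page.search_for(needle):
            page.add_redact_annot(rect)
        page.apply_redactions(images=value)
        out_path = out_pattern.format(name)
        doc.save(out_path)
        doc.close()
        written[name] = out_path
    return written


def run_on_file(path, needle):
    """Hypothetical driver: one output PDF per image mode."""
    import fitz  # PyMuPDF

    modes = {
        "none": fitz.PDF_REDACT_IMAGE_NONE,
        "remove": fitz.PDF_REDACT_IMAGE_REMOVE,
        "pixels": fitz.PDF_REDACT_IMAGE_PIXELS,
    }
    return redact_with_each_image_mode(
        lambda: fitz.open(path), needle, modes, "redacted-{}.pdf"
    )
```

Attaching the input file together with the resulting PDFs would show whether the outcome depends on the image mode at all.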

Exactly the same error as other people:

Original PDF before redaction

Screenshot 2024-05-06 at 6 13 11 PM

After applying the redaction to the text "Origin"

Screenshot 2024-05-06 at 6 14 09 PM

We would love to know if you are aware of this bug, and if there is a stable version that works properly without this bug. Thanks a lot!

luchux commented 1 month ago

Another example. I have now tested 3 versions:

1.24.0 and 1.24.2: failing

1.23.26: working well, the redaction works

Original before redaction:

Screenshot 2024-05-06 at 6 58 32 PM

After text redacted 1.24.0 and 1.24.2:

Screenshot 2024-05-06 at 6 58 14 PM

after text redacted with 1.23.26 (working!)

Screenshot 2024-05-06 at 7 56 49 PM

JorjMcKie commented 1 month ago

@luchux - "A picture is worth a thousand words."

Certainly true. But a thousand pictures are not worth a million words! Please add an example file, and no more pictures, if you want us to confirm that yours is another duplicate of #3376.

Please also note that the problem in this post is not yet reproducible, so it is unclear whether it is a bug at all.

JorjMcKie commented 1 week ago

Closing this because the requested information has been missing for a long time.