pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

apply_redactions causes part of the page content to be hidden / transparent #3751

Open beeing opened 1 month ago

beeing commented 1 month ago

Description of the bug

I'm adding a redaction region to a part of the PDF, but after calling apply_redactions(), one side of the entire page goes missing (opened in macOS Preview app or Safari).

Further inspection reveals that the text is not missing: it is selectable and can be copied out properly. It seems the text has been masked or hidden, but I could not find out how to check further (sorry, my knowledge of PDF structure is limited).

The media, crop, art, bleed, and trim boxes all look fine before and after the redactions. I also checked whether other paths or objects might be causing it, but found nothing.
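A box comparison like the one described above could be sketched as follows. This is a minimal illustration, not code from the thread: the helper names page_boxes and changed_boxes are hypothetical, and the functions only assume an object that exposes the mediabox, cropbox, artbox, bleedbox and trimbox properties, as a PyMuPDF Page does.

```python
# Sketch: compare the five page boxes before and after redaction.
# Works on any object exposing PyMuPDF's Page box properties.
BOX_NAMES = ("mediabox", "cropbox", "artbox", "bleedbox", "trimbox")


def page_boxes(page):
    """Return {box name: (x0, y0, x1, y1)} for a page-like object."""
    return {name: tuple(getattr(page, name)) for name in BOX_NAMES}


def changed_boxes(before, after):
    """Return the names of boxes whose coordinates differ exactly."""
    return [name for name in BOX_NAMES if before[name] != after[name]]


# Usage with PyMuPDF (the file name is a placeholder):
#   import pymupdf
#   doc = pymupdf.open("input.pdf")
#   page = doc[0]
#   before = page_boxes(page)
#   page.add_redact_annot((0, 0, 1, 1))
#   page.apply_redactions()
#   print(changed_boxes(before, page_boxes(page)))
```

An empty list means the boxes are unchanged, which matches what is reported here and points away from the page geometry as the cause.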

Note that I'm not able to share the actual PDF but it was generated from Puppeteer / Chromium (PDF ver 1.7).

Thanks in advance for looking into this.

How to reproduce the bug

  1. Generate PDF from Chromium / Puppeteer
  2. Add a redaction of any size, e.g. (0,0,1,1), and call apply_redactions()
  3. Open the PDF in Preview App.

PyMuPDF version

1.24.9

Operating system

MacOS

Python version

3.9

Kyrylo-Hrytsenko commented 1 month ago

I faced the same problem. Note: this problem does not exist in version 1.23.9; all higher versions have it.

There is also an interesting thing: if you open the redacted PDF in Chrome or LibreOffice, it looks as expected. But the issue is reproducible at least with Mac Preview and the react-pdf-viewer lib.

JorjMcKie commented 1 month ago

As always: please provide an example PDF! There is no way to otherwise deal with this post.

beeing commented 1 month ago

I've just tested and it works on older versions (up to pymupdf-1.23.26).

It is perhaps easier to compare the commits from https://github.com/pymupdf/PyMuPDF/commit/a868c0a556e39198549e4e139534bb12b2623c5d to HEAD.

Kyrylo-Hrytsenko commented 1 month ago

@JorjMcKie Here is an example: original.pdf, redacted.pdf

Please open the redacted file with the Preview app. It will look like this:

Screenshot 2024-08-06 at 3 38 40 PM

The code for redaction looks like this:

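import fitz  # PyMuPDF (legacy import name)
from io import BytesIO

# input_file and output_file are assumed to be paths defined elsewhere
pdfIn = fitz.open(input_file)

out_buffer = BytesIO()

page = pdfIn[0]

# mark four squares along the page diagonal for redaction, filled black
page.add_redact_annot([0,0,100,100], text=None, fill=(0, 0, 0))
page.add_redact_annot([100,100,200,200], text=None, fill=(0, 0, 0))
page.add_redact_annot([200,200,300,300], text=None, fill=(0, 0, 0))
page.add_redact_annot([300,300,400,400], text=None, fill=(0, 0, 0))

# remove the content under the redaction annotations
page.apply_redactions()

pdfIn.save(out_buffer, garbage=3, deflate=True)
pdfIn.close()

# the with block closes the file, so no explicit f.close() is needed
with open(output_file, mode='wb') as f:
    f.write(out_buffer.getbuffer())
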
Kyrylo-Hrytsenko commented 1 month ago

@JorjMcKie May I ask if you have received the PDF for reproducing the issue?

JorjMcKie commented 1 month ago

@Kyrylo-Hrytsenko Thanks, I did.

I executed the script and found no problem at all using v1.24.9. I modified the script somewhat so redaction rectangles are visible and erased areas are not filled:

import pymupdf

print(pymupdf.version)
pdfIn = pymupdf.open("original.pdf")

page = pdfIn[0]
rects = (
    [0, 0, 100, 100],
    [100, 100, 200, 200],
    [200, 200, 300, 300],
    [300, 300, 400, 400],
)
for r in rects:
    page.draw_rect(r, color=(1, 0, 0))
    page.add_redact_annot(r)

page.apply_redactions()

pdfIn.ez_save("output.pdf")

Gives this correct result: output.pdf

Kyrylo-Hrytsenko commented 1 month ago

@JorjMcKie Your result file looks like this for me in the Preview app:

Screenshot 2024-08-16 at 1 59 20 PM

Notes:

Does the output.pdf look normal when you open it in the 'Preview' application?

JorjMcKie commented 1 month ago

I do not use or have Preview. My file is displayed correctly in all PDF viewers like Adobe Acrobat, Foxit, Nitro, PDF XChange, and evince (Linux). So all authoritative applications behave correctly. No idea what is wrong with Preview.

jamie-lemon commented 1 month ago

I can confirm that I also see this problem in Preview. However, it is fine when I open it in Adobe Acrobat. To me this feels like a Preview rendering bug. I would submit a bug to Apple if that is possible!

JorjMcKie commented 1 month ago

@jamie-lemon absolutely correct! I was about to write a similar comment. We will now close this issue.

Kyrylo-Hrytsenko commented 1 month ago

@jamie-lemon @JorjMcKie I don't think it's only a Preview bug. Here is why:

  • For me, it happens not only with Preview but also with the react-pdf-viewer library at least
  • With an older version of your library everything works fine, which means something was changed and caused this issue
  • Original files (before redaction) render correctly with Preview and with react-pdf-viewer, which means something in the redaction process causes this issue.

yuhuang-cst commented 1 month ago

@JorjMcKie @jamie-lemon In addition to Mac Preview, Safari, UPDF, and PDF Expert also fail to display output.pdf correctly.

jamie-lemon commented 1 month ago

This is a strange bug. I thought it might be related to the content on page 1, but if I simplify things and target the 2nd page with a single area redaction:

import pymupdf

print(pymupdf.version)
pdfIn = pymupdf.open("original.pdf")

page = pdfIn[1]  # 2nd page
rects = (
    [0, 0, 100, 100],
)
for r in rects:
    page.draw_rect(r, color=(1, 0, 0))  # outline the area in red
    page.add_redact_annot(r)

page.apply_redactions()

pdfIn.ez_save("redacted.pdf")

Then I get:

Screenshot 2024-08-16 at 18 05 59

I also noticed that it doesn't matter how big the redaction area is. I could do this:

rects = (
    [0, 0, 0, 0],
)

And achieve the same resulting problem with the left hand side of the page. I could also put that rect anywhere on the page - it didn't have to be in the top left.

Testing with other documents, redacting and viewing in Preview I don't find this issue at all, so I think there must be something very specific to this document which will need further research.

Kyrylo-Hrytsenko commented 1 month ago

@jamie-lemon

so I think there must be something very specific to this document which will need further research.

Totally agree. Only a small number of my documents have this bug. I didn't even plan to write to you but then noticed that this is happening not only to me and the bug was already created, so I added my example as well.

jamie-lemon commented 1 month ago

@Kyrylo-Hrytsenko Much appreciated!

jamie-lemon commented 1 month ago

This is the simplest case I could find - I made this PDF in Adobe Acrobat, then took it into Preview and then did "Export" as a new PDF.

preview-made.pdf

When you redact with PyMuPDF the logo disappears when you view it in Preview, e.g.

Screenshot 2024-08-16 at 21 24 35
jamie-lemon commented 1 month ago

So it seems if the PDF is made in Preview then this might have something to do with the problem.

yuhuang-cst commented 1 month ago

The issue I am encountering is that if apply_redactions is used, the vector graphics on the page all move to the bottom left corner in Preview, Safari, UPDF, and PDF Expert, whereas they display correctly in Chrome, Adobe Acrobat Reader, and WPS. Here is the code:

import fitz

doc = fitz.open('origin.pdf')
page = doc.load_page(0)
# zero-size redaction with no fill: nothing on the page should change
page.add_redact_annot((0, 0, 0, 0), fill=False)
page.apply_redactions()
doc.ez_save('apply_redaction.pdf')
doc.close()

origin.pdf apply_redaction.pdf

The origin.pdf is from the second page of the AlphaGo paper: https://www.nature.com/articles/nature16961
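Graphics that jump to a corner in some viewers suggest looking at how the page's content stream handles the graphics state. The thread does not establish the cause, so the following is only one possible diagnostic: a naive count of the q, Q and cm operators, which could be compared between origin.pdf and apply_redaction.pdf. The helper name count_state_operators is hypothetical.

```python
from collections import Counter

# Graphics-state operators that affect where later drawing lands:
# q/Q save and restore the graphics state, cm changes the matrix.
WATCHED = (b"q", b"Q", b"cm")


def count_state_operators(stream):
    """Count q, Q and cm tokens in raw content-stream bytes.

    Naive whitespace tokenizer: it can miscount if those byte
    sequences occur inside string literals or inline image data.
    """
    counts = Counter(tok for tok in stream.split() if tok in WATCHED)
    return {op.decode(): counts[op] for op in WATCHED}


# Usage with PyMuPDF (Page.read_contents() returns the page's
# concatenated content streams as bytes):
#   import pymupdf
#   page = pymupdf.open("origin.pdf")[0]
#   print(count_state_operators(page.read_contents()))
```

A difference between the q and Q counts, or a changed number of cm operators after redaction, would be a hint worth attaching to a bug report, not proof of the defect.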

yuhuang-cst commented 1 month ago

The PDF generated with PyMuPDF version 1.23.26 displays the vector graphics correctly in Preview (although the image in the top right corner is partially missing). However, starting from version 1.24.0, there is a bug where the vector graphics are moved to the bottom left corner. apply_redaction_1.23.26.pdf apply_redaction_1.24.0.pdf
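For anyone who wants to check an installed version against these reports, a small helper might look like the sketch below. The boundary is an assumption taken only from this thread (fine up to 1.23.26, broken from 1.24.0), and the names parse_version and reported_as_affected are hypothetical.

```python
import re

# Assumption from the reports in this thread only:
# 1.23.26 and earlier look fine in Preview, 1.24.0 and later do not.
FIRST_REPORTED_BAD = (1, 24, 0)


def parse_version(text):
    """Turn '1.24.9' into (1, 24, 9), padding missing parts with 0."""
    parts = re.findall(r"\d+", text)[:3]
    parts += ["0"] * (3 - len(parts))
    return tuple(int(p) for p in parts)


def reported_as_affected(version_text):
    """True if the version falls in the range reported as broken here."""
    return parse_version(version_text) >= FIRST_REPORTED_BAD


# Usage (reads the installed package version via the standard library):
#   from importlib.metadata import version
#   print(reported_as_affected(version("PyMuPDF")))
```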

JorjMcKie commented 1 month ago

It seems that primarily Mac-based tools have problems with redacted PDFs that have been created with Preview. I am experimenting with the MuPDF development version 1.25.0. The current PyMuPDF v1.24.9 uses MuPDF v1.24.8.

When creating and applying annotations using PyMuPDF 1.24.9 with MuPDF 1.25.0, I no longer see the error in the Firefox browser, which otherwise behaves as awkwardly as all those Mac apps.

I am attaching the produced output.pdf inviting Mac users to access it with their Preview on Mac: output.pdf

yuhuang-cst commented 1 month ago

It seems that this bug still exists in Mac Preview.

JorjMcKie commented 1 month ago

@yuhuang-cst thanks for the feedback anyway

jamie-lemon commented 1 month ago

Can also confirm that the bug doesn't exist in PyMuPDF version 1.23.9

JorjMcKie commented 4 weeks ago

I have submitted a problem report in MuPDF's system here: https://bugs.ghostscript.com/show_bug.cgi?id=707966