pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Question / Comment: redaction tool #499

Closed martinevanschouwenburg closed 4 years ago

martinevanschouwenburg commented 4 years ago

I've been playing around with the redaction options which are great! (as is the package in general!)

I have a couple of questions regarding the colour of the redaction though (the final redaction, not the markup/annotation)

1) Is it possible to redact a word using a white box surrounded by a black line? I've tried annot.setColors(stroke=(0, 0, 0), fill=(1, 1, 1)), but I get a warning "warning: annot type has no fill color".

2) which brings me to my next question, how come you can set the 'fill' but not the 'stroke' with the function addRedactAnnot, if this particular annot type has no fill color?

3) I've added an annotation like this: page.addRedactAnnot(areas[0], fill=(1, 0, 1)). When I apply the redaction in Python (page.apply_redactions()), it works as expected and the resulting PDF shows a purple redaction. If I don't apply it and instead open the PDF in MuPDF, I see the markup, and when I press 'Redact', the redaction turns black. If I do the same in Adobe, I see the markup, and when I press 'Redact', the redaction turns white.

Any thoughts on this? Could it be that different programs handle the features (color in this case) of the redact annotation differently?

Thank you for your time,

Martine

JorjMcKie commented 4 years ago

The handling of annotations in general is highly tool-dependent. Changing an annotation created by some other tool almost certainly produces disappointing results. See these comments in the documentation.

I have taken the liberty of extending MuPDF here (and in a number of other annotation-related aspects), so MuPDF cannot do everything that PyMuPDF does. MuPDF can only put black rectangles over all redaction areas (or leave them all white). Nor can it insert replacement text. And by the way, PyMuPDF does not fully implement the PDF spec for redactions either:

martinevanschouwenburg commented 4 years ago

Thank you for your quick reply. I was hoping to add the redaction annotations in Python and have someone else review/edit them in a PDF editor (e.g. MuPDF or Adobe) before applying the actual redaction. I'll think of another way to do this then.

One another question then; I still don't understand why in the page.addRedactAnnot function you can give fill as an argument, but not stroke, while for the annot.setColors function you give stroke as an argument, but not fill.

JorjMcKie commented 4 years ago

was hoping to apply the redaction annotations in python, and to be able to have someone else review/edit them in a pdf editor (fe MuPDF or Adobe) before applying the actual redacting.

Maybe you should continue looking for an adequate PDF viewer - or a commercial version of Acrobat might do the job. In any case, the parameters I give the redaction annotation conform to the spec. It's just that the two candidates you have tried so far are insufficient. Another idea - requiring more effort - might be to write your own viewer (taking an existing example as a start) and add a new "redact" button for approval.

I still don't understand why in the page.addRedactAnnot function you can give fill as an argument, but not stroke, while for the annot.setColors function you give stroke as an argument, but not fill.

There are two stages / steps in redaction handling: (1) marking some area of a page (2) deleting the marked area - optionally replacing it with something (just a color background and / or some text).

For step (2) there is no prescription for what to do or how to do it. It was my design decision to use existing PyMuPDF functionality - Page.insertTextbox - for filling in text. The redaction annotation itself has - by PDF spec - no stroke color, just fill. This goes back to the fact that annotations may have at most two color parameters: /C (stroke color) and /IC (fill color). The latter is not allowed for all annot types, and redact only supports /IC. What's important here: this annotation type accepts only one color argument.

The warning that no fill color is accepted comes from MuPDF and is false. It is actually exactly the other way round: the red lines of the marking cannot be changed, and that is a "stroke" color. So what I would have to do is implement an exception for redacts here ...

The result of the whole two-step process is a potentially color-filled rectangle, potentially filled with some text - but it in any case is permanent stuff, not an annotation.

BTW, I am already testing v1.17 of MuPDF. To my great relief, it now supports deleting more or less everything that is covered by a redaction: links, and even (parts of) images. Looks like a good job. But nothing has changed with respect to what you would like to have. What you seem to prefer is replacing the annot rectangle with some other object, like an image. This is possible per the PDF spec, but not implemented. There is a PDF parameter /RO for this unsupported feature. If it were used, then no text filling would be possible ...

Wrapping up: With a good PDF viewer, you should at least be able to achieve a white / colorless rectangle with the replacement text.

martinevanschouwenburg commented 4 years ago

Thanks for your elaborate answer. That clarifies a lot actually.

So just to explain what I would like to do in the end: I would like to add the redaction annotations with Python, then have someone remove some annotations and add some new ones, and then do the actual redaction. My initial thought was to have someone open my annotated PDF, make the changes and press the redact button in the program. (You might be interested to hear that the redaction annotations are visible in both Adobe PDF Pro and MuPDF.) But since that does not work given how the redactions turn out in terms of color, I thought I'd have the end user modify the PDF and then send it back to me, and I'll do the final redaction step in PyMuPDF.

Now if I add an extra annotation in MuPDF, PyMuPDF recognizes the annotation, but because it has no fill attached to it (and I can't modify that parameter using annot.setColors) it won't actually redact the marked area.

If I add an extra annotation in Adobe, PyMuPDF recognizes the annotation, but it gave an error with the /RO in it.

So this all makes sense now. :) I guess we can close this issue, perhaps I'll be back with more questions later on.

Edited to add: I'll keep looking for a PDF editor that might work.

JorjMcKie commented 4 years ago

I guess we can close this issue, perhaps I'll be back with more questions later on.

You are welcome, any time. I am generally interested in hearing enhancement suggestions. Your main issue seems to be, though, that you have no PDF viewer at hand which properly supports redact handling. I am a little disappointed that the Adobe guy wouldn't work either. By the book, my redaction annot contains nothing against the specs: the fill color is specified via /IC, the fill text via /OverlayText, and the text font spec is cleanly coded via the /DA parameter. You can look all that up via print(doc.xrefObject(annot.xref)). Example:

>>> doc=fitz.open("v110-changes.pdf")
>>> rect = fitz.Rect(100,100,300,200)
>>> text="fill me in"
>>> page=doc[0]
>>> annot=page.addRedactAnnot(rect, text, align=fitz.TEXT_ALIGN_CENTER, fill=(1,1,0), text_color=(0,0,1))
>>> print(doc.xrefObject(annot.xref))
<<
  /Type /Annot
  /Subtype /Redact
  /P 6 0 R
  /F 4
  /Rect [ 100 641.92 300 741.92 ]  % rect in PDF coordinates
  /IC [ 1 1 0 ]  % fill color yellow
  /OverlayText (fill me in)  % the text
  /DA (0 0 1 rg /Helv 11 Tf)  % blue text, font Helvetica, fontsize 11
  /Q 1  % align center
  /NM (fitzannot-0)
  /AP <<
    /N 60 0 R
  >>
>>
>>> 
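A note on the /Rect line: its values differ from the fitz.Rect(100, 100, 300, 200) that was passed in because PyMuPDF measures y downward from the top of the page, while PDF measures it upward from the bottom. Below is a minimal sketch of that conversion. It assumes an unrotated page of height 841.92 points, which is inferred from the numbers in the dump and not stated in the thread. The helper name to_pdf_rect is hypothetical, not a PyMuPDF function.

```python
def to_pdf_rect(rect, page_height):
    """Convert (x0, y0, x1, y1) with a top-left origin to PDF's bottom-left origin."""
    x0, y0, x1, y1 = rect
    # Flipping the y axis swaps which edge is the lower one.
    return (x0, page_height - y1, x1, page_height - y0)


# Assumed page height: 841.92 pt (inferred, not stated in the thread).
pdf_rect = to_pdf_rect((100, 100, 300, 200), 841.92)
print([round(v, 2) for v in pdf_rect])
```

The result should match the four numbers shown in /Rect above.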
JorjMcKie commented 4 years ago

@martinevanschouwenburg - update: v1.17.0 is out. It supports changing the fill color of Redaction annotations now.

WhoAteDaCake commented 4 years ago

That is odd, because when I open the marked areas in Adobe Acrobat DC (Pro), it turns them white as soon as the redaction is applied. My code:

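import fitz  # PyMuPDF

doc = fitz.open("samplereport.pdf")
words = ["performance"]

for page in doc:
    for word in words:
        # one quad per occurrence of the word on this page
        for instance in page.searchFor(word, quads=True):
            # mark the occurrence; fill is the color to use once the redaction is applied
            page.addRedactAnnot(instance, cross_out=False, fill=(0, 0, 0))

# the redaction annotations are only saved here, not applied
doc.save("output.pdf", deflate=True)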

Below you can see how the two annotations differ in their object definitions:

-- Made by PyMuPDF
<<
  /Type /Annot
  /Subtype /Redact
  /P 10 0 R
  /F 4
  /Rect [ 291.43467 601.1166 338.87355 613.7629 ]
  /IC [ 0 0 0 ]
  /NM (fitzannot-0)
  /AP <<
    /N 180 0 R
  >>
>>
-- Made by adobe
<<
  /AP <<
    /D 101 0 R
    /N 102 0 R
    /R 101 0 R
  >>
  /C [ .898026 .133331 .215683 ]
  /CreationDate (D:20201006152108Z)
  /DA (1 0 0 RG 0 g 0 Tc 0 Tw 100 Tz 0 TL 0 Ts 0 Tr /Helv 0 Tf)
  /F 4
  /IC [ 0 0 0 ]
  /M (D:20201006152108Z)
  /NM (779883f1-dcfc-4b0e-aece-d1b2e8832972)
  /OC [ 1 0 0 ]
  /P 10 0 R
  /Popup 117 0 R
  /QuadPoints [ 291.435 613.327 338.874 613.327 291.435 600.87
      338.874 600.87 ]
  /RO 101 0 R
  /Rect [ 289.935 599.37 340.374 614.827 ]
  /Subj (Redact)
  /Subtype /Redact
  /T (robot)
  /Type /Annot
>>

If I open the reader, check properties of redaction and save it, it turns black when redacted... Is there any reason why that would happen ?
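To make the difference between the two dumps above easier to see, here is a rough sketch that lists the top-level keys present in the Adobe-made object but absent from the PyMuPDF-made one. It is a small hand-written tokenizer, not a PDF parser. It assumes the text looks like the dumps in this comment (no % comments, no hex strings, no nested parentheses in strings), and the name top_level_keys is hypothetical.

```python
import re

# Tokens: dictionary/array delimiters, literal strings, names, everything else.
TOKEN = re.compile(
    r"<<|>>|\[|\]|\((?:\\.|[^\\()])*\)|/[^\s/<>\[\]()]+|[^\s/<>\[\]()]+"
)


def top_level_keys(source):
    """Return the set of top-level keys of a PDF dictionary given as text."""
    keys = set()
    depth = 0
    prev_was_key = False
    for tok in TOKEN.findall(source):
        if tok in ("<<", "["):
            depth += 1
            prev_was_key = False
        elif tok in (">>", "]"):
            depth -= 1
            prev_was_key = False
        elif depth == 1 and tok.startswith("/") and not prev_was_key:
            keys.add(tok[1:])  # a name in key position
            prev_was_key = True
        else:
            prev_was_key = False  # a value token (name, number, string, ...)
    return keys


PYMUPDF_OBJ = """<<
  /Type /Annot
  /Subtype /Redact
  /P 10 0 R
  /F 4
  /Rect [ 291.43467 601.1166 338.87355 613.7629 ]
  /IC [ 0 0 0 ]
  /NM (fitzannot-0)
  /AP <<
    /N 180 0 R
  >>
>>"""

ADOBE_OBJ = """<<
  /AP <<
    /D 101 0 R
    /N 102 0 R
    /R 101 0 R
  >>
  /C [ .898026 .133331 .215683 ]
  /CreationDate (D:20201006152108Z)
  /DA (1 0 0 RG 0 g 0 Tc 0 Tw 100 Tz 0 TL 0 Ts 0 Tr /Helv 0 Tf)
  /F 4
  /IC [ 0 0 0 ]
  /M (D:20201006152108Z)
  /NM (779883f1-dcfc-4b0e-aece-d1b2e8832972)
  /OC [ 1 0 0 ]
  /P 10 0 R
  /Popup 117 0 R
  /QuadPoints [ 291.435 613.327 338.874 613.327 291.435 600.87
      338.874 600.87 ]
  /RO 101 0 R
  /Rect [ 289.935 599.37 340.374 614.827 ]
  /Subj (Redact)
  /Subtype /Redact
  /T (robot)
  /Type /Annot
>>"""

adobe_only = top_level_keys(ADOBE_OBJ) - top_level_keys(PYMUPDF_OBJ)
pymupdf_only = top_level_keys(PYMUPDF_OBJ) - top_level_keys(ADOBE_OBJ)
print(sorted(adobe_only))
print(sorted(pymupdf_only))
```

Every key in the PyMuPDF object also appears in the Adobe one; the difference is only in the extra entries Adobe writes, among them /RO, /OC and /QuadPoints.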

JorjMcKie commented 4 years ago

@WhoAteDaCake - Like with other annotation types, handling differs greatly between tools. You may want to read this comment in the documentation.

Specifically the handling of redacts in PyMuPDF is significantly extended compared to MuPDF, which does not support the following:

So if you modify / apply redactions using a MuPDF viewer, it will behave as designed and ignore the previous 3 bullets. It will similarly ignore out-of-scope features for redactions generated by every other tool (Adobe or whatever).

Similarly, if you update a redaction (before applying) with any tool, it will be turned into the standard that this tool happens to support.

Specifically, neither MuPDF nor PyMuPDF supports the /RO PDF key (I believe I at least ignore it with a warning; MuPDF keeps its mouth shut entirely when stumbling over it).

As can be seen clearly in your example, PyMuPDF follows the strategy to insert the minimal amount of data when you request an annotation insertion: No automatic author name, datetime, subject, popup, etc. If you want any of those data, you must put them in yourself. This maximises your control (and minimizes my effort 😉). But seriously, I think this is adequate for a programming tool. The Adobe viewer in contrast is an application / solution, for which other criteria may apply. I also always use the redaction rectangle as the area to scissor out text underneath. This is because MuPDF cannot yet correctly determine if a text character is inside a quad when it is not a rectangle (but instead something like a parallelogram maybe). This avoids disappointments and unnecessary issues ... 😎
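To illustrate the rectangle-versus-quad point: one simple way to reduce quad points to a rectangle is to take their bounding box. The sketch below does that for the /QuadPoints of the Adobe-made annotation quoted earlier in the thread. It only illustrates the geometry and is not PyMuPDF's internal code; the name quad_bbox is hypothetical.

```python
def quad_bbox(quad_points):
    """Return (x0, y0, x1, y1) enclosing a flat list of quad corner coordinates."""
    xs = quad_points[0::2]  # every even index is an x value
    ys = quad_points[1::2]  # every odd index is a y value
    return (min(xs), min(ys), max(xs), max(ys))


# /QuadPoints of the Adobe-made annotation quoted earlier in the thread
quad = [291.435, 613.327, 338.874, 613.327, 291.435, 600.87, 338.874, 600.87]
print(quad_bbox(quad))  # (291.435, 600.87, 338.874, 613.327)
```

For an axis-parallel quad like this one, the bounding box and the quad cover the same area. For a rotated or skewed quad the box is larger, so removing text by rectangle may take out more than the quad itself covers.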

Finally: obviously Adobe supports and uses the /RO key. This feature allows an image or similar to be put in the redaction rectangle after applying. What that is in your example (xref 101) cannot be seen, but probably not important right now.

JorjMcKie commented 4 years ago

With the coming version 1.18.0 of (Py-) MuPDF, images overlapping redacted areas can be treated in 3 different ways:

Independently of that and as before, you can choose a redact background color, which after applying will permanently overlay any image underneath.

WhoAteDaCake commented 4 years ago

Thank you for such a detailed response. This tool is amazing and definitely saved me loads of time ! I guess my only solution is to set those properties manually ? (Only way I've found how to do that is to use PyPDF2 library so far)

JorjMcKie commented 4 years ago

I guess my only solution is to set those properties manually ?

What does "manually" mean here? You can set pretty much all properties of annotations in PyMuPDF - at least as far as is "legal" for the specific annot type (some types cannot have a fill color etc.). Whatever it is: author, modDate, colors, borders, blend mode, ... you name it. Either by annot.setXXXX or some also as a parameter in annot.update(...).

But your main issue seemed to be: If you update an annotation (or apply redactions) with some specific tool, then the result will reflect the capabilities of that tool - ignoring whatever properties the annotation has had before. So you should try to stick with the same tool for creating and for updating.

If you cannot figure out how to set some property, just ask.

WhoAteDaCake commented 4 years ago

Sorry, by manually I meant as in setting properties one by one, doing it on a lower level.

So you should try to stick with the same tool for creating and for updating.

Ideally I would, but I need to automate creating redaction annotations so they can later be approved or rejected. As the review is done in Adobe Reader, there isn't a way around it. I've attempted to do this within the JavaScript execution environment of the reader, but it doesn't work well.

The only approach I could think of was to try to match the output of Adobe's redaction tool. For example:

JorjMcKie commented 4 years ago

Ah ok, got you. That won't be easy. You certainly can use the low-level functions of PyMuPDF - even down to manipulating the object definition string of the annotation and inserting a line containing /RO nnn 0 R. The xref nnn would probably (did not test it) have to be that of an image, which you must also insert / know. Maybe you have an idea for how to do that already?

JorjMcKie commented 4 years ago

Just had another look at the PDF docs: the object referenced by the /RO key must be a "Form XObject", not an image. Form XObjects are created by PyMuPDF when executing showPDFpage. This method returns the XObject's xref.

If you don't have a better way, you could maybe try the following hack (assuming you want to show a certain image in the redact annot's rectangle after applying):

  1. open image as a fitz document
  2. convert it to PDF and open that PDF
  3. create a new page in the document (will later be deleted again) and execute showPDFpage on that temporary page using the image-PDF. You can reuse the annot's rectangle for this method.
  4. take the xref nnn returned by step 3 and manipulate your redact object with it:
source = doc.xrefObject(annot.xref)
# split into lines and insert the following line, e.g. after the line ``/Rect [ ... ]``:
# /RO nnn 0 R
# then do this:
doc.updateObject(annot.xref, source)
# now the annot is prepared
# delete the temporary page from doc:
doc.deletePage(-1)
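The "split into lines and insert" step above can be sketched as a small pure-Python helper. This is an untested-against-Adobe illustration: it assumes the object source is printed one entry per line, with /Rect on a single line, as in the dumps earlier in the thread. The helper name insert_ro_entry and the xref 101 are placeholders.

```python
def insert_ro_entry(source, form_xref):
    """Return the object source with '/RO <xref> 0 R' added after the /Rect line."""
    lines = source.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("/Rect"):
            lines.insert(i + 1, "  /RO %d 0 R" % form_xref)
            return "\n".join(lines)
    raise ValueError("no /Rect line found in annotation source")


# Annotation source as printed earlier in the thread; 101 is a placeholder xref.
source = """<<
  /Type /Annot
  /Subtype /Redact
  /P 10 0 R
  /F 4
  /Rect [ 291.43467 601.1166 338.87355 613.7629 ]
  /IC [ 0 0 0 ]
  /NM (fitzannot-0)
  /AP <<
    /N 180 0 R
  >>
>>"""
patched = insert_ro_entry(source, 101)
print(patched)
```

With a real document, the patched string would take the place of source in the doc.updateObject(annot.xref, ...) call, and form_xref would be the value returned by showPDFpage in step 3.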
JorjMcKie commented 4 years ago

Good luck! Please keep me in the loop.