pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

page.apply_redactions() throws RuntimeError if key /RO is present in annotation #637

Closed ishineko closed 4 years ago

ishineko commented 4 years ago

Describe the bug (mandatory)

When "flattening" a PDF with redaction annotations by calling page.apply_redactions(), there's an exception thrown if the PDF annotation has the key /RO set.

To Reproduce (mandatory)

Take a PDF with a redaction annotation with the key /RO set (Overlay Text), and execute the method page.apply_redactions(). PyMuPDF crashes with a runtime error: Unsupported redaction key '/RO'

Expected behavior (optional)

The page's redaction annotations should be applied, and the underlying text should be removed.

Additional context (optional)

If this is a fundamental limitation because of the underlying library (MuPDF), is there a way to safely remove the key /RO from the annotations (through our Python code) so that we can successfully call page.apply_redactions()?

JorjMcKie commented 4 years ago

No, this works as intended / designed ... and documented! The /RO is not for overlay text, but for an image or similar to be placed in the annot rect.

ishineko commented 4 years ago

Thanks for the clarification. I misspoke about the overlay text, @JorjMcKie ... in that case, is there a way that we can remove the key programmatically?

JorjMcKie commented 4 years ago

Hm, currently there only are two ways:

  1. dirty: get the object definition string of the annot with o = doc.xrefObject(annot.xref) and remove the line containing the /RO key. This string is laid out in a clean way, so it should be easy to see what the line with /RO looks like. After removing it, execute doc.updateObject(annot.xref, new_o).
  2. cleaner: remove the annot and enter a new redact annot in the same place (same rectangle and any other value you need from the old redact).
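
A minimal sketch of the first ("dirty") option, under some assumptions: the /RO key sits on a line of its own in the string returned by doc.xrefObject(), and its value fits on that same line (for example an indirect reference such as "12 0 R"). The helper names strip_ro_key and remove_ro_from_redactions are made up for illustration and are not part of PyMuPDF. The method names xrefObject and updateObject are the ones used in this thread; later PyMuPDF releases call them xref_object and update_object.

```python
import re


def strip_ro_key(obj_source: str) -> str:
    """Return the object definition with any line that starts with /RO removed.

    Assumes the /RO entry and its value occupy exactly one line.
    """
    return re.sub(r"^[ \t]*/RO\b.*\n?", "", obj_source, flags=re.MULTILINE)


def remove_ro_from_redactions(doc) -> int:
    """Strip /RO from every redaction annot of an open PyMuPDF document.

    Hypothetical helper; returns the number of annotations rewritten.
    """
    changed = 0
    for page in doc:
        for annot in page.annots():
            if annot.type[1] != "Redact":  # annot.type is (number, name)
                continue
            old_o = doc.xrefObject(annot.xref)  # object definition as text
            new_o = strip_ro_key(old_o)
            if new_o != old_o:
                doc.updateObject(annot.xref, new_o)
                changed += 1
    return changed
```

Usage would look roughly like this: open the file with fitz.open(), call remove_ro_from_redactions(doc) once, then call page.apply_redactions() on each page and save the result. Inspecting the output of doc.xrefObject() for one affected annot first is a good idea, to confirm the /RO line really looks as assumed.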

ishineko commented 4 years ago

Awesome, thanks for the quick response!

I'll try the dirty way because the pages are landscaped, and when removing the original redact annot and adding new ones, the new annot text shows sideways.

Thank you very much for all your help. This is really an awesome library!

JorjMcKie commented 4 years ago

the new annot text shows sideways.

Probably caused by one of those famous non-wrapped "geometry changes" (see docu). Try page.wrapContents() before you insert new stuff.

JorjMcKie commented 4 years ago

... and maybe I shouldn't be so picky about the /RO and just ignore it with a warning ...

ishineko commented 4 years ago

the new annot text shows sideways.

Probably caused by one of those famous non-wrapped "geometry changes" (see docu). Try page.wrapContents() before you insert new stuff.

Just tried it. Didn't fix the issue, but it's cool. I'll just modify the original redact annot through the xref trick. Thanks!!

... and maybe I shouldn't be so picky about the /RO and just ignore it with a warning ...

This would be fantastic, if you ever go that route. I upvote this idea! 👍

JorjMcKie commented 4 years ago

... and maybe I shouldn't be so picky about the /RO and just ignore it with a warning ...

I just built in the fix - easy-peasy. If you want, I can send you a highly unofficial wheel, so you need not resort to tricks ... and you can test that no bad things happen downstream.

JorjMcKie commented 4 years ago

Just tried it. Didn't fix the issue, but it's cool.

Any issue sending me that PDF? Would like to learn what is going on there ...

JorjMcKie commented 4 years ago

If you rename the ZIP extension back to "whl" and execute python -m pip install -U PyMuPDF-1.17.6-cp37-cp37m-win_amd64.whl you will have a version which just warns on /RO keys ... PyMuPDF-1.17.6-cp37-cp37m-win_amd64.zip

ishineko commented 4 years ago

Just tried it. Didn't fix the issue, but it's cool.

Any issue sending me that PDF? Would like to learn what is going on there ...

Thanks for the offer! Regrettably these are clinical-trial data PDFs, so we can't share them. I know, it makes it that much harder to figure what's going on when things don't work out :-(

ishineko commented 4 years ago

If you rename the ZIP extension back to "whl" and execute python -m pip install -U PyMuPDF-1.17.6-cp37-cp37m-win_amd64.whl you will have a version which just warns on /RO keys ... PyMuPDF-1.17.6-cp37-cp37m-win_amd64.zip

Dude! You're a gentleman and a scholar! It worked like a charm!! 🥇

Thank you very much for your flexibility and quick thinking!

JorjMcKie commented 4 years ago

Thank you very much for the compliments.