pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Sequence numbers might not be right for rectangles #1387

Closed mlissner closed 2 years ago

mlissner commented 2 years ago

Describe the bug (mandatory)

I'm continuing my work on finding bad redactions and I've found a document that isn't making sense to me:

https://storage.courtlistener.com/recap/gov.uscourts.akd.62782/gov.uscourts.akd.62782.121.1.pdf

As usual, I'm trying to identify text that's under rectangles. In this case, my program is reporting that the date and address that are visible in the upper margin are under a white rectangle.

According to PyMuPDF, and unless I'm mistaken, the white rectangle has a seqno of:

8

And the date text has a sequence number of:

1

So it should be hidden. But for some reason when you open the PDF it's not. This makes me wonder if the sequence number reported by PyMuPDF is wrong, or if there's another attribute that we're missing somehow that would clarify this.

I tried looking at the raw PDF stream of the document, and a colleague did too, but I at least couldn't make sense of it. Maybe you have wisdom to share about inspecting PDFs.

To Reproduce (mandatory)

  1. Get the rectangles in the linked PDF.
  2. Get the text from the PDF.
  3. See which ones intersect with the text and have a higher sequence number
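
The three steps above can be sketched roughly as follows. This is a minimal sketch, not the reporter's actual program: the helper names `rects_over_text` and `load_page_items` are hypothetical, and it assumes that `page.get_drawings()` entries carry `rect`, `fill` and `seqno`, and that `page.get_texttrace()` spans carry `bbox` and `seqno`.

```python
def rects_over_text(drawings, spans):
    """Return (span, drawing) pairs where a filled path intersects a text
    span and was painted later (higher seqno) than the text."""
    hits = []
    for span in spans:
        sx0, sy0, sx1, sy1 = span["bbox"]
        for d in drawings:
            # Stroke-only paths do not cover anything, so skip them.
            if d.get("fill") is None:
                continue
            rx0, ry0, rx1, ry1 = d["rect"]
            overlaps = sx0 < rx1 and rx0 < sx1 and sy0 < ry1 and ry0 < sy1
            if overlaps and d["seqno"] > span["seqno"]:
                hits.append((span, d))
    return hits


def load_page_items(path, page_number=0):
    """Hypothetical helper: collect drawings and text spans of one page."""
    import fitz  # PyMuPDF; imported here so the check above stays dependency-free

    doc = fitz.open(path)
    page = doc[page_number]
    return page.get_drawings(), page.get_texttrace()
```

Note that a check like this only compares bounding boxes and paint order, so on its own it would flag the date in the linked PDF, which is exactly the result reported here.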

Expected behavior (optional)

The answer to number three should be "None of the rectangles have a higher seqno, because the text is plainly visible and not under a rectangle."

Your configuration (mandatory)

Linux, Ubuntu, 64 bit

3.8, 64 bit

1.19.0

Additional context (optional)

Thank you as always! I hope this isn't another corner case!

JorjMcKie commented 2 years ago

Thank you as always! I hope this isn't another corner case!

It's no bug: all the sequence numbers are correct. It is another special case 😟: all the rectangles on this page are wrapped in so-called clipping paths. Clipping paths with ... (fasten seatbelt) ... their own fill rule: even-odd, which overrules the non-zero winding rule of the individual paths under their control. The text lies in an area where 4 white rectangles overlap, each of which is under the control of a different clipping path with the even-odd rule. So that area is treated as uncolored.
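
To make the even-odd argument concrete, here is a deliberately simplified sketch. It treats the overlapping rectangles as axis-aligned subpaths of a single path with the same orientation, and is not how MuPDF evaluates clipping paths; the function names are hypothetical.

```python
def covering_count(point, rects):
    """Count how many rectangles (x0, y0, x1, y1) contain the point."""
    x, y = point
    return sum(1 for (x0, y0, x1, y1) in rects if x0 <= x <= x1 and y0 <= y <= y1)


def is_filled(point, rects, even_odd):
    """Decide whether the point receives fill color under the given rule.

    even_odd=True : filled only if an odd number of rectangles cover it.
    even_odd=False: non-zero winding; with equally oriented rectangles the
                    winding number equals the count, so any cover fills it.
    """
    n = covering_count(point, rects)
    if even_odd:
        return n % 2 == 1
    return n > 0
```

With four rectangles covering the text position, `is_filled(..., even_odd=True)` is false while the non-zero winding result is true, which matches the "treated as uncolored" outcome described above.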

In theory, of course, this would not preclude the shapes inside any of those clipping paths from containing more clipping paths (they can be nested), or shapes that color / uncolor subareas because of, e.g., their own non-zero winding rule.

I am currently not able to reproduce this situation. I am also not sure whether this is even possible given the current API ...

I am taking the liberty of marking this issue as an enhancement one more time. I'll keep you posted.

mlissner commented 2 years ago

So that area is treated as uncolored.

Wow. This analysis is just amazing. If you have tips for learning how to do what you do, I'd love to learn more.

In the meantime, I've found a pretty good solution for most of my problems, which is to just ignore all rectangles that aren't black. Usually bad redactions are black, but not always.

Anyway, thank you again for this. Should I continue sending you bizarro corner cases like this?

JorjMcKie commented 2 years ago

Wow. This analysis is just amazing.

Thank you again for your compliments! It was easier than it seems. MuPDF has a batch utility, mutool trace input.pdf, which outputs MuPDF's analysis of the page's appearance commands. The output is in XML format and is written to the console, so it needs to be captured to a file. That trace uses the same data source as my page.get_drawings(), but the utility just outputs a 1:1 version as XML text. From the same data, I build Python lists and dictionaries instead, collapsing groups of 3 or 4 connected lines into rectangles and quads.

I came to that conclusion when I saw that the date text is contained in exactly 4 rectangles, all white, all with clockwise orientation. The first is under the text, the other three are above it. So the text should have been invisible, as you and your friends concluded too. The mutool trace output then showed that those rectangles are under the control of separate clipping paths (which I have ignored so far), each of which has "even-odd" defined, so their intersection at the text is even => no fill color.
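
Capturing that console output to a file could look like the sketch below. It assumes the `mutool` executable is on the PATH; the helper names `build_trace_command` and `save_trace` are made up for illustration, and only the `mutool trace input.pdf` invocation itself comes from this thread.

```python
import subprocess


def build_trace_command(pdf_path, tool="mutool"):
    """Build the command line for MuPDF's trace utility."""
    return [tool, "trace", pdf_path]


def save_trace(pdf_path, out_path, tool="mutool"):
    """Run the trace and write its console output (XML text) to out_path."""
    result = subprocess.run(
        build_trace_command(pdf_path, tool),
        capture_output=True,
        text=True,
        check=True,  # raise if mutool reports an error
    )
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(result.stdout)
    return out_path
```

The saved file can then be searched for the clipping entries around the rectangles in question.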

Should I continue sending you bizarro corner cases like this?

Please definitely do so! You and your team are the most avid and advanced users of that method. And, as mentioned once, I take pride in knowing that this work is helping to propel a good initiative.

Usually bad redactions are black, but not always.

I thought so, too. It is tempting to accept the situation as it is. Suppose I find a way to (more) faithfully reflect what is happening in the page's command source: will it be understandable, usable or even important? Anyway - I am a perfectionist, so I will continue thinking about that.

mlissner commented 2 years ago

There is MuPDF's batch utility mutool trace input.pdf

Ah ha! That does make it easier to see. You used this in #1355 too. I should have caught that. Thank you.

Suppose I find a way to (more) faithfully reflect what is happening in the page's command source: will it be understandable, usable or even important?

It honestly kind of depends how easy the API is. If you gave me info about clips and winding rules in the get_drawings response, that'd be good, but probably not enough. It seems like going from clip and winding rule info to "this text is visible" is really hard and I don't think I'll have the time to do it.

On the other hand, if there was an attribute on every char for is_visible_to_humans, well, I'd use that in a heartbeat. Obviously, that's well outside your scope though.

The other approach I'm looking at is to render pixmaps of each place that looks like a bad redaction. Once I have the pixmap, I can check if it's all one color (indicating a problem) or if it's multiple colors (indicating a mixture of text on some other background). It looks like if I use the memoryview for this, it might not be too bad.
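
A rough sketch of that pixmap check, under a few assumptions: `page.get_pixmap(clip=...)` renders just the suspect area, `pix.samples_mv` exposes the pixel bytes as a memoryview, `pix.n` is the number of bytes per pixel, and the sample rows are stored without padding. The function names are hypothetical.

```python
def is_single_color(samples, n):
    """Return True if every pixel in the sample buffer has the same value.

    samples: bytes-like pixel data, n bytes per pixel, rows without padding.
    """
    data = bytes(samples)  # copies the buffer; acceptable for small regions
    if len(data) <= n:
        return True
    first_pixel = data[:n]
    # The region is unicolor if it equals its first pixel repeated throughout.
    return data == first_pixel * (len(data) // n)


def region_is_single_color(page, rect):
    """Render only `rect` of a PyMuPDF page and test whether it is unicolor."""
    pix = page.get_pixmap(clip=rect)
    return is_single_color(pix.samples_mv, pix.n)
```

A region that comes back as a single color would suggest the text there is actually covered, while mixed colors would suggest visible text on some background.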

JorjMcKie commented 2 years ago

Resolved by version 1.19.2.