pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License
5.94k stars 930 forks

Clipping paths implementation #414

Open kelvin0 opened 4 years ago

kelvin0 commented 4 years ago

Hi Everyone,

I've been using pdfminer for the last few months, and I really think it's a very helpful codebase.

But recently I noticed that clipping paths do not seem to be implemented. I inspected \pdfminer\pdfinterp.py:

# clip
def do_W(self):
    return

# clip-even-odd
def do_W_a(self):
    return

The effect of this is that ALL text is extracted from the PDF, even text that should not be visible (since it should be clipped).

I am not a PDF expert but I can surely help implement the following features:

Hope I can clarify this and be able to contribute to the project if necessary.

pietermarsman commented 4 years ago

Hi @kelvin0, are you experiencing problems due to this issue? I assume that the clipping operator is more often used to exclude parts of a drawing than to exclude parts of the text. Anyway, it would be nice to have a PDF to test this on.

If you want to start implementing this, have a look at section 4.4.3 of the PDF reference manual.

You should also adjust the PDFGraphicsState class. I think it is wise to assess the impact that adding the clipping path to PDFGraphicsState could have on all the other graphics-state aware operators.
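To make the graphics-state point concrete, here is a minimal sketch of how a clipping path could travel with the graphics state. It assumes the behaviour described in the PDF reference: the clip is part of the graphics state (saved by `q`, restored by `Q`), and `W`/`W*` only take effect once the path is ended by a painting operator such as `n`. The names `GraphicsStateSketch`, `InterpreterSketch`, `clip` and `pending_rule` are hypothetical and are not pdfminer.six's API; only rectangles built with `re` are handled.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Path = Tuple[Tuple[float, float], ...]  # polygon vertices
ClipEntry = Tuple[Path, str]            # (path, "nonzero" or "evenodd")


@dataclass(frozen=True)
class GraphicsStateSketch:
    """Hypothetical stand-in for a graphics state that also carries a clip."""
    linewidth: float = 0.0
    # The effective clip region is the intersection of all entries.
    # An empty tuple means nothing has been clipped yet.
    clip: Tuple[ClipEntry, ...] = ()


class InterpreterSketch:
    """Hypothetical interpreter handling only q, Q, re, W, W* and n."""

    def __init__(self) -> None:
        self.gs = GraphicsStateSketch()
        self.gstack: List[GraphicsStateSketch] = []
        self.curpath: Path = ()
        self.pending_rule = ""

    def do_q(self) -> None:
        # Save the whole graphics state, clip included.
        self.gstack.append(self.gs)

    def do_Q(self) -> None:
        # Restoring the state also restores the previous clip.
        if self.gstack:
            self.gs = self.gstack.pop()

    def do_re(self, x: float, y: float, w: float, h: float) -> None:
        # Simplification: a rectangle replaces the current path.
        self.curpath = ((x, y), (x + w, y), (x + w, y + h), (x, y + h))

    def do_W(self) -> None:
        self.pending_rule = "nonzero"

    def do_W_a(self) -> None:
        self.pending_rule = "evenodd"

    def do_n(self) -> None:
        # End the path without painting; a pending W/W* now takes effect.
        # (Other painting operators would need the same treatment.)
        if self.pending_rule and self.curpath:
            entry = (self.curpath, self.pending_rule)
            self.gs = replace(self.gs, clip=self.gs.clip + (entry,))
        self.curpath = ()
        self.pending_rule = ""
```

Because the state object is immutable and replaced wholesale, every operator that copies or restores the state picks up the clip automatically. That is one way to limit the impact on the other graphics-state aware operators.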

Belval commented 4 years ago

Clipping path is indeed used to hide text in PDF documents, here is an example that could be used as a starting point: https://mva.maryland.gov/Documents/VR-181.pdf

There is hidden text slightly above the "VR-181 (03-18)". I was able to extract it properly with pdfbox, but not with pdfminer as path clipping is not supported.

pietermarsman commented 4 years ago

Feel free to create a PR. I can do reviews and merge it when ready.

I don't mind if the first implementation only focusses on adding clipping-path behaviour and ignores additional top-level arguments for enabling/disabling the behaviour. We can create another issue for that, if needed.

jstockwin commented 4 years ago

@kelvin0 Just a quick bump on this issue as we're trying to sort through them. Are you still willing to work on this? As commented above, a PR would be appreciated if you're still interested and able to.

dhdaines commented 3 months ago

Hi! I just ran into this issue as well. Using the clipping path to hide text seems fairly common in legal documents specifically (academic documents often use the more prosaic method of setting the text colour to white). You can see this pretty clearly in https://www.legisquebec.gouv.qc.ca/fr/pdf/lc/C-1.pdf - on the first page (PDF object 5) there is a bunch of hidden text. The way the formatter in question (Antenna House 6.3) renders text is somewhat annoying to follow, but it appears that it simply sets the clipping path to something arbitrary that excludes the text in question. For example, here is the hidden text "CADASTRE" at the top of page 1:

% flip the transformation matrix so that 0, 0 is at the top of the page, then translate to
% set the margins, or something like that
q 1 0 0 -1 0 792 cm q 1 0 0 1 72 42.51968 cm 1 0 0 1 0 85.03937 cm 0 0 0 rg
% create a rectangle of height and width 0 and intersect with the clipping path using the
% even-odd winding rule (for no apparent reason) then move to its location
q 0 47.47323 0 0 re W* n 1 0 0 1 0 47.47323 cm
% render some text (that will not get rendered) using gratuitously arbitrary cid mapping
BT /F0 11 Tf 1 0 0 -1 -0.00001 8.88061 Tm<0026> Tj
0 -11.52 Td<0024> Tj 0 -11.52 Td<0027> Tj 0 -11.52 Td<0024> Tj
0 -11.52 Td<0036> Tj 0 -11.52 Td<0037> Tj 0 -11.52 Td<0035> Tj
0 -11.52 Td<0028> Tj ET
% restore normal graphics state
Q
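As a sanity check on the reading above, here is a small sketch that replays the three `cm` operators that precede the clip and transforms the `re` rectangle. It assumes the page starts from the identity matrix (any transformation set up earlier on the page is ignored) and uses the PDF convention that `cm` applies the new matrix before the existing CTM. `concat` and `apply` are helper names made up for this example.

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]  # (a, b, c, d, e, f)
IDENTITY: Matrix = (1, 0, 0, 1, 0, 0)


def concat(m: Matrix, ctm: Matrix) -> Matrix:
    """Effect of `cm`: the new matrix m is applied before the existing CTM."""
    a1, b1, c1, d1, e1, f1 = m
    a0, b0, c0, d0, e0, f0 = ctm
    return (
        a1 * a0 + b1 * c0,
        a1 * b0 + b1 * d0,
        c1 * a0 + d1 * c0,
        c1 * b0 + d1 * d0,
        e1 * a0 + f1 * c0 + e0,
        e1 * b0 + f1 * d0 + f0,
    )


def apply(ctm: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Transform a point from user space through the CTM."""
    a, b, c, d, e, f = ctm
    return (a * x + c * y + e, b * x + d * y + f)


# The three `cm` operators that precede the clip in the stream above.
ctm = IDENTITY
for m in [
    (1, 0, 0, -1, 0, 792),       # flip the y axis
    (1, 0, 0, 1, 72, 42.51968),  # translate
    (1, 0, 0, 1, 0, 85.03937),   # translate again
]:
    ctm = concat(m, ctm)

# `0 47.47323 0 0 re`: x, y, width, height
x, y, w, h = 0, 47.47323, 0, 0
corner0 = apply(ctm, x, y)
corner1 = apply(ctm, x + w, y + h)
clip_area = abs(corner1[0] - corner0[0]) * abs(corner1[1] - corner0[1])
```

Under those assumptions both corners land on the same point, roughly (72, 616.97) measured from the bottom-left of the page, so `clip_area` is 0 and nothing drawn before the matching `Q` can be visible.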

Implementing the winding rules seems rather complicated, though there are plenty of implementations out there that can serve as a reference.
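For straight-edged paths, the two rules can at least be sketched compactly. The following is a minimal point-in-path test, assuming every subpath is a closed polygon (curves would need flattening first) and that the point does not lie exactly on an edge. It is a textbook ray-crossing approach, not code from pdfminer.six or any other library, and all names are invented for the example.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]


def ray_crossings(pt: Point, subpaths: List[Polygon]) -> Tuple[int, int]:
    """Cast a ray from pt towards +x and return (winding number, crossings)."""
    x, y = pt
    winding = 0
    crossings = 0
    for poly in subpaths:
        n = len(poly)
        for i in range(n):
            x0, y0 = poly[i]
            x1, y1 = poly[(i + 1) % n]  # implicitly close the subpath
            # > 0 if pt is left of the directed edge, < 0 if it is right of it
            is_left = (x1 - x0) * (y - y0) - (x - x0) * (y1 - y0)
            if y0 <= y < y1 and is_left > 0:    # edge crosses the ray going up
                winding += 1
                crossings += 1
            elif y1 <= y < y0 and is_left < 0:  # edge crosses the ray going down
                winding -= 1
                crossings += 1
    return winding, crossings


def inside_nonzero(pt: Point, subpaths: List[Polygon]) -> bool:
    """Nonzero winding number rule (operator W)."""
    return ray_crossings(pt, subpaths)[0] != 0


def inside_even_odd(pt: Point, subpaths: List[Polygon]) -> bool:
    """Even-odd rule (operator W*)."""
    return ray_crossings(pt, subpaths)[1] % 2 == 1


# Two nested squares drawn in the same direction: the rules disagree
# about points inside the inner square.
outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
inner = [(3, 3), (7, 3), (7, 7), (3, 7)]
```

With these two squares, the point (5, 5) is inside under the nonzero rule but outside under the even-odd rule, while (1, 5) is inside under both. A degenerate zero-area rectangle like the one in the stream above contains no point under either rule, which fits the observation that the choice of `W*` there makes no difference.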

dhdaines commented 3 months ago

> Clipping path is indeed used to hide text in PDF documents, here is an example that could be used as a starting point: https://mva.maryland.gov/Documents/VR-181.pdf
>
> There is hidden text slightly above the "VR-181 (03-18)". I was able to extract it properly with pdfbox, but not with pdfminer as path clipping is not supported.

Thanks for the example! In this case the clipping path is a simple rectangle and all the hidden text is placed outside that rectangle.

My first idea is to make a PR that minimally supports these two examples by deriving a visible rectangle from the clipping path and intersecting it with the bbox of characters when they are added to the layout. At that point the converter (or another library like pdfplumber) can call is_empty() on them to decide whether they should be shown.

Edit: That seems like not such a great idea, actually, since objects that are clipped out are not in the layout by definition. If you want to get at them you could use the interpreter directly.
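For reference, the rectangle-intersection idea could look roughly like the sketch below. It assumes the clipping path has already been reduced to an axis-aligned rectangle in the same coordinate space as the character bounding boxes. `Rect`, `bbox_of`, `intersect` and the standalone `is_empty` are hypothetical helpers written for this example. `is_empty` is only meant to mimic the width-or-height-not-positive check that the layout objects' `is_empty()` performs.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Rect(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


def bbox_of(points: Iterable[Tuple[float, float]]) -> Rect:
    """Bounding box of the clipping path's vertices."""
    pts = list(points)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return Rect(min(xs), min(ys), max(xs), max(ys))


def intersect(a: Rect, b: Rect) -> Rect:
    """Overlap of two rectangles; may come out inverted if they are disjoint."""
    return Rect(max(a.x0, b.x0), max(a.y0, b.y0), min(a.x1, b.x1), min(a.y1, b.y1))


def is_empty(r: Rect) -> bool:
    """True if the rectangle has no positive width or height."""
    return (r.x1 - r.x0) <= 0 or (r.y1 - r.y0) <= 0


def visible_chars(clip: Rect, chars: List[Tuple[str, Rect]]) -> List[str]:
    """Keep characters whose bbox still has some area after clipping."""
    return [text for text, box in chars if not is_empty(intersect(clip, box))]


# Made-up data: a 100x100 clip and three character boxes.
clip = bbox_of([(0, 0), (100, 0), (100, 100), (0, 100)])
chars = [
    ("A", Rect(10, 10, 20, 22)),    # fully inside the clip
    ("B", Rect(10, 150, 20, 162)),  # fully outside, so hidden
    ("C", Rect(95, 10, 105, 22)),   # partly inside, so kept
]
shown = visible_chars(clip, chars)
```

Whether clipped characters should be dropped, kept with an empty bbox, or only exposed through the interpreter remains the open design question raised in the edit above. This sketch only shows the geometry.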