pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

Fix #903: Keep track of OCG for LTCurve, LTLine, LTRect #924

Closed hfmandell closed 4 months ago

hfmandell commented 11 months ago

Pull request

This PR fixes issue #903, which I raised after encountering this problem.

Many vector PDFs have Optional Content Groups (OCGs), also referred to as layers. When extracting layout components such as LTCurve, LTLine, and LTRect, it is often useful to know which OCG each component belongs to. This PR makes that possible by:

  1. Adding ocg attributes to LTCurve, LTLine, and LTRect in 'pdfminer/layout.py'
  2. Setting the ocg attribute in 'pdfminer/converter.py'
  3. Adding an ocg attribute to the PDFGraphicState object in 'pdfminer/pdfinterp.py'
  4. Setting the PDFGraphicState's ocg attribute in 'pdfminer/pdfinterp.py' when the BDC (begin marked content) operator is encountered in the PDF's content stream, and ensuring the current ocg value is maintained even when the graphics state is restored by the Q operator.
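
The four steps above can be sketched with a toy model. This is not the PR's diff and does not use pdfminer's real classes: `GraphicState`, `Shape`, and `ToyInterpreter` are hypothetical stand-ins, and the handling of Q is an assumption based only on the description in step 4. How the end of a marked-content section (EMC) should be treated is not stated in the description, so the sketch ignores it.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class GraphicState:
    """Toy stand-in for a graphics state; not pdfminer's PDFGraphicState."""
    linewidth: float = 1.0
    ocg: Optional[str] = None  # step 3: the state remembers the current OCG


@dataclass
class Shape:
    """Toy stand-in for LTCurve / LTLine / LTRect (step 1: an ocg attribute)."""
    kind: str
    ocg: Optional[str]


class ToyInterpreter:
    def __init__(self) -> None:
        self.gs = GraphicState()
        self.gstack: List[GraphicState] = []
        self.shapes: List[Shape] = []

    def do_q(self) -> None:
        # Save a copy of the current graphics state.
        self.gstack.append(replace(self.gs))

    def do_Q(self) -> None:
        # Restore the saved state, but carry the current OCG over (step 4).
        if self.gstack:
            current_ocg = self.gs.ocg
            self.gs = self.gstack.pop()
            self.gs.ocg = current_ocg

    def do_BDC(self, tag: str, props: str) -> None:
        # Step 4: record the marked-content properties as the current OCG.
        self.gs.ocg = props

    def paint(self, kind: str) -> None:
        # Step 2: each emitted shape gets the OCG active when it is painted.
        self.shapes.append(Shape(kind, self.gs.ocg))


interp = ToyInterpreter()
interp.do_q()
interp.do_BDC("OC", "oc13")
interp.paint("rect")
interp.do_Q()          # linewidth etc. are restored, ocg is kept
interp.paint("line")
print([(s.kind, s.ocg) for s in interp.shapes])
# -> [('rect', 'oc13'), ('line', 'oc13')]
```

Without the carry-over in `do_Q`, the line painted after the restore would lose its OCG, which appears to be the situation step 4 is guarding against.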

How Has This Been Tested?

Please remove this paragraph with a description of how this PR has been tested. [TODO]

Checklist

hfmandell commented 10 months ago

> Can you give an example of the content of the props in do_BDC() that you would like to use?

The props are not obviously helpful at first glance: each is simply an alphanumeric string unique to a particular OCG. In testing, I've seen props that clearly point to an OCG, such as "/oc13". Others are less clear and bear no resemblance to the acronym "OCG". They appear, with the leading "/", in the output of dumppdf.py for a given PDF.

A bit more logic is needed to tie these props to the actual name of the OCG in the PDF, for example the "Roads" layer of a layered PDF map. Still, associating a PDF vector drawing with its props already lets the user categorize LTCurves, LTLines, and LTRects by OCG. A future PR could tie it directly to the PDF layer name.
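
As a rough sketch of that extra logic: in the PDF format, the name given to BDC with an /OC tag is looked up in the /Properties subdictionary of the page's resource dictionary, and the optional content group dictionary found there carries a /Name entry. The example below assumes the resources have already been resolved into plain Python dicts; the `resources` contents and the `layer_name` helper are hypothetical and not part of pdfminer.six.

```python
from typing import Any, Dict, Optional

# Hypothetical, already-resolved page resources (plain dicts, no indirect
# references). Real resources would come from the parsed PDF.
resources: Dict[str, Any] = {
    "Properties": {
        "oc13": {"Type": "OCG", "Name": "Roads"},
        "oc14": {"Type": "OCG", "Name": "Rivers"},
    }
}


def layer_name(resources: Dict[str, Any], props: str) -> Optional[str]:
    """Map a BDC property name such as '/oc13' to its layer name, if any."""
    key = props.lstrip("/")  # accept the name with or without the leading "/"
    entry = resources.get("Properties", {}).get(key)
    # Anything that is not a plain OCG dictionary is left unresolved here.
    if not isinstance(entry, dict) or entry.get("Type") != "OCG":
        return None
    return entry.get("Name")


print(layer_name(resources, "/oc13"))  # -> Roads
print(layer_name(resources, "oc99"))   # -> None
```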

pietermarsman commented 10 months ago

Thanks for the extra info. I see now why storing the OCG could be useful in some specific cases.

I've been reading section 8.11 (Optional Content) of the PDF Reference, but find it quite tricky to understand. Do you happen to have a PDF that has optional content groups that you can share? That would help me to understand them.

As far as I understand now, the properties of the BDC operator are also used for other purposes, not just OCGs. Therefore, simply converting them to a string and storing that in the graphics state is not enough. E.g. the test PDFs have a couple of BDCs with a /P tag and some extra properties. I think these are unrelated to OCGs, but correct me if I'm wrong.
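
One way to address this concern, sketched under assumptions: treat a BDC as optional content only when its tag is /OC, and ignore every other tag. The `ocg_from_bdc` helper is hypothetical, and tags and properties are modelled as plain strings and dicts rather than pdfminer's own object types.

```python
from typing import Any, Optional


def ocg_from_bdc(tag: str, props: Any) -> Optional[str]:
    """Return the OCG property name for an /OC marked-content section.

    Any other tag (for example /P) yields None, so unrelated marked
    content never ends up in the graphics state as an OCG.
    """
    if tag != "OC":
        return None
    # For /OC the properties operand is expected to be a name that refers
    # to the page's /Properties resources, not an inline dictionary.
    if not isinstance(props, str):
        return None
    return props


print(ocg_from_bdc("OC", "oc13"))        # -> oc13
print(ocg_from_bdc("P", {"MCID": 0}))    # -> None
```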

pietermarsman commented 4 months ago

Closing because there has been no response. Feel free to reopen when extra info is available.