pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0
4.49k stars 443 forks

Invalid OCGs not ignored by SVG image creation #3569

Open JorjMcKie opened 2 weeks ago

JorjMcKie commented 2 weeks ago

Discussed in https://github.com/pymupdf/PyMuPDF/discussions/3567

Originally posted by **serhii-brovarnyk** June 11, 2024

Hello! I have a single-page PDF file that was produced by another PDF tool, and the document contains some OCGs. Unfortunately, I cannot provide the actual file.

Rendering the page to a pixmap works fine, but when I create an SVG image via `page.get_svg_image(text_as_path=False)`, the page looks completely different. After investigating, I concluded that some of the clip-paths affect the appearance of the drawing. The defs section has no relation to the layers or OCGs, but some of the groups look like this:

```
```

If I delete a certain clip-path in the defs section, more content becomes visible in the SVG image. So I suppose the SVG contains invisible data from some of the OCGs, and since it is not managed by the PDF, I see it whether I am supposed to or not.

So my question is: how can I detect and delete invisible and unnecessary OCGs from my PDF document, so that there is no difference between the SVG image and the pixmap that I get from the pymupdf Page object?

It is important to note that the pymupdf Document object has no information about layers or OCGs. I tried the `doc.get_layers()`, `doc.get_ocgs()` and `doc.layer_ui_configs()` methods, but they return empty lists. However, `page.get_oc_items()` returns this list of OCGs:

```
[('oc10', 68, 'ocg'),
 ('oc1009', 67, 'ocg'),
 ('oc1010', 66, 'ocg'),
 ...
 ('oc945', 7, 'ocg'),
 ('oc946', 6, 'ocg'),
 ('oc947', 5, 'ocg')]
```

I also used this code:

```
page_xref = doc.page_xref(0)
xref_keys = doc.xref_get_keys(page_xref)
for key in xref_keys:
    print(f"KEY: {key}")
    print(doc.xref_get_key(page_xref, key))
    print('---------------')
```

to get this information:

```
KEY: Contents
('xref', '80 0 R')
---------------
KEY: MediaBox
('array', '[0 0 2160 3024]')
---------------
KEY: Parent
('xref', '82 0 R')
---------------
KEY: Resources
('dict', '<>/Font<>/ProcSet[/PDF/Text/ImageC]/Properties<>>>')
---------------
KEY: Rotate
('int', '270')
---------------
KEY: Type
('name', '/Page')
---------------
KEY: VP
('array', '[]')
---------------
```

In conclusion, this document has some OCGs that are accessible only at the page level. I want to keep only the visible OCGs, so that the resulting SVG image looks right, and delete the rest. Can you give me some advice on how to do this? I have read two similar discussions about OCGs but did not find the answer there :(
julian-smith-artifex-com commented 1 week ago

Associated MuPDF bug is: https://bugs.ghostscript.com/show_bug.cgi?id=707824
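
The mismatch reported in the original post (document-level OCG queries come back empty while `page.get_oc_items()` lists entries) can be checked with a small diagnostic. The following is only a sketch under assumptions, not a fix for the SVG output: it treats an OCG as "unregistered" when its xref is missing from `doc.get_ocgs()`, which is one plausible reading of "invalid" in the issue title. The helper names `find_unregistered_ocgs` and `report` are hypothetical, and `import pymupdf` assumes a recent PyMuPDF release (older releases import as `fitz`).

```python
def find_unregistered_ocgs(page_oc_items, document_ocgs):
    """Return page-level OCG entries whose xref is not known at document level.

    page_oc_items: list of (name, xref, type) tuples, as returned by
                   page.get_oc_items() in the post above.
    document_ocgs: mapping keyed by OCG xref, as returned by doc.get_ocgs().
    """
    registered = set(document_ocgs)
    return [
        (name, xref, kind)
        for name, xref, kind in page_oc_items
        if kind == "ocg" and xref not in registered
    ]


def report(path, page_number=0):
    """Hypothetical convenience wrapper that runs the check on a real file."""
    import pymupdf  # imported lazily so the helper above runs without PyMuPDF

    doc = pymupdf.open(path)
    page = doc[page_number]
    return find_unregistered_ocgs(page.get_oc_items(), doc.get_ocgs())


# Sample values copied from the output quoted in the post; the empty dict
# mirrors the report that the document-level queries returned nothing.
sample_items = [("oc10", 68, "ocg"), ("oc1009", 67, "ocg"), ("oc947", 5, "ocg")]
print(find_unregistered_ocgs(sample_items, {}))
```

If every page-level entry shows up in the result, the page references OCGs that the document's optional-content configuration does not declare, which matches the situation described in the post. Whether removing those references changes the SVG output is not established in this thread.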