py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

_list_attachments() doesn't take into account /Kids #2087

Open alexis-via opened 1 year ago

alexis-via commented 1 year ago

On the pypdf main branch, the method _list_attachments() in _reader.py has a bug: https://github.com/py-pdf/pypdf/blob/main/pypdf/_reader.py#L2201

The current implementation of the method _list_attachments() only looks for attached files under /Names/EmbeddedFiles/Names. But attached files can be found under several locations:

  1. /Names/EmbeddedFiles/Names
  2. /Names/EmbeddedFiles/Kids/Names
  3. /Names/EmbeddedFiles/Kids/Kids/Names (it's not clear to me how many levels of /Kids you can have...)

Source: PDF Reference 1.7, section 3.8.5 "Name Trees". Extract:

"A name tree is constructed of nodes, each of which is a dictionary object. [...] The nodes are of three kinds, depending on the specific entries they contain. The tree always has exactly one root node, which contains a single entry: either Kids or Names but not both. If the root node has a Names entry, it is the only node in the tree. If it has a Kids entry, each of the remaining nodes is either an intermediate node, containing a Limits entry and a Kids entry, or a leaf node, containing a Limits entry and a Names entry."

Note that supporting /Kids is not specific to /EmbeddedFiles; it should be supported when parsing any Name tree.
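To make the structure described in the extract concrete, here is a minimal sketch of a recursive walk over a Name tree. It operates on plain Python dicts and lists rather than pypdf objects; the function name iter_name_tree and the sample tree are hypothetical and only illustrate the /Names and /Kids layout, not pypdf's actual implementation.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_name_tree(node: Dict[str, Any]) -> Iterator[Tuple[str, Any]]:
    """Yield (key, value) pairs from a name tree node, descending through /Kids."""
    # A leaf (or a root without children) stores a flat array:
    # [key1, value1, key2, value2, ...]
    names = node.get("/Names", [])
    for i in range(0, len(names) - 1, 2):
        yield names[i], names[i + 1]
    # An intermediate node (or a root with children) lists child nodes in /Kids.
    for kid in node.get("/Kids", []):
        yield from iter_name_tree(kid)


# Hypothetical tree mixing /Kids/Names and /Kids/Kids/Names.
embedded_files = {
    "/Kids": [
        {
            "/Limits": ["a.xml", "b.xml"],
            "/Kids": [
                {
                    "/Limits": ["a.xml", "b.xml"],
                    "/Names": ["a.xml", {"/F": "a.xml"}, "b.xml", {"/F": "b.xml"}],
                },
            ],
        },
        {"/Limits": ["c.pdf", "c.pdf"], "/Names": ["c.pdf", {"/F": "c.pdf"}]},
    ]
}

attachment_names = [name for name, _ in iter_name_tree(embedded_files)]
print(attachment_names)
```

The recursion handles any depth of /Kids, so the number of levels does not need to be known in advance. On a real file, the nodes would typically be indirect references that have to be resolved first, and a defensive implementation would probably also guard against cyclic /Kids references.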

For example, Ghostscript generates PDFs with attachments under /Names/EmbeddedFiles/Kids/Names. When you open such a file with Evince, Firefox, or any other PDF viewer that supports attachments, you will see the attachment. But when parsing the same file with pypdf, _list_attachments() does not return any result. Pointer to the Ghostscript source code that generates /EmbeddedFiles with /Kids: https://git.ghostscript.com/?p=ghostpdl.git;a=blob;f=devices/vector/gdevpdf.c;h=c76f8238654e21765a0da86ece25233cb13b6064;hb=HEAD#l2969

I have such a PDF file, generated by Ghostscript, with an attachment under /Names/EmbeddedFiles/Kids/Names, but it's an invoice that contains private information. I can try to anonymize it and share it here (or I can share it privately with a developer).

The method _list_attachments() was added in this PR by @pubpub-zz https://github.com/py-pdf/pypdf/pull/1611

alexis-via commented 1 year ago

I spent more time reading the source code of pypdf today and I didn't find any code that properly handles Name trees with the possibility of having /Kids or /Kids/Kids. This problem is not limited to _list_attachments(); it affects all code that parses Name trees. By the way, I also discovered a bug in add_attachment(), which is also linked to an implementation of Name trees that doesn't fully comply with the PDF standard, cf. https://github.com/py-pdf/pypdf/issues/2090. My feeling is that the best solution would be to add support for Name trees in generic/_data_structures.py. But, in the meantime, we can try to hack the code in _list_attachments().
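As a rough idea of what generic Name tree support could offer beyond listing, here is a sketch of a key lookup that uses the /Limits entry of each child to skip subtrees that cannot contain the key. It works on plain dicts; name_tree_lookup and the sample data are hypothetical, and keys are compared as ordinary Python strings, which is a simplification of the ordering used in real PDF files.

```python
from typing import Any, Dict, Optional


def name_tree_lookup(node: Dict[str, Any], key: str) -> Optional[Any]:
    """Return the value stored under `key` in a name tree, or None if absent."""
    names = node.get("/Names")
    if names is not None:
        # Leaf (or childless root): scan the flat [key, value, ...] array.
        for i in range(0, len(names) - 1, 2):
            if names[i] == key:
                return names[i + 1]
        return None
    for kid in node.get("/Kids", []):
        limits = kid.get("/Limits")
        # Skip children whose [least, greatest] range cannot contain the key.
        if limits is not None and not (limits[0] <= key <= limits[1]):
            continue
        found = name_tree_lookup(kid, key)
        if found is not None:
            return found
    return None


# Hypothetical two-leaf tree; the values stand in for file specifications.
tree = {
    "/Kids": [
        {"/Limits": ["a.xml", "b.xml"], "/Names": ["a.xml", "spec-a", "b.xml", "spec-b"]},
        {"/Limits": ["c.pdf", "d.pdf"], "/Names": ["c.pdf", "spec-c", "d.pdf", "spec-d"]},
    ]
}

print(name_tree_lookup(tree, "c.pdf"))
print(name_tree_lookup(tree, "missing.txt"))
```

A shared helper along these lines could serve every Name tree consumer, not just the attachment code.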

alexis-via commented 1 year ago

FYI, here is my code in the factur-x lib that lists attachments and supports /Kids: https://github.com/akretion/factur-x/blob/master/facturx/facturx.py#L247 (see methods _get_embeddedfiles() and _parse_embeddedfiles_kids_node()). The code is certainly not as good as it should be, but it has been working so far.

pubpub-zz commented 1 year ago

@alexis-via Can you provide a file that has this structure, in order to build the test?

pubpub-zz commented 1 year ago

@alexis-via +1?

pubpub-zz commented 1 year ago

@alexis-via still waiting for a test file with some /Kids data. I'm also looking for a file where an array of files is attached (using /RF instead of /EF).
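For context on the /RF case, here is a small sketch of how a file specification carrying both /EF and /RF might be flattened into a name-to-stream mapping. It assumes, based on my reading of the PDF Reference, that /EF maps a key such as /F to a single embedded stream while /RF maps the same key to a flat array of [name, stream, ...] pairs. The function collect_embedded_streams and the sample dictionary are hypothetical, and plain strings stand in for the stream objects.

```python
from typing import Any, Dict


def collect_embedded_streams(filespec: Dict[str, Any]) -> Dict[str, Any]:
    """Map file names to embedded streams, reading both /EF and /RF."""
    found: Dict[str, Any] = {}
    # /EF: one stream per key; the file name is stored under the same key
    # in the file specification itself (fall back to the key if missing).
    for key, stream in filespec.get("/EF", {}).items():
        found[filespec.get(key, key)] = stream
    # /RF (assumed layout): a flat array of [name1, stream1, name2, stream2, ...].
    for related in filespec.get("/RF", {}).values():
        for i in range(0, len(related) - 1, 2):
            # Keep the /EF entry if the same name appears in both places.
            found.setdefault(related[i], related[i + 1])
    return found


# Hypothetical file specification with one main file and one related file.
filespec = {
    "/F": "invoice.xml",
    "/EF": {"/F": "stream-main"},
    "/RF": {"/F": ["invoice.xml", "stream-main", "style.css", "stream-css"]},
}

streams = collect_embedded_streams(filespec)
print(streams)
```

A real test file would still be needed to confirm how producers actually write /RF.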