pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

`get_dest` works differently for PDF 1.2 vs PDF 1.1 #1053

Open dhdaines opened 1 month ago

dhdaines commented 1 month ago

The get_dest method of PDFDocument is defined as:

    def get_dest(self, name: Union[str, bytes]) -> Any:

Unfortunately, what this means in practice is that for PDF 1.1 documents it takes a `str`, while for PDF 1.2 documents it takes `bytes`. This is because in PDF 1.2 and later the named destinations are stored not in a dictionary but in a name tree, and according to the PDF 1.7 reference (page 88):

A name tree serves a similar purpose to a dictionary—associating keys and values—but by different means. A name tree differs from a dictionary in the following important ways: • Unlike the keys in a dictionary, which are name objects, those in a name tree are strings.

What this means in practice is that while pdfminer.six (dubiously) converts the keys of a dictionary to str (because they are name objects and thus kinda-sorta UTF-8, since PDF 1.2, see PDF 1.7 page 16), it cannot reasonably do this for the keys of a name tree as they are undifferentiated blobs of 8-bit data. In practice they can and will be various things including UTF-16 with a BOM (see the EmbeddedFiles in https://github.com/pdfminer/pdfminer.six/blob/master/samples/contrib/issue-625-identity-cmap.pdf).

This means that get_dest isn't really very useful since you have to know what the named destination is and how it's encoded before you can look it up. A better approach would be to allow the user to iterate over the destinations.
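To make the iteration idea concrete, here is a minimal sketch of walking a name tree and yielding every key/value pair. It assumes the tree has already been resolved into plain Python dicts and lists with `Kids` and `Names` entries, as described in the PDF reference; `iter_name_tree` is a hypothetical helper, not an existing pdfminer.six function, and real code would also need to resolve indirect object references.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_name_tree(node: Dict[str, Any]) -> Iterator[Tuple[bytes, Any]]:
    """Yield (key, value) pairs from a name tree node, depth first.

    Hypothetical helper: `node` is assumed to be a plain dict that
    stands in for an already-resolved PDF dictionary.
    """
    # "Names" is a flat array: [key1, value1, key2, value2, ...]
    names = node.get("Names", [])
    for i in range(0, len(names) - 1, 2):
        yield names[i], names[i + 1]
    # "Kids" holds child nodes with the same layout.
    for kid in node.get("Kids", []):
        yield from iter_name_tree(kid)


# Toy tree: keys are raw bytes, one of them UTF-16BE with a BOM.
tree = {
    "Kids": [
        {"Names": [b"chapter1", "dest-1"]},
        {"Names": [b"\xfe\xff\x00A", "dest-2"]},
    ]
}
dests = list(iter_name_tree(tree))
```

With something like this, a caller could list the raw keys first and decide how to decode them afterwards, instead of having to guess the exact bytes to pass to `get_dest`.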

dhdaines commented 1 month ago

Another note: PDF 1.7 specifies (page 367), with respect to the names of destinations:

The keys in the name tree may be treated as text strings for display purposes.

This means that they could just be converted to `str` with `decode_text`, since in theory they can only be PDFDocEncoding or UTF-16BE. (In practice they are almost certainly other things as well...)
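As a rough illustration of that conversion, the sketch below decodes a destination key as a text string: UTF-16BE if it starts with the byte order mark, otherwise a single-byte encoding. `decode_dest_name` is a hypothetical name, and Latin-1 is used here only as a stand-in for PDFDocEncoding (the two agree on printable ASCII but not everywhere), so this is not a substitute for pdfminer.six's own `decode_text`.

```python
def decode_dest_name(raw: bytes) -> str:
    """Decode a name tree key as a PDF text string (simplified sketch)."""
    # A leading FE FF byte order mark signals UTF-16BE.
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be", errors="replace")
    # Assumption: Latin-1 approximates PDFDocEncoding for this example.
    return raw.decode("latin-1")


plain = decode_dest_name(b"chapter1")
utf16 = decode_dest_name(b"\xfe\xff\x00A\x00B")
```

Keys that are neither of these two encodings would still decode to something, but not necessarily to what the document author intended.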