Open dhdaines opened 1 month ago
Another note: PDF 1.7 specifies (page 367), with respect to the names of destinations:
> The keys in the name tree may be treated as text strings for display purposes.

This means that they could just be converted to `str` with `decode_text`, since in theory they can only be PDFDocEncoding or UTF-16BE (in practice they are almost certainly other things as well...).
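To make the decoding idea concrete, here is a minimal standalone sketch. It does not use `decode_text` itself; `decode_name_key` is a hypothetical helper, and Latin-1 is only an assumed stand-in for PDFDocEncoding (the two differ in a number of code points), so treat the fallback as an approximation.

```python
import codecs


def decode_name_key(key: bytes) -> str:
    """Best-effort conversion of a name tree key to a text string.

    Hypothetical helper for illustration, not part of pdfminer.six.
    """
    # UTF-16BE with a byte order mark is the Unicode form PDF 1.7 allows.
    if key.startswith(codecs.BOM_UTF16_BE):
        return key[2:].decode("utf-16-be", errors="replace")
    # PDF 2.0 additionally allows UTF-8 introduced by a BOM.
    if key.startswith(codecs.BOM_UTF8):
        return key[3:].decode("utf-8", errors="replace")
    # Assumption: Latin-1 stands in for PDFDocEncoding here. It never
    # raises, but it maps some bytes differently from the real table.
    return key.decode("latin-1")
```

Because every branch either replaces bad sequences or cannot fail, the helper always returns a `str`, which suits the "for display purposes" wording; it should not be relied on for round-tripping keys back to bytes.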
The `get_dest` method of `PDFDocument` is defined as:

Unfortunately what this means in practice is that for PDF 1.1 documents, it takes a `str`, while for PDF 1.2 documents, it takes a `bytes`. This is because in PDF 1.2 and later the destination dictionary is not a dictionary but a name tree, and (PDF 1.7, page 88):

What this means in practice is that while `pdfminer.six` (dubiously) converts the keys of a dictionary to `str` (because they are name objects and thus kinda-sorta UTF-8, since PDF 1.2, see PDF 1.7 page 16), it cannot reasonably do this for the keys of a name tree as they are undifferentiated blobs of 8-bit data. In practice they can and will be various things including UTF-16 with a BOM (see the `EmbeddedFiles` in https://github.com/pdfminer/pdfminer.six/blob/master/samples/contrib/issue-625-identity-cmap.pdf).

This means that `get_dest` isn't really very useful since you have to know what the named destination is and how it's encoded before you can look it up. A better approach would be to allow the user to iterate over the destinations.
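As a rough illustration of what iterating could look like, the sketch below walks a name tree and yields every key with its value. It assumes the tree has already been loaded into plain Python dicts and lists with string keys `"Names"` and `"Kids"`; `iter_name_tree` and `example_tree` are hypothetical names, and resolving indirect object references, which a real document would need, is left out.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_name_tree(node: Dict[str, Any]) -> Iterator[Tuple[bytes, Any]]:
    """Yield (key, value) pairs from a name tree node and its descendants.

    Hypothetical helper: operates on plain dicts, not pdfminer.six objects.
    """
    # "Names" is a flat array alternating key, value, key, value, ...
    names = node.get("Names", [])
    for i in range(0, len(names) - 1, 2):
        yield names[i], names[i + 1]
    # Intermediate nodes list their children under "Kids".
    for kid in node.get("Kids", []):
        yield from iter_name_tree(kid)


# Made-up tree for illustration: two leaves, one key in UTF-16BE with a BOM.
example_tree = {
    "Kids": [
        {"Limits": [b"a", b"b"], "Names": [b"a", [0, "Fit"], b"b", [1, "Fit"]]},
        {"Names": [b"\xfe\xff\x00c", [2, "Fit"]]},
    ]
}

for key, dest in iter_name_tree(example_tree):
    print(key, dest)
```

The keys are yielded as raw `bytes`, so the caller can see exactly what is in the file and decide how to decode each one, rather than having to guess the encoding before a lookup.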