Additional PDF libraries for TestGrammar C++ PoC

petervwyatt commented 2 years ago

Since each PDF library seems to have its own set of nuances, having a wider choice might show up further PDF file malformations and non-compliances, or even allow for additional checks.

Other multi-platform C/C++ PDF SDKs to consider:

updated pdfium (although the internal interfaces currently being utilized seem to have changed)
QPDF
MuPDF
PoDoFo

ceztko commented 2 months ago

I don't want to create expectations here, but I'm working hard to have a 1.0.0 PoDoFo release later this year, which will feature a stable API. That will be a good moment to evaluate PoDoFo.

petervwyatt commented 2 months ago

Great! I also hope to soon write up an "experience report" article on some of the lesser supported aspects needed by tools that might do deep file validation based on the Arlington PDF Model (e.g. knowing if something was an indirect reference or not, knowing if duplicate keys were present, knowing if a string was a hex-string or not, etc.).
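As a rough illustration of the kind of syntactic detail meant here, the sketch below keeps track of whether a string was written as a literal or as a hex-string. The names StringSyntax, StringToken, classifyStringToken and decodeHexString are hypothetical and do not come from any of the libraries listed above; it also assumes a tokenizer has already isolated one complete token.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical types for illustration; not taken from any PDF library.
enum class StringSyntax { Literal, Hex };

struct StringToken {
    std::string raw;      // token exactly as written in the file
    StringSyntax syntax;  // which of the two string syntaxes was used
};

// Classifies a raw token; returns nullopt if it is not a string object.
std::optional<StringToken> classifyStringToken(std::string_view token) {
    if (token.size() < 2) return std::nullopt;
    if (token.front() == '(' && token.back() == ')')
        return StringToken{std::string(token), StringSyntax::Literal};
    // "<<" opens a dictionary, so a hex string is '<' not followed by '<'.
    if (token.front() == '<' && token.back() == '>' && token[1] != '<')
        return StringToken{std::string(token), StringSyntax::Hex};
    return std::nullopt;
}

// Value of one hex digit, or -1 if c is not a hex digit.
int hexDigitValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Decodes the bytes of a hex string token such as "<4A53>".
// Characters that are not hex digits (including whitespace) are skipped
// in this sketch; an odd number of digits is padded with a trailing 0.
std::string decodeHexString(std::string_view token) {
    assert(token.size() >= 2 && token.front() == '<' && token.back() == '>');
    std::string out;
    int high = -1;  // pending high nibble, or -1 if none
    for (char c : token.substr(1, token.size() - 2)) {
        int v = hexDigitValue(c);
        if (v < 0) continue;
        if (high < 0) {
            high = v;
        } else {
            out.push_back(static_cast<char>(high * 16 + v));
            high = -1;
        }
    }
    if (high >= 0) out.push_back(static_cast<char>(high * 16));
    return out;
}
```

With this, (JS) and <4A53> carry the same two bytes but remain distinguishable through the syntax field, which is the information a validator would want the SDK to preserve.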

ceztko commented 2 months ago

Good! For the 1.0 stable API, I'm actually cutting with an axe a lot of internal APIs which are not pretty enough to be exposed publicly, but after 1.0 it's certainly possible to investigate whether more details about the parsing process can be exposed. Of the things you mentioned, here is the PoDoFo status:

[X] Knowing if something was an indirect reference or not
[ ] Knowing if duplicate keys were present (in a PdfDictionary). [Partial] Today indexing is done on the raw value (escaped string): this can be detected by comparing the raw data with the unescaped/expanded string (both are kept). The entries are stored in a std::map, which doesn't allow multiple entries with exactly the same key. It could be migrated to std::multimap, allowing even this situation to be detected
[X] Knowing if a string was a hex-string or not

A comprehensive list in the article will certainly help.

petervwyatt commented 2 months ago

A few more off the top of my head and without a lot of detail (some of this may also be in the documentation so you know what to expect from the API - vs. having to work it out heuristically or getting a surprise 😀). Some of this is obtuse stuff but important if doing detailed low-level validation:

support for trailer dictionary as a "normal" dictionary (since no object ID) - so can iterate all keys, support private keys, etc.
knowing if a string object is Unicode or not (i.e. can get at the BOMs and BCP-47 language markers, etc.)
getting to the exact raw bytes of a string object (not re-encoded in UTF-8 or whatever is standard for the programming language; with escape sequences in-situ)
functionality/API still works even with unknown/unsupported encryption, since that only impacts strings and streams but the rest of the PDF objects are still functional and the DOM is navigable
duplicate keys: this is more complicated than it sounds, since (for example) /JS, /J#53, /#4aS and /#4a#53 are all technically the SAME key yet are spelled differently. And being able to access each of these separately...
getting to the exact raw bytes of name objects (with #-hex codes, for example vs treated as UTF-8 or de-escaped)
treatment of keys that have explicit null values - can these still be seen/accessed? I know the spec says to treat them as non-existent, but knowing a key is present vs not can be important
handling of objects with object numbers > trailer Size entry? Are these hidden? Still accessible via API?
how revisions (incremental updates) of files are handled (is it possible to access the trailer of each revision? access old or deleted objects that are still present in the file? etc)
documentation on what "version" means in the API - is it just the header comment? Also the DocCatalog Version entry? What if revisions (incremental updates) are present? Can the DocCatalog Version entry also be extracted independently?
access to Linearization objects (technically not linked into the PDF DOM)
how are "hybrid reference PDFs" processed? Can they be processed as either a pre-PDF 1.5 processor (without any cross-reference stream and object stream support) and/or a post-PDF 1.5 processor (with cross-reference stream and object stream support)?
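To make the "Unicode or not" item concrete, here is a minimal sketch of classifying a text string from its exact raw bytes by looking at the byte order mark. It assumes the bytes are already unescaped and decrypted; TextStringEncoding and detectTextStringEncoding are hypothetical names, not part of any SDK mentioned in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical classification of a PDF text string by its leading bytes.
enum class TextStringEncoding { PDFDocEncoding, UTF16BE, UTF8 };

// Inspects the raw bytes of a string object (already unescaped) and
// reports which text string encoding its BOM indicates.
TextStringEncoding detectTextStringEncoding(std::string_view bytes) {
    auto at = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    // FE FF marks UTF-16BE.
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return TextStringEncoding::UTF16BE;
    // EF BB BF marks UTF-8 (introduced for text strings in PDF 2.0).
    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return TextStringEncoding::UTF8;
    // No recognised BOM: fall back to PDFDocEncoding.
    return TextStringEncoding::PDFDocEncoding;
}

// Small demonstration of the expected classifications.
void demoDetectTextStringEncoding() {
    assert(detectTextStringEncoding("\xFE\xFF") == TextStringEncoding::UTF16BE);
    assert(detectTextStringEncoding("\xEF\xBB\xBF" "JS") == TextStringEncoding::UTF8);
    assert(detectTextStringEncoding("JS") == TextStringEncoding::PDFDocEncoding);
}
```

This only works if the API hands back the original bytes. Once a library has re-encoded the string into its own preferred encoding, the BOM (and with it this distinction) is gone.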

ceztko commented 2 months ago

> duplicate keys: this is more complicated than it sounds, since (for example) /JS, /J#53, /#4aS and /#4a#53 are all technically the SAME key yet are spelled differently. And being able to access each of these separately...

Ah, I had understood this to mean duplicate keys in XRef sections/streams, not keys in a dictionary. In a PoDoFo PdfDictionary, indexing is currently done on the raw value (escaped string), so this can be detected by comparing the raw data with the unescaped/expanded string (both are kept). The entries are stored in a std::map, which doesn't allow multiple entries with exactly the same key. It could be migrated to std::multimap, which would allow even this situation to be detected (at least by iterating all pairs stored in the dictionary).
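A minimal sketch of the std::multimap idea, using only the standard library: keys are indexed by their decoded name while the raw spelling is kept as the mapped value, so both differently escaped duplicates and exact duplicates remain visible. This is not PoDoFo code; decodeName, RawKeyIndex, indexKeys and duplicateKeys are hypothetical names, and a real dictionary would store the value object next to the raw spelling.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <string_view>
#include <vector>

// Expands #xx escapes in a raw name token such as "/J#53".
// A '#' not followed by two hex digits is copied through unchanged.
std::string decodeName(std::string_view raw) {
    auto hex = [](char c) -> int {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    };
    std::string out;
    for (std::size_t i = 0; i < raw.size(); ++i) {
        if (raw[i] == '#' && i + 2 < raw.size()) {
            int hi = hex(raw[i + 1]);
            int lo = hex(raw[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out.push_back(static_cast<char>(hi * 16 + lo));
                i += 2;  // skip the two hex digits just consumed
                continue;
            }
        }
        out.push_back(raw[i]);
    }
    return out;
}

// Decoded name -> raw spelling as written in the file.
// std::multimap keeps equal keys in insertion order.
using RawKeyIndex = std::multimap<std::string, std::string>;

RawKeyIndex indexKeys(const std::vector<std::string>& rawKeys) {
    RawKeyIndex index;
    for (const auto& raw : rawKeys) index.emplace(decodeName(raw), raw);
    return index;
}

// Returns every decoded name that occurs more than once.
std::vector<std::string> duplicateKeys(const RawKeyIndex& index) {
    std::vector<std::string> dups;
    for (auto it = index.begin(); it != index.end();
         it = index.upper_bound(it->first)) {
        if (index.count(it->first) > 1) dups.push_back(it->first);
    }
    return dups;
}

// Small demonstration using the four spellings from the discussion.
void demoDuplicateKeys() {
    RawKeyIndex index = indexKeys({"/JS", "/J#53", "/#4aS", "/#4a#53", "/Type"});
    assert(index.count("/JS") == 4);
    assert(duplicateKeys(index) == std::vector<std::string>{"/JS"});
}
```

Iterating index.equal_range("/JS") then yields each raw spelling separately and in file order, which is roughly the "access each of these separately" requirement.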

pdf-association / arlington-pdf-model

Additional PDF libraries for TestGrammar C++ PoC #18