pdf-association / arlington-pdf-model

A vendor- and implementation-independent specification-derived, machine-readable model of PDF.
Apache License 2.0

Additional PDF libraries for TestGrammar C++ PoC #18

Open petervwyatt opened 2 years ago

petervwyatt commented 2 years ago

Since each PDF library seems to have its own set of nuances, having a wider choice might show up further PDF file malformations and non-compliances, or even allow for additional checks.

Other multi-platform C/C++ PDF SDKs to consider:

ceztko commented 2 months ago

I don't want to create expectations here, but I'm working hard toward a PoDoFo 1.0.0 release later this year, which will feature a stable API. That will be a good moment to evaluate PoDoFo.

petervwyatt commented 2 months ago

Great! I also hope to soon write up an "experience report" article on some of the less well-supported aspects needed by tools that might do deep file validation based on the Arlington PDF Model (e.g. knowing if something was an indirect reference or not, knowing if duplicate keys were present, knowing if a string was a hex-string or not, etc.).

ceztko commented 2 months ago

Good! For the 1.0 stable API, I'm actually taking an axe to a lot of internal APIs that are not polished enough to be exposed publicly, but after 1.0 it's certainly possible to investigate whether more details about the parsing process can be exposed. Of the things you mentioned, here is the PoDoFo status:

A comprehensive list in the article will certainly help.

petervwyatt commented 2 months ago

A few more off the top of my head and without a lot of detail (some of this may also be in the documentation so you know what to expect from the API - vs. having to work it out heuristically or getting a surprise 😀). Some of this is obtuse stuff but important if doing detailed low-level validation:

ceztko commented 2 months ago
  • duplicate keys: this is more complicated than it sounds, since (for example) /JS, /J#53, /#4aS and /#4a#53 are all the SAME key technically yet differently semantically. And being able to access each of these separately...

Ah, I had understood this as duplicate keys in XRef sections/streams, not keys in a dictionary. In a PoDoFo PdfDictionary today, indexing is done on the raw value (escaped string): this can be detected by comparing the raw data with the unescaped/expanded string (both are kept). The entries are stored in a std::map, which doesn't allow multiple entries with exactly the same key. It could be migrated to std::multimap, which would allow detecting even this situation (at least by iterating over all pairs stored in the dictionary).
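
To make the duplicate-key point concrete, here is a minimal standalone sketch in plain C++ of the two ideas discussed above: expanding `#xx` escapes in a raw name, and grouping raw keys by their expanded form in a std::multimap so that every spelling stays accessible. It is not PoDoFo code and does not use any PoDoFo API. `DecodeName` and `FindDuplicateKeys` are hypothetical helper names, and the input is assumed to be the raw key spellings (without the leading `/`) in file order, which a parser would have to expose.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not a PoDoFo API): expand #xx hex escapes in a raw
// PDF name given without its leading '/'. Malformed escapes are kept as-is.
std::string DecodeName(const std::string& raw) {
    auto hexval = [](char c) -> int {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    };
    std::string out;
    for (std::size_t i = 0; i < raw.size(); ++i) {
        if (raw[i] == '#' && i + 2 < raw.size()) {
            int hi = hexval(raw[i + 1]);
            int lo = hexval(raw[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out.push_back(static_cast<char>(hi * 16 + lo));
                i += 2;  // skip the two hex digits just consumed
                continue;
            }
        }
        out.push_back(raw[i]);
    }
    return out;
}

// Hypothetical helper: map each decoded name that occurs more than once to
// all of its raw spellings, in the order they were supplied.
std::map<std::string, std::vector<std::string>>
FindDuplicateKeys(const std::vector<std::string>& rawKeys) {
    // A multimap keeps every entry, unlike std::map, which keeps one per key.
    std::multimap<std::string, std::string> byDecoded;
    for (const auto& raw : rawKeys)
        byDecoded.emplace(DecodeName(raw), raw);

    std::map<std::string, std::vector<std::string>> dups;
    for (auto it = byDecoded.begin(); it != byDecoded.end();) {
        auto range = byDecoded.equal_range(it->first);
        if (std::distance(range.first, range.second) > 1) {
            for (auto r = range.first; r != range.second; ++r)
                dups[it->first].push_back(r->second);
        }
        it = range.second;  // jump to the next distinct decoded name
    }
    return dups;
}
```

With this sketch, the four spellings JS, J#53, #4aS and #4a#53 all decode to JS, so a dictionary containing more than one of them would be reported as a single duplicated key with each raw spelling still listed separately. Keying on the decoded name rather than the raw bytes is a choice made for this illustration; a validator may want both views.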