Open petervwyatt opened 2 years ago
I don't want to create expectations here, but I'm working hard to have a 1.0.0 PoDoFo release later this year, which will feature a stable API. That will be a good moment to evaluate PoDoFo.
Great! I also hope to soon write up an "experience report" article on some of the lesser supported aspects needed by tools that might do deep file validation based on the Arlington PDF Model (e.g. knowing if something was an indirect reference or not, knowing if duplicate keys were present, knowing if a string was a hex-string or not, etc.).
Good! For the 1.0 stable API, I'm actually taking an axe to a lot of internal APIs that are not pretty enough to be exposed publicly, but after 1.0 it's certainly possible to investigate whether more details about the parsing process can be exposed. For the things you mentioned, here is the PoDoFo status:
- Duplicate keys (`PdfDictionary`). [Partial] Today indexing is done on the raw value (escaped string): this can be detected by comparing the raw data with the unescaped/expanded string (both are kept). The entries are stored in a `std::map`, which doesn't allow multiple entries with the exact same key. It could be migrated to `std::multimap`, allowing detection of even this situation.
A comprehensive list in the article will certainly help.
A few more off the top of my head and without a lot of detail (some of this may also be in the documentation so you know what to expect from the API, vs. having to work it out heuristically or getting a surprise 😀). Some of this is obtuse stuff but important if doing detailed low-level validation:
- duplicate keys: this is more complicated than it sounds, since (for example) `/JS`, `/J#53`, `/#4aS` and `/#4a#53` are all technically the SAME key yet semantically different. And being able to access each of these separately...
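To make the name example concrete, here is a minimal C++ sketch of expanding `#xx` escapes in a raw name token. The helper names `decodeName` and `wasEscaped` are hypothetical and are not part of PoDoFo or any other SDK; the sketch assumes the leading `/` has already been stripped and simply copies malformed escapes through unchanged.

```cpp
#include <cstddef>
#include <string>

// Convert one hex digit to its numeric value, or -1 if it is not a hex digit.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: expand #xx escapes in a raw PDF name token
// (given without the leading '/'). Malformed escapes are copied through
// unchanged; a real validator would probably want to report them.
std::string decodeName(const std::string& raw) {
    std::string out;
    out.reserve(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i) {
        if (raw[i] == '#' && i + 2 < raw.size()) {
            const int hi = hexValue(raw[i + 1]);
            const int lo = hexValue(raw[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out.push_back(static_cast<char>(hi * 16 + lo));
                i += 2;  // skip the two hex digits just consumed
                continue;
            }
        }
        out.push_back(raw[i]);
    }
    return out;
}

// Hypothetical helper: true when the raw spelling differs from the decoded
// name, i.e. the token contained at least one valid #xx escape.
bool wasEscaped(const std::string& raw) {
    return decodeName(raw) != raw;
}
```

With this sketch, `decodeName("JS")`, `decodeName("J#53")`, `decodeName("#4aS")` and `decodeName("#4a#53")` all return `JS`, while `wasEscaped` is false only for the first spelling. That is the distinction a low-level validator would need the API to preserve.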
Ah, I had understood duplicate keys to mean those in XRef sections/streams, not keys in a dictionary. In a PoDoFo `PdfDictionary`, indexing today is done on the raw value (escaped string): this can be detected by comparing the raw data with the unescaped/expanded string (both are kept). The entries are stored in a `std::map`, which doesn't allow multiple entries with the exact same key. It could be migrated to `std::multimap`, which would allow detecting even this situation (at least by iterating all pairs stored in the dictionary).
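As a rough illustration of the `std::multimap` idea, the sketch below keeps every occurrence of a key together with its raw spelling. `RawDictionary` and `Entry` are hypothetical types, not PoDoFo's `PdfDictionary`. Unlike the current behaviour described above, this sketch assumes the index is the decoded key and stores the raw (escaped) spelling alongside the value, so both exact duplicates and differently escaped duplicates land in the same range.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical entry type: the raw (escaped) spelling is kept next to the value.
struct Entry {
    std::string rawKey;  // as written in the file, e.g. "J#53"
    std::string value;   // stand-in for a real PDF object
};

// Hypothetical container (not PoDoFo's PdfDictionary): indexed by the decoded
// key, but std::multimap keeps every occurrence instead of overwriting.
class RawDictionary {
public:
    void add(const std::string& decodedKey, const std::string& rawKey,
             const std::string& value) {
        entries_.emplace(decodedKey, Entry{rawKey, value});
    }

    // Number of entries stored under one decoded key.
    std::size_t count(const std::string& decodedKey) const {
        return entries_.count(decodedKey);
    }

    // Decoded keys that occur more than once, in sorted order.
    std::vector<std::string> duplicateKeys() const {
        std::vector<std::string> result;
        for (auto it = entries_.begin(); it != entries_.end();
             it = entries_.upper_bound(it->first)) {
            if (entries_.count(it->first) > 1) result.push_back(it->first);
        }
        return result;
    }

    // Every raw spelling recorded for one decoded key, in insertion order
    // (std::multimap inserts equivalent keys at the end of their range).
    std::vector<std::string> rawSpellings(const std::string& decodedKey) const {
        std::vector<std::string> result;
        const auto range = entries_.equal_range(decodedKey);
        for (auto it = range.first; it != range.second; ++it) {
            result.push_back(it->second.rawKey);
        }
        return result;
    }

private:
    std::multimap<std::string, Entry> entries_;
};
```

Adding `("JS", "JS", ...)` and then `("JS", "J#53", ...)` would make `duplicateKeys()` report `JS`, and `rawSpellings("JS")` would return both spellings, which is the "iterate all pairs" style of access mentioned above.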
Since each PDF library seems to have its own set of nuances, having a wider choice might show up further PDF file malformations and non-compliances, or even allow for additional checks.
Other multi-platform C/C++ PDF SDKs to consider: