pdf-association / arlington-pdf-model

A vendor- and implementation-independent, specification-derived, machine-readable model of PDF.
Apache License 2.0
74 stars 6 forks

Identify hybrid-reference PDFs #19

Open petervwyatt opened 2 years ago

petervwyatt commented 2 years ago

TestGrammar (C++ PoC) currently only reports on traditional-style cross-reference (xref) tables or cross-reference streams. It should be expanded to also identify hybrid-reference PDFs, even though they are relatively rare:

7.5.8.4 Compatibility with applications that do not support compressed reference streams

A hybrid-reference PDF file is readable by PDF processors designed only to support versions of PDF before PDF 1.5. Such a PDF file contains objects referenced by standard cross-reference tables in addition to objects in object streams that are referenced by cross-reference streams.

petervwyatt commented 2 years ago

Thanks @tballison for the question that prompted this!

petervwyatt commented 1 year ago

Having trouble working out how to do this with all PDF SDKs...

petervwyatt commented 1 year ago

A hybrid-reference PDF can be identified by the presence of the XRefStm key in the trailer dictionary, as per Table 19 (and the Note below Table 15).
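
As a rough illustration of the XRefStm check (not how TestGrammar or any PDF SDK does it), here is a minimal Python sketch that scans the raw bytes of a file for an XRefStm entry between each `trailer` keyword and the following `startxref`. The function names `find_xrefstm_offsets` and `is_hybrid_reference` are hypothetical. The scan is a heuristic only: it does not parse the PDF, so it can be fooled by the word `trailer` appearing inside uncompressed stream data, and it makes no attempt to validate the offsets it finds.

```python
import re
from typing import List

# Matches "/XRefStm <integer>" as written in a classic trailer dictionary.
XREFSTM_RE = re.compile(rb"/XRefStm\s+(\d+)")


def find_xrefstm_offsets(data: bytes) -> List[int]:
    """Return the XRefStm values found in classic trailer sections.

    Heuristic: for every 'trailer' keyword, search up to the next
    'startxref' keyword (or end of data) for an /XRefStm entry.
    """
    offsets = []
    pos = 0
    while True:
        start = data.find(b"trailer", pos)
        if start == -1:
            break
        end = data.find(b"startxref", start)
        if end == -1:
            end = len(data)
        match = XREFSTM_RE.search(data, start, end)
        if match:
            offsets.append(int(match.group(1)))
        pos = end  # continue after this trailer section
    return offsets


def is_hybrid_reference(data: bytes) -> bool:
    """True if at least one classic trailer carries an XRefStm key."""
    return bool(find_xrefstm_offsets(data))
```

To try it on a file, read the bytes with `open(path, "rb").read()` (where `path` is whatever local PDF you want to check) and pass them to `is_hybrid_reference`. A raw-byte scan is used here because the XRefStm key only appears in traditional (uncompressed) trailer dictionaries, so no stream decoding is needed for this particular check.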

Fix for Issue #39 means that pdfium will now report:

...
       1:   Trailer (as XRefStream)
Info: unknown key 'XRefStm' is not defined in Arlington for XRefStream in PDF 1.7

PDFix does not report anything currently.

petervwyatt commented 1 year ago

Example hybrid PDF: https://www.ema.europa.eu/documents/product-information/rapamune-epar-product-information_en.pdf (has other issues also, just to add to the fun 😁)