pdf-association / pdf-issues

Industry-based resolutions for issues and errata reported against any PDF-related specification
https://pdf-issues.pdfa.org/

Hybrid-reference PDFs should be deprecated #115

Open petervwyatt opened 2 years ago

petervwyatt commented 2 years ago

A comment/discussion arising from the ISO TC 171 SC 2 WG 8 "Securing PDF" discussion group:

Hybrid-reference PDFs as defined in clause 7.5.8.4 "Compatibility with applications that do not support compressed reference streams" should be deprecated. This is because (a) no current PDF writes them anymore; and (b) all PDF readers provide support for cross-reference streams and compressed object streams.

Note also the very specific definition of deprecated in ISO 32000-2:2020, Term 3.15:

deprecated
a part of ISO 32000 that should not be written into a PDF 2.0 document, and should be ignored by a PDF processor (3.49)
Note 1 to entry: In some cases variations on these restrictions on continued use of a deprecated feature are explicitly stated in this document.
Note 2 to entry: Implementers are cautioned that some features that are deprecated in this document could have tighter constraints placed on them, or even be removed completely, in a later version of ISO 32000, or in subset standards such as PDF/X (ISO 15930), PDF/A (ISO 19005), PDF/E (ISO 24517), PDF/VT (ISO 16612-2 and ISO 16612-3) and PDF/UA (ISO 14289).

MatthiasValvekens commented 2 years ago

I have a couple of questions related to the logistics of deprecating this. Perhaps I'm reading into the word "ignore" too much, but from my (admittedly limited) experience with hybrid reference files, not special-casing the XRefStm entry when doing incremental updates can cause weird things to happen. In the updated document trailer, that entry would be blindly copied over (in accordance with the provisions of 7.5.6), which would then potentially affect the way a hybrid-aware reader processes these cross-reference sections---especially with multiple such updates! I can elaborate on a real-world problem case I've witnessed, if needed.

How would a future PDF processor handle that? I don't think it's possible to reliably update a hybrid-reference file without actually processing these hybrid xref sections. Do we recommend against trying to perform incremental updates on hybrid-reference files? Allow overriding the XRefStm entry with null to indicate that the update doesn't come with an extra XRef stream? I'm just thinking out loud here.
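To make the trailer-copying concern concrete, here is a minimal sketch of one option a writer could take: carry the previous trailer's entries into the update but drop XRefStm instead of copying it blindly. The trailer is modelled as a plain Python dict keyed by PDF name, and `build_update_trailer` and its parameters are hypothetical names used only for illustration. This is an assumption about one possible behaviour, not something the specification currently mandates, and it does not remove the need to process hybrid sections in earlier revisions when reading the file.

```python
def build_update_trailer(prev_trailer: dict, prev_startxref: int, new_size: int) -> dict:
    """Build the trailer for an incremental update (illustrative sketch only).

    prev_trailer   -- previous trailer dictionary, modelled as a plain dict
    prev_startxref -- byte offset of the previous cross-reference section
    new_size       -- highest object number in the update, plus one
    """
    # Copy entries such as Root, Info and ID, but skip the two entries that
    # describe the *previous* cross-reference section. Copying XRefStm here
    # would point a hybrid-aware reader at a stale cross-reference stream.
    trailer = {
        key: value
        for key, value in prev_trailer.items()
        if key not in ("XRefStm", "Prev")
    }
    # Chain the new section to the previous one.
    trailer["Prev"] = prev_startxref
    # Size must never shrink across updates.
    trailer["Size"] = max(new_size, prev_trailer.get("Size", 0))
    return trailer
```

For example, calling it with a previous trailer that contains `XRefStm` yields an update trailer without that entry, while `Root` and the other entries are preserved.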

Unfortunately I don't think this question is purely academic either... I recently had some hybrid-reference PDF files sent to me produced in 2021 by a version of Microsoft Word ~from almost 10 years ago. In some areas, that's apparently still commonplace.~

EDIT: Never mind, apparently MS Word's default exporter still outputs hybrid reference files, see below.

Granted, that was in connection with one of my personal projects, and the end user in question wasn't a business user as far as I know. Perhaps the commercial world has done a better job of keeping with the times.

mkl-public commented 2 years ago

You don't have to go that far back, current MS Word 365 still exports to PDF using hybrid reference PDFs, at least the export I just triggered did. Thus, Peter's item "(a) no current PDF writes them anymore" unfortunately doesn't hold.
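For anyone wanting to repeat this check on their own exports, a rough sketch follows. It scans the raw bytes for classic `trailer` sections and reports any XRefStm offsets it finds; a non-empty result suggests a hybrid-reference file. The function name `find_xrefstm_offsets` is hypothetical, and the scan is a heuristic that assumes the keyword `trailer` does not occur inside stream data, so it is no substitute for a real parser.

```python
import re

# XRefStm is an integer byte offset in the trailer dictionary.
_XREFSTM = re.compile(rb"/XRefStm\s+(\d+)")


def find_xrefstm_offsets(pdf_bytes: bytes) -> list[int]:
    """Return the XRefStm offsets found in classic trailer sections."""
    offsets = []
    pos = 0
    while True:
        start = pdf_bytes.find(b"trailer", pos)
        if start == -1:
            break
        # A classic trailer dictionary is followed by the startxref keyword.
        end = pdf_bytes.find(b"startxref", start)
        if end == -1:
            end = len(pdf_bytes)
        match = _XREFSTM.search(pdf_bytes, start, end)
        if match:
            offsets.append(int(match.group(1)))
        pos = end
    return offsets
```

Usage would be along the lines of `find_xrefstm_offsets(open("export.pdf", "rb").read())`, where `export.pdf` is a placeholder file name.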

MatthiasValvekens commented 2 years ago

Huh, you're right! The MS Word install on my work laptop ships with an Adobe plugin (Acrobat PDFMaker) that doesn't exhibit this behaviour, but the stock "Export to PDF/XPS" option indeed does. Same for the export functionality in the in-browser version of Office 365. Interesting...

petervwyatt commented 2 years ago

Yuck. Dang.

petervwyatt commented 2 years ago

After discussion in the ISO TC 171 SC 2 WG 8 "Securing PDF" discussion group, re-opening this issue and labelling it as a future enhancement. The DG members are of the opinion that deprecating hybrid-reference PDFs is appropriate when looking to the future of PDF 2.0 and beyond, even though a specific vendor still generates them, because all PDF processors are already PDF 1.5 aware. PDF 2.0 also clearly defines "deprecated" as "should not write, should not read", so this is not an issue for that vendor, and it clearly indicates our intentions to the marketplace.