pdf-association / pdf-issues

Industry-based resolutions for issues and errata reported against any PDF-related specification
https://pdf-issues.pdfa.org/

Conflicts between xref table and xref stream in hybrid-reference files #237

Open bdoubrov opened 1 year ago

bdoubrov commented 1 year ago

Here is yet another issue with hybrid PDFs (see #115 and #146)

The text after Table 19 in clause 7.5.8.4 "Compatibility with applications that do not support compressed reference streams" says:

When a PDF reader opens a hybrid-reference PDF file, objects with entries in cross-reference streams are not hidden. When the PDF reader searches for an object, if an entry is not found in any given standard cross-reference section, the search shall proceed to a cross-reference stream specified by the XRefStm entry before looking in the previous cross-reference section (the Prev entry in the trailer). NOTE Hidden objects, therefore, have two cross-reference entries. One is in the cross-reference stream. The other is a free entry in some previous section, typically the section referenced by the Prev entry. A PDF reader shall look in the cross-reference stream first, find the object there, and shall ignore the free entry in the previous section.

This wording does not specify what the PDF reader should do in the following use case:

Should such an object be treated as deleted (so that a reference to it is equivalent to null), or as a real object as specified by the xref stream?

As a side note, the term "hidden object" is not defined anywhere, yet it is used 5 times in this clause. From the context it probably means objects that are located in compressed object streams and referenced from xref streams, which makes them "hidden" from PDF 1.4 parsers.

MatthiasValvekens commented 1 year ago

I haven't double-checked this, but I believe the xref table in the revision containing the hybrid stream would still take precedence over the stream content. I've always interpreted the hybrid-reference mechanism as "inserting" an extra stream in between the current and the previous revision. If that interpretation is correct, then the f marker in the table "wins", and the object should be considered null.

I can see this arising in two cases:

FWIW, I'm not aware of any processors that do either of these, so maybe my intuition is completely wrong. The "real-world" hybrid-reference files that I've seen lately (i.e. those produced by MS Word) tend to use an empty xref table in the second/"hybrid" revision IIRC.

mkl-public commented 9 months ago

@MatthiasValvekens believes correctly, see the section 7.5.8.4 in ISO 32000-2 already quoted by @bdoubrov:

When the PDF reader searches for an object, if an entry is not found in any given standard cross-reference section, the search shall proceed to a cross-reference stream specified by the XRefStm entry before looking in the previous cross-reference section (the Prev entry in the trailer). NOTE [...] A PDF reader shall look in the cross-reference stream first, find the object there, and shall ignore the free entry in the previous section.

So you look for an object entry first in the newest cross-reference table section. If it isn't there, you look in the XRefStm attached to that section. And if the object entry is not there either, you continue with the previous section, then the XRefStm of that section, and so on.
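To make that search order concrete, here is a minimal sketch of it. It assumes the xref data has already been parsed into plain dictionaries; the `Section` class, the `lookup` function and all offsets are made up for illustration and are not taken from the spec or from any real PDF library.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Section:
    """One classic cross-reference section (hypothetical model).

    table:    entries of the classic xref table, object number -> (type, offset),
              where type is "n" (in use) or "f" (free)
    xref_stm: entries of the stream referenced by the trailer's XRefStm key
    prev:     the section referenced by the trailer's Prev key, if any
    """
    table: dict
    xref_stm: dict = field(default_factory=dict)
    prev: Optional["Section"] = None


def lookup(section, obj_num):
    """Return the first entry found for obj_num, or None if there is none."""
    while section is not None:
        # 1. The classic table of this section is consulted first.
        if obj_num in section.table:
            return section.table[obj_num]
        # 2. Only if the table has no entry: the XRefStm of the same section.
        if obj_num in section.xref_stm:
            return section.xref_stm[obj_num]
        # 3. Otherwise continue with the Prev section.
        section = section.prev
    return None


# Hidden-object layout: free entry in the older section, in-use entry in the stream.
base = Section(table={1: ("n", 15), 3: ("f", None)})
hybrid = Section(table={2: ("n", 500)}, xref_stm={3: ("n", 900)}, prev=base)
print(lookup(hybrid, 3))

# Conflict layout from this issue: table and stream of the same section disagree.
conflict = Section(table={3: ("f", None)}, xref_stm={3: ("n", 900)}, prev=base)
print(lookup(conflict, 3))
```

Under this reading the first lookup yields the in-use stream entry, while in the conflict layout the free entry in the table is found first, so the object would be treated as null. Whether processors actually behave this way is exactly what this issue asks.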

(In the quote the sentence after the note can be a bit misleading: If it is read as an isolated sentence, one may think that the XRefStm of a table section has to be consulted first, even before the section itself. Actually, though, the sentence is still in the situation of the sentence before the note - an entry is not found in a given standard cross-reference section - even if it is separated by a paragraph break and a note. This may be clarified.)

petervwyatt commented 8 months ago

I also see a few other problems here - mainly that the section title is "7.5.8.4 Compatibility with applications that do not support compressed reference streams", yet much of the wording and many of the requirements apply to applications that DO support cross-reference streams. And it should be "cross-reference streams". Arguably people might ignore this section entirely based on the heading and thus miss critical requirements for PDF 1.5 and later processors...

Things are also expressed in terms of a generic "PDF reader", yet I believe the intention of the wording and the example is to explain how PDF writers can create hybrid-reference PDFs that are compatible both with pre-PDF 1.5 readers and with PDF 1.5 and later readers that do support cross-reference streams. This also explains what I think is meant by the term "hidden objects": those are objects only visible to PDF 1.5 and later readers that support cross-reference streams, and thus "invisible" to pre-PDF 1.5 processors - the structure tree (object 3) in the case of the example.
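If "hidden objects" is read that way, the set of hidden objects can be sketched as the objects that resolve to an in-use entry when XRefStm is honoured but not when it is ignored. The snippet below is only an illustration of that reading; `resolve`, `hidden_objects`, the data layout and the offsets are all hypothetical.

```python
def resolve(sections, obj_num, use_xref_stm):
    """sections: list of (table, xref_stm) dict pairs, newest section first.

    A pre-PDF 1.5 reader is modelled with use_xref_stm=False, i.e. it
    never looks at the stream referenced by XRefStm.
    """
    for table, xref_stm in sections:
        if obj_num in table:
            return table[obj_num]
        if use_xref_stm and obj_num in xref_stm:
            return xref_stm[obj_num]
    return None


def is_in_use(entry):
    """True for an ("n", offset) entry, False for a free or missing entry."""
    return entry is not None and entry[0] == "n"


def hidden_objects(sections):
    """Object numbers visible to a PDF 1.5+ reader but not to an older one."""
    numbers = set()
    for table, xref_stm in sections:
        numbers |= table.keys() | xref_stm.keys()
    return sorted(
        n for n in numbers
        if is_in_use(resolve(sections, n, True))
        and not is_in_use(resolve(sections, n, False))
    )


# Newest section first; object 3 is free in the old table and in use in the stream.
sections = [
    ({2: ("n", 500)}, {3: ("n", 900)}),
    ({1: ("n", 15), 3: ("f", None)}, {}),
]
print(hidden_objects(sections))
```

In this made-up layout only object 3 comes out as hidden: a reader that ignores XRefStm finds just the free entry in the older section.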