pdf-association / pdf-issues

Industry-based resolutions for issues and errata reported against any PDF-related specification
https://pdf-issues.pdfa.org/

"pertinent entries" in ToUnicode CMap stream dictionary #462

Open seehuhn opened 2 weeks ago

seehuhn commented 2 weeks ago

Section 9.10.3 of the PDF-2.0 spec states

The only pertinent entry in the CMap stream dictionary (see "Table 118 — Additional entries in a CMap stream dictionary") is UseCMap, which may be used if the CMap is based on another ToUnicode CMap.

Table 118 lists the following entries as required: Type, CMapName, CIDSystemInfo. Does the above sentence mean that these entries are not required for ToUnicode CMaps? It would be great if the spec could clarify what the meaning of "only pertinent entry" is in this context.

DietrichSeggern commented 2 weeks ago

In my opinion, yes: they are not required and normally do not make sense.

seehuhn commented 2 weeks ago

Some additional thoughts:

I agree that CMapName and CIDSystemInfo are not useful for ToUnicode CMaps.

Even if it turns out that the corresponding fields are not required in ToUnicode CMap stream dictionaries, probably Type should be required?

The only example of a ToUnicode CMap in the spec (Section 9.10.3, Example 2) does include the fields in question:

16 0 obj
<<
/Type /CMap
/CMapName /Adobe-Identity-UCS2
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS2) /Supplement 0 >>
/Length 433
>>
stream
...
endstream

(As mentioned in #344, I suspect that the CIDSystemInfo in the example may be wrong, though.)

petervwyatt commented 2 weeks ago

Rewording as follows may help distinguish between required keys (which are always required!) and the use of "pertinent":

In addition to the required entries, the only pertinent entry in the CMap stream dictionary ...

This makes it clear that "pertinent" is not attempting to dismiss the required-ness of the other entries.

DietrichSeggern commented 2 weeks ago

But none of the PDFs with a ToUnicode CMap that I just looked at contains any of these entries. Attached is a PASS file taken from the veraPDF test suite (veraPDF test suite 6-2-10-7-t01-pass-a.pdf). Or am I missing something?

seehuhn commented 2 weeks ago

Inspired by @DietrichSeggern's comment I checked the PDF files on my laptop: the files contain a total of 60477 ToUnicode CMaps. Here is how often each key in the stream dicts occurs:

So only 30 out of 60477 ToUnicode maps I inspected included the fields in question.
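A tally of this kind can be sketched in a few lines of Python. This is only an illustration, not the script used for the count above: it assumes the ToUnicode stream dictionaries have already been extracted with some PDF library and are available as plain dicts keyed by name without the leading slash. The function name `count_keys` and the sample data are hypothetical.

```python
from collections import Counter

def count_keys(stream_dicts):
    """Count in how many stream dictionaries each key occurs."""
    counts = Counter()
    for d in stream_dicts:
        # set(d) yields the keys; each key is counted once per dictionary
        counts.update(set(d))
    return counts

# Hypothetical sample data, NOT taken from the files mentioned above.
sample = [
    {"Length": 433, "Filter": "FlateDecode"},
    {"Length": 210},
    {"Type": "CMap", "CMapName": "Adobe-Identity-UCS2", "Length": 433},
]

counts = count_keys(sample)
for key, n in counts.most_common():
    print(f"{key}: {n} of {len(sample)}")
```

Counting each key once per dictionary makes the result directly comparable to the total number of ToUnicode CMaps inspected.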

petervwyatt commented 2 weeks ago

I was just following the bouncing ball of references... clearly not reflecting reality!

I guess the ToUnicode definition does say it is "A stream containing a CMap file..." and doesn't reference the CMap stream dictionary definition in Table 118, but it's hard to tell whether this is legacy language or an explicit, nuanced sentence. This is also what the 1st bullet near the end of 9.10.1 implies. The text generally confuses CMap (the data syntax) with CMap (the PDF stream object).

So maybe in this specific case "pertinent" does mean that the only key you can expect to find in a ToUnicode stream dictionary is UseCMap, since it is not a "CMap stream" but a "stream that is a (slightly tweaked) CMap".

If that is true, then the consistent method to correct this would be to add a new Table titled "additional entries in a ToUnicode stream dictionary" and list just UseCMap. This is how all other streams in 32K are defined that have special keys beyond the standard set for streams. That way it would be explicitly unambiguous. But maybe the other CMap stream dictionary keys (like Type) are optional... I really don't know so let's also ask @lrosenthol to do some PDF archeology since extant data doesn't always get things correct.
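To make the two readings discussed in this thread concrete, here is a minimal Python sketch of how a checker might treat them. It is an illustration under stated assumptions, not a statement of what the spec requires: the "strict" reading treats the Table 118 entries named earlier in this thread (Type, CMapName, CIDSystemInfo) as required for ToUnicode CMaps, while the "lenient" reading treats UseCMap as the only CMap-specific key and requires none of them. The function name `missing_entries` and the dict representation are hypothetical.

```python
# Required entries per Table 118, as listed earlier in this thread.
TABLE_118_REQUIRED = {"Type", "CMapName", "CIDSystemInfo"}

def missing_entries(stream_dict, strict):
    """Return the sorted list of entries missing under the chosen reading.

    strict=True:  Table 118 required entries also apply to ToUnicode CMaps.
    strict=False: nothing beyond the standard stream keys is required
                  (assumption: the "ToUnicode stream dictionary" reading).
    """
    if not strict:
        return []
    return sorted(TABLE_118_REQUIRED - set(stream_dict))

# A dictionary shaped like those reported as typical in this thread
# (hypothetical values).
typical = {"Length": 433, "Filter": "FlateDecode"}

strict_result = missing_entries(typical, strict=True)
lenient_result = missing_entries(typical, strict=False)
print(strict_result)   # ['CIDSystemInfo', 'CMapName', 'Type']
print(lenient_result)  # []
```

Under the strict reading, nearly all of the files surveyed above would be flagged, which is the practical argument for clarifying the wording.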