pdf-association / pdf-issues

Industry-based resolutions for issues and errata reported against any PDF-related specification
https://pdf-issues.pdfa.org/
Definition of octal codes in literal strings related to UTF-16BE encoding #494

Open jmlehton opened 4 days ago

jmlehton commented 4 days ago

We feel that there is a small ambiguity regarding literal strings in ISO 32000-2:2020 (and previous versions). In ch. 7.3.4.2 "Literal strings", Table 3, a single "\ddd" octal code is defined as a "character code". Isn't a "character code" something that maps to a character in the code page in question? For strings encoded with UTF-16BE, a single octal code cannot really be used as a mapping character code (i.e. \ddd does not map to a Unicode character). Of course this can be done with multiple octal codes, but the definition is about a single octal code \ddd. From this, it may be unclear to the reader whether octal codes can be used in a UTF-16BE encoded string with multi-byte characters.

It is true that ch. 7.3.4.2 "Literal strings" also states that any 8-bit value can appear with the octal "notation described". But this can still be read so that "notation described" refers to the definition of an octal code as a mapping character code (i.e. limits it to that), which leads back to the original ambiguity.

In future revisions, we suggest reconsidering or explaining the term "character code" used for octal codes, and adding a short sentence about its usage with Unicode in the case where a single character requires multiple bytes.

petervwyatt commented 4 days ago

I cannot find the use of the phrase "character code" anywhere in 7.3.4 subclauses - and that would be incorrect. But I agree that the language is confusing since the word "character" is used for both the bytes comprising the string in the input PDF as well as what they mean once lexed/de-escaped:

7.3.4.1: "A string object shall consist of a series of zero or more bytes." 7.3.4.2: "A literal string shall be written as an arbitrary number of characters enclosed ..."

The correct terminology should be that "characters" are what comprise the string in "raw PDF" (pre-lexing), but "bytes" are what they represent post-lexing. So an octal code \ddd in a literal string comprises up to 4 "characters" but represents a single "byte" in that string object. The interpretation of those bytes (such as in a specific encoding) depends on type definitions elsewhere in the spec, according to 7.9.2 and Figure 7.
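The characters-versus-bytes distinction described here can be illustrated with a minimal sketch of the de-escaping step. The function name `unescape_literal_string` is hypothetical, and the sketch assumes the input is the body of a literal string with the outer parentheses already removed; it also skips the normalisation of unescaped end-of-line markers, so it is not a complete lexer.

```python
def unescape_literal_string(body: bytes) -> bytes:
    """De-escape the body of a PDF literal string (hypothetical helper).

    Input: the raw "characters" between the outer parentheses.
    Output: the "bytes" of the resulting string object.
    """
    simple = {
        ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09,
        ord("b"): 0x08, ord("f"): 0x0C,
        ord("("): 0x28, ord(")"): 0x29, ord("\\"): 0x5C,
    }
    out = bytearray()
    i = 0
    while i < len(body):
        c = body[i]
        if c != 0x5C:              # not a backslash: the byte stands for itself
            out.append(c)
            i += 1
            continue
        i += 1                     # skip the backslash
        if i >= len(body):
            break
        c = body[i]
        if 0x30 <= c <= 0x37:      # octal escape: 1 to 3 octal digits -> ONE byte
            value = 0
            digits = 0
            while digits < 3 and i < len(body) and 0x30 <= body[i] <= 0x37:
                value = value * 8 + (body[i] - 0x30)
                i += 1
                digits += 1
            out.append(value & 0xFF)   # high-order overflow is ignored
        elif c in simple:
            out.append(simple[c])
            i += 1
        elif c in (0x0D, 0x0A):    # backslash + end-of-line: line continuation
            i += 1
            if c == 0x0D and i < len(body) and body[i] == 0x0A:
                i += 1
        else:                      # unknown escape: the backslash is ignored
            out.append(c)
            i += 1
    return bytes(out)


# Four octal escapes plus "A" yield five bytes; only the later interpretation
# as UTF-16BE (after the FE FF byte order mark) makes this the character "A".
raw = rb"\376\377\000A"
data = unescape_literal_string(raw)
print(data)                          # the bytes of the string object
print(data[2:].decode("utf-16-be"))  # interpretation happens after lexing
```

The lexer never needs to know which encoding the bytes will later be interpreted in, which is the sense in which the lexical form is independent of the character encoding.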

petervwyatt commented 4 days ago

Proposed solution:

The \ddd escape sequence provides a way to represent ~~characters~~ bytes outside the printable ASCII character set.

and

Since any 8-bit value may appear in a string (with proper escaping for REVERSE SOLIDUS (backslash) and unbalanced PARENTHESES) this \ddd notation provides a way to specify ~~characters~~ bytes outside the ASCII character set by using ASCII characters only. However, any 8-bit value may appear in a string, represented either as itself or with the \ddd notation described.

car222222 commented 3 days ago

I think that at least one change is also needed in 7.3.4.1:

The term “literal characters” is also used there (meaning “bytes”), so this probably needs to be changed/clarified too:

As a sequence of literal characters enclosed in parentheses () (using LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS (29h)); see 7.3.4.2, "Literal strings"

Also, maybe the final sentence of 7.3.4.1 could be expanded somewhat, to say that “7.9.1+2 explains the use of such “byte strings” to represent characters in string objects, using various character encodings including multi-byte schemes”.

Currently it is:

Subclause 7.9.2, "String object types" describes the encoding schemes used for the contents of string objects.

petervwyatt commented 3 days ago

I agree 7.3.4.1, 1st bullet should drop the word "literal" - it should just state "characters" so it is consistent with the terminology throughout 7.3.4.2:

  • As a sequence of ~~literal~~ characters enclosed in parentheses () (using LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS (29h)); see 7.3.4.2, "Literal strings"

I think other errata we have already applied sufficiently cover encodings and the fact that the lexical form of a string object is orthogonal to any character encoding in string data - see from this point down https://pdf-issues.pdfa.org/32000-2-2020/clause07.html#H7.9.1

jmlehton commented 2 days ago

I cannot find the use of the phrase "character code" anywhere in 7.3.4 subclauses - and that would be incorrect.

The "character code" definition is in the last row of Table 3 (in ch. 7.3.4.2).

car222222 commented 2 days ago

That definition could be made more precise, and understandable, as follows:

\ddd 8-bit Character code ddd (3 octal digits)

(Assuming I interpreted it correctly 😄 !)

car222222 commented 2 days ago

I edited this last comment to make the following correction: ASCII changed to 8-bit

jmlehton commented 2 days ago

I would suggest:

\ddd Byte code ddd (3 octal digits)

I feel that the word "character" is somewhat problematic here. When octal coding is used for a UTF-16BE encoded string, an octal code \ddd does not map to any character but to a single byte of a multi-byte character, since UTF-16BE has only 2- and 4-byte characters.
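To make this point concrete, here is a small sketch that writes a text string as a literal string using only octal escapes, assuming the usual FE FF byte order mark that identifies UTF-16BE text strings. The helper name `to_octal_literal` is hypothetical and chosen only for illustration.

```python
def to_octal_literal(text: str) -> str:
    """Write `text` as a PDF literal string using only \\ddd escapes.

    Hypothetical helper: encodes as UTF-16BE with a leading FE FF byte order
    mark, then emits every byte as a three-digit octal escape.
    """
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "(" + "".join(f"\\{b:03o}" for b in data) + ")"


# U+20AC (euro sign) is one character but two UTF-16BE bytes (20 AC),
# so it needs two \ddd escapes after the two escapes for the byte order mark.
print(to_octal_literal("\u20ac"))       # (\376\377\040\254)

# U+1F600 is outside the BMP: a surrogate pair, i.e. four bytes, four escapes.
print(to_octal_literal("\U0001F600"))
```

No single escape in the output corresponds to a character; each one is a byte of a 2- or 4-byte code unit sequence.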

petervwyatt commented 2 days ago

Thanks. The proposed fix for Table 3 is quite simple - it's a byte:

\ddd Byte with value ddd in octal

I would NOT add "(3 octal digits)" as that is incorrect - it can be 1-3 octal digits as per normative text further down.
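The 1-to-3-digit rule can be checked with a short sketch. It assumes the input contains only octal escapes and ordinary characters (no `\\`, `\(` or other escapes), and the function name `decode_octal_escapes` is hypothetical.

```python
import re

# A backslash followed by one, two or three octal digits (greedy match).
_OCTAL = re.compile(rb"\\([0-7]{1,3})")


def decode_octal_escapes(body: bytes) -> bytes:
    """Replace each \\d, \\dd or \\ddd escape with the single byte it denotes.

    Simplified sketch: other escape sequences are not handled here.
    """
    return _OCTAL.sub(lambda m: bytes([int(m.group(1), 8) & 0xFF]), body)


print(decode_octal_escapes(rb"\53"))    # b'+'  (two digits)
print(decode_octal_escapes(rb"\053"))   # b'+'  (three digits, same byte)
print(decode_octal_escapes(rb"\0053"))  # b'\x053': byte 05h, then the digit "3"
```

The last case shows why a fourth digit is never consumed: the escape ends after three octal digits and the following character is taken literally.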

car222222 commented 1 day ago

OK with me.

jmlehton commented 1 day ago

Thanks. The proposed fix for Table 3 is quite simple - it's a byte:

\ddd Byte with value ddd in octal

I would NOT add "(3 octal digits)" as that is incorrect - it can be 1-3 octal digits as per normative text further down.

This is great. Thanks. And yes, "(3 octal digits)" would be incorrect.