sg16-unicode / sg16


Clarify the relationship between the literal and execution encodings #72

Open tahonermann opened 3 years ago

tahonermann commented 3 years ago

The relationship between the literal (compile-time) and execution (locale dependent run-time) encodings is not clear in the standard. Intuition argues that the encodings used for the literal and execution encodings must be compatible; that any character encoded using the literal encoding must match its encoding in the execution encoding. But such a relationship is commonly violated. Corentin asked the following questions during discussion of P2297R0 in the 2021-02-24 SG16 telecon. Note that the first case uses a character from the basic source character set, but the second does not.

isalpha('a') // Can this ever return false?
isalpha('é') // Can this ever return false?

In principle, both of these can return false, and in practice, there are cases where even characters of the basic source character set lack representation in the execution encoding or are differently encoded. This happens with Shift-JIS, where U+00A5 YEN SIGN (¥) is substituted for U+005C REVERSE SOLIDUS (\) relative to ASCII, and in various EBCDIC code pages where the following basic source character set members are not part of the "invariant subset of EBCDIC"
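As a minimal sketch of how classification can fail for a character outside the basic source character set, the helper below asks the classic "C" locale whether a single code unit is alphabetic. The function name `is_alpha_in_classic_locale` is hypothetical, and the sketch assumes an ASCII-based implementation in which the byte 0xE9 stands for é (as it does in ISO 8859-1 and Windows-1252).

```cpp
#include <locale>

// Hypothetical helper: classify one code unit with the classic "C" locale,
// independent of whatever global locale is in effect at run time.
inline bool is_alpha_in_classic_locale(char code_unit) {
    // The two-argument std::isalpha from <locale> consults the ctype<char>
    // facet of the given locale rather than the global C locale.
    return std::isalpha(code_unit, std::locale::classic());
}
```

Under those assumptions, `is_alpha_in_classic_locale('a')` is true while `is_alpha_in_classic_locale('\xE9')` is false, because the classic locale treats only the 26 unaccented Latin letters in each case as alphabetic.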

Further complications arise when a program compiled with a literal encoding such as UTF-8 is run in an environment with, for example, a Windows-1252 execution encoding. In this case, lead and trail code units from the UTF-8 encoding are perceived to be individual characters. The situation is even worse for encodings like Shift-JIS, where a trailing code unit sequence may contain code units that themselves match the encoding of a single-byte encoded character.
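The two mismatches described above can be sketched at the byte level. This is an illustration only: the helper names are hypothetical, the byte sequences are written with hex escapes so the result does not depend on the source file encoding, and an ASCII-based implementation is assumed. The UTF-8 encoding of é is 0xC3 0xA9, and the Shift-JIS encoding of ソ (U+30BD) is 0x83 0x5C, whose trail byte equals the single-byte value 0x5C.

```cpp
#include <cstddef>
#include <string_view>

// Character count when every byte is treated as one character,
// as a single-byte encoding such as Windows-1252 would do.
inline std::size_t count_as_single_byte(std::string_view bytes) {
    return bytes.size();
}

// Character count when the bytes are taken as well-formed UTF-8:
// count every byte that is not a continuation byte (bit pattern 10xxxxxx).
inline std::size_t count_as_utf8(std::string_view bytes) {
    std::size_t count = 0;
    for (char c : bytes) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Naive byte-wise search for 0x5C. It cannot tell a real single-byte
// character from the trail byte of a double-byte Shift-JIS character.
inline bool has_byte_0x5c(std::string_view bytes) {
    return bytes.find('\x5C') != std::string_view::npos;
}
```

For the bytes `"\xC3\xA9"`, `count_as_utf8` reports one character while `count_as_single_byte` reports two, and `has_byte_0x5c("\x83\x5C")` reports a match even though the Shift-JIS text contains no single-byte 0x5C character.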

Possibilities for addressing this include:

ThePhD commented 3 years ago

54's paper addresses this partially by actually separating the two in C. The paper notes that implementations do not actually keep these two in lock-step, and that it's technically impossible to do so without a severe performance penalty, space penalty, or both.

My suggestion is the two should never be conflated. They must be separate. Then, the consequences of the two fall out naturally from not having identical encodings: if they're not the same, then the function "obviously" doesn't work, because the encodings aren't the same.

cor3ntin commented 3 years ago

FYI the paper I plan to write is your last bullet point

stating that a program that is run in an environment where the execution encoding is not compatible with the literal encoding exhibits undefined behavior if a string literal encodes a non-compatible character and that string is passed to an execution-sensitive function.

Well, not exactly; rather, if a string literal doesn't represent the same sequence of abstract characters when interpreted with either the literal encoding or the execution encoding, then UB.
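For one concrete pair of encodings, that criterion can be approximated with a simple byte check. The sketch below is not proposed wording: it assumes a UTF-8 literal encoding and a Windows-1252 execution encoding, which agree on the byte values 0x00 through 0x7F, and the function name `is_ascii_only` is hypothetical.

```cpp
#include <algorithm>
#include <string_view>

// Returns true when every byte is below 0x80. For the assumed pair
// (UTF-8 literal encoding, Windows-1252 execution encoding) such a string
// denotes the same abstract characters under both interpretations.
inline bool is_ascii_only(std::string_view literal_bytes) {
    return std::all_of(literal_bytes.begin(), literal_bytes.end(), [](char c) {
        return static_cast<unsigned char>(c) < 0x80;
    });
}
```

With this check, `is_ascii_only("abc")` is true, whereas the UTF-8 bytes for é, `"\xC3\xA9"`, fail it. Other encoding pairs would need a different notion of the common subset.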

Stay tuned for the paper. Of course I am eager to see if that gathers consensus, but if it does, the wording is a bit challenging because most of the impacted library is in C, and neither C nor C++ is super explicit about listing which functions are "character functions" or what that entails.