sg16-unicode / sg16

Translation phase 5 converts `u8`, `u`, and `U` literals to the wrong character encoding #46

Status: Closed (tahonermann closed this issue 3 years ago)

tahonermann commented 5 years ago

[lex.phases]p5 states:

> Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set ([lex.ccon], [lex.string]); if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.

This wording is incorrect for `u8`, `u`, and `U` literals, and a little hand-wavy for wide literals, because it says every literal is converted to the execution character set. `u8`, `u`, and `U` literals should instead be converted to UTF-8, UTF-16, and UTF-32, respectively, and wide literals to the wide execution character set.
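To make the distinction concrete, here is a minimal sketch of what implementations do in practice. The helper names `code_units` and `unit` are made up for illustration and are not part of any proposal. The checks assume a C++17 or later compiler and a 16-bit `char16_t`.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// Hypothetical helper: the i-th code unit as an unsigned value.
// The mask keeps a signed `char` (the type of u8 literals before C++20)
// from sign-extending.
template <typename CharT, std::size_t N>
constexpr std::uint32_t unit(const CharT (&s)[N], std::size_t i) {
    return static_cast<std::uint32_t>(s[i]) &
           (sizeof(CharT) == 1 ? 0xFFu : sizeof(CharT) == 2 ? 0xFFFFu : 0xFFFFFFFFu);
}

// U+00E9 in a u8 literal is the two-byte UTF-8 sequence C3 A9.
static_assert(code_units(u8"\u00E9") == 2, "UTF-8 uses two code units for U+00E9");
static_assert(unit(u8"\u00E9", 0) == 0xC3, "first UTF-8 code unit");
static_assert(unit(u8"\u00E9", 1) == 0xA9, "second UTF-8 code unit");

// U+1F600 in a u literal is the UTF-16 surrogate pair D83D DE00.
static_assert(code_units(u"\U0001F600") == 2, "UTF-16 uses a surrogate pair");
static_assert(unit(u"\U0001F600", 0) == 0xD83D, "high surrogate");
static_assert(unit(u"\U0001F600", 1) == 0xDE00, "low surrogate");

// U+1F600 in a U literal is a single UTF-32 code unit.
static_assert(code_units(U"\U0001F600") == 1, "UTF-32 uses one code unit");
static_assert(unit(U"\U0001F600", 0) == 0x1F600, "UTF-32 code unit equals the code point");

// By contrast, the length of the ordinary literal "\u00E9" depends on the
// execution character set: for example, 2 code units under UTF-8 but 1 under
// Latin-1 (selectable in GCC with -fexec-charset), so it is not asserted here.
```

These assertions hold whatever execution character set is selected. That is the behavior the wording should describe for `u8`, `u`, and `U` literals.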

@steve-downey requested that a core issue be filed on the core mailing list (http://lists.isocpp.org/core/2019/03/5770.php). The new issue hasn't been opened yet, but it is expected to be opened before the 2019 Cologne meeting.

tahonermann commented 3 years ago

This was addressed by the adoption of P2029R4 for C++23. The wording in the current draft for [lex.phases]p5 now states:

> Each basic-c-char, basic-s-char, and r-char in a character-literal or a string-literal, as well as each escape-sequence and universal-character-name in a character-literal or a non-raw string literal, is encoded in the literal's associated character encoding as specified in [lex.ccon] and [lex.string].

And a core issue never did get opened for this.