sg16-unicode / sg16

String literal concatenation in translation phase 5 and 6 contradicts [lex.string] #47

Closed by tahonermann 2 years ago

tahonermann commented 5 years ago

Translation phase 5 states in [lex.phases]p5:

Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set ([lex.ccon], [lex.string]); if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.

Translation phase 6 states in [lex.phases]p6:

Adjacent string literal tokens are concatenated.

The order of these operations contradicts [lex.string]p12 which states:

... If one string-literal has no encoding-prefix, it is treated as a string-literal of the same encoding-prefix as the other operand. ...

The following example is included in table 9:

| Source | Means |
| --- | --- |
| u"a" "b" | u"ab" |

The wording in translation phases 5 and 6 does not produce the intended result in the above table since "b" will have already been converted to the execution character set prior to concatenation.
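To make the intended [lex.string] behavior concrete, here is a minimal sketch of the table 9 example as compile-time checks. The helper `same_code_units` and the variable names are hypothetical, introduced only for illustration, and the second case (a universal-character-name, `\u03c0`, in the unprefixed literal) is my own addition rather than the example behind the Compiler Explorer link. It assumes C++14 or later and a compiler that concatenates before converting.

```cpp
#include <cstddef>

// Table 9 example: the unprefixed "b" adopts the u prefix of its neighbor,
// so the concatenation is intended to mean u"ab".
constexpr char16_t concatenated[] = u"a" "b";
constexpr char16_t expected[] = u"ab";

// Illustrative extra case (assumption, not from the issue): a character that
// many single-byte execution character sets cannot represent.
constexpr char16_t with_ucn[] = u"a" "\u03c0";
constexpr char16_t with_ucn_expected[] = u"a\u03c0";

// Hypothetical helper: true if both arrays hold the same code units,
// including the terminating null.
template <std::size_t N, std::size_t M>
constexpr bool same_code_units(const char16_t (&lhs)[N],
                               const char16_t (&rhs)[M]) {
    if (N != M) return false;
    for (std::size_t i = 0; i < N; ++i) {
        if (lhs[i] != rhs[i]) return false;
    }
    return true;
}

// The ASCII case is checked at compile time.
static_assert(same_code_units(concatenated, expected),
              "u\"a\" \"b\" should mean u\"ab\"");

// The UCN case is exposed as a function rather than a static_assert, since a
// compiler that converts the unprefixed literal first may return false here.
constexpr bool ucn_case_matches() {
    return same_code_units(with_ucn, with_ucn_expected);
}
```

If `ucn_case_matches()` returns false, the compiler has most likely converted the unprefixed literal to the execution character set before concatenating, which is the ordering that phases 5 and 6 literally describe.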

Clang and gcc produce the expected results according to [lex.string]p12. However, MSVC follows translation phases 5 and 6: it first converts the ordinary string literal to the execution character set (substituting '?' for any character that is not representable), and then concatenates the result encoded as UTF-16.

Additionally, when MSVC is invoked with the /utf-8 option, the UTF-8 encoded code units in the ordinary string literal are individually interpreted according to the active code page (generally Windows-1252) and then re-encoded as UTF-16. This appears to be a defect in the MSVC compiler.
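The effect described above can be simulated in portable code. The following is only a sketch of the described behavior, not MSVC's implementation. The function name `misdecode_as_cp1252` is hypothetical, and the sketch handles only ASCII and bytes 0xA0 to 0xFF, where Windows-1252 coincides with ISO-8859-1. Bytes 0x80 to 0x9F are mapped to U+FFFD as a placeholder, which is an assumption of the sketch.

```cpp
#include <string>

// Treats each UTF-8 code unit as a separate code-page character and widens
// it to a UTF-16 code unit, mimicking the misinterpretation described above.
inline std::u16string misdecode_as_cp1252(const std::string& utf8_bytes) {
    std::u16string out;
    for (unsigned char byte : utf8_bytes) {
        if (byte >= 0x80 && byte <= 0x9F) {
            // Windows-1252 differs from ISO-8859-1 in this range; the sketch
            // does not model it and emits a placeholder instead.
            out.push_back(u'\uFFFD');
        } else {
            // For ASCII and 0xA0-0xFF the code point equals the byte value.
            out.push_back(static_cast<char16_t>(byte));
        }
    }
    return out;
}
```

For example, U+00E9 is encoded in UTF-8 as the two bytes 0xC3 0xA9. Passing `"\xC3\xA9"` to the function yields the two UTF-16 code units U+00C3 and U+00A9 instead of the single code unit U+00E9.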

The above differences can be observed with Compiler Explorer: https://msvc.godbolt.org/z/vBdEXq

@steve-downey requested that a core issue be filed on the core mailing list (http://lists.isocpp.org/core/2019/03/5770.php). The new issue hasn't been opened yet, but is expected to be opened prior to the 2019 Cologne meeting.

tahonermann commented 3 years ago

This is now tracked by CWG 2455.

If adopted, P2314 will address this.

dimztimz commented 2 years ago

The paper P2314 got accepted, so what happens now with the CWG issue? Will it be pushed toward C++20?

jensmaurer commented 2 years ago

CWG 2455 should be closed. I've sent e-mail to Mike Miller.

tahonermann commented 2 years ago

Thank you, @dimztimz, for helping to keep our issue queue clean!

Closing this as resolved by the adoption of P2314 for C++23.