sg16-unicode / sg16

SG16 overview and general information

WG21 P1854: Source to Execution encoding conversion should not lead to loss of information #50

Open cor3ntin opened 5 years ago

cor3ntin commented 5 years ago

When converting a string literal or wide string literal (or character literal) from the source encoding to the execution encoding, the handling of non-representable characters is implementation-defined, which can lead to loss of data.

In practice, most compilers make that ill-formed: https://godbolt.org/z/SlhCdr

The standard should match existing practice and not encourage implementations to modify the meaning of string literals.
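To make the scenario concrete, here is a minimal sketch of the kind of literal at issue. It is not the code behind the godbolt link; the names `euro` and `euro_size` are made up for illustration, and the mention of GCC's `-fexec-charset=` option is an assumption about how one might select a non-UTF-8 execution encoding.

```cpp
#include <cstddef>

// U+20AC EURO SIGN, written as a universal-character-name so the example
// does not depend on the encoding of the source file itself.
//
// With a UTF-8 execution encoding (the GCC and Clang default), the literal
// holds the three code units E2 82 AC plus the terminating NUL.
//
// With an execution encoding such as ISO-8859-1 (for example, selected with
// GCC's -fexec-charset=ISO-8859-1; an assumption about the setup), U+20AC
// has no corresponding member. The implementation then has to either reject
// the literal or silently pick some other character.
inline constexpr char euro[] = "\u20AC";
inline constexpr std::size_t euro_size = sizeof(euro);
```

The point of the sketch is that the same source text can yield different bytes, or no program at all, depending on a choice the standard currently leaves to the implementation.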

http://eel.is/c++draft/lex#phases-1.5

Each basic source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set ([lex.ccon], [lex.string]); if there is no corresponding member, ~~it is converted to an implementation-defined member other than the null (wide) character~~ **the program is ill-formed**.

Note: the above paragraph needs further modifications as per #46

tahonermann commented 5 years ago

> In practice, most compilers make that ill-formed

The diagnostic for that example is awful. Instead of stating that the character lacks representation in the presumed execution encoding, it states that it is "invalid" (whatever that means), or is an "incomplete multibyte or wide character". I have no idea what an "incomplete wide character" might be.

Substituting a \u1234 escape sequence for the non-representable character produces a similar diagnostic.

> The standard should match existing practice and not encourage implementations to modify the meaning of string literals

The linked godbolt example only demonstrates behavior for a single compiler. The proposed change is not existing practice for some other compilers. In particular, the Microsoft compiler will silently substitute a replacement character. The claim that the proposed change reflects the behavior of "most compilers" is unsubstantiated.

That being said, I think I can get behind this proposed change. Implementations can always offer an extension to substitute replacement characters in the (very few) cases where that is desirable.
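The two behaviors discussed above (silent substitution versus rejection) can be sketched as a runtime simulation. This is only an illustration under stated assumptions: the `OnUnrepresentable` enum and the `to_latin1` function are hypothetical names, ISO-8859-1 stands in for an arbitrary narrow execution encoding, `'?'` stands in for whatever replacement character an implementation might choose, and an empty `std::optional` stands in for "the program is ill-formed". No compiler is claimed to work this way internally.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical policies for a code point that has no counterpart in the
// target encoding. The names are illustrative, not taken from any compiler.
enum class OnUnrepresentable { Substitute, Reject };

// Convert code points to ISO-8859-1, which can represent only U+0000..U+00FF.
// Returns std::nullopt when the policy is Reject and a code point does not
// fit (standing in for "the program is ill-formed").
std::optional<std::string> to_latin1(std::u32string_view text,
                                     OnUnrepresentable policy) {
    std::string out;
    out.reserve(text.size());
    for (char32_t cp : text) {
        if (cp <= 0xFF) {
            // Representable: ISO-8859-1 maps each code point to the same byte value.
            out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
        } else if (policy == OnUnrepresentable::Substitute) {
            out.push_back('?');  // the original code point is lost here
        } else {
            return std::nullopt;  // refuse rather than change the text
        }
    }
    return out;
}
```

For the input `U"a\u20ACb"`, the `Substitute` policy yields `"a?b"`, so the euro sign is gone without any signal to the author, while the `Reject` policy yields an empty optional and forces the problem to be noticed.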

cor3ntin commented 5 years ago

TBH, I wasn't able to make Clang accept anything but UTF-8 as the input encoding.

tahonermann commented 5 years ago

LLVM Clang (and common derivatives like Apple Clang and Android Clang) only support UTF-8. There are derivatives that do support other encodings though (e.g., the z/OS Clang ports).

tahonermann commented 4 years ago

P1854 was submitted with a proposed fix for this issue and was discussed by SG16 in Belfast. This is now waiting on an updated paper.

tahonermann commented 4 years ago

This issue is now tracked by https://github.com/cplusplus/papers/issues/608.

peter-b commented 2 years ago

@cor3ntin I don't think we ever polled Proposal 7 from P2178, and there doesn't seem to be a current paper that plugs this silent data loss hole. Do we need a new paper, or a new revision of P1854?

cor3ntin commented 2 years ago

P1854 will be revised
