sg16-unicode / sg16

SG16 overview and general information

`u8`, `u`, and `U` literals should always be Unicode #49

Closed cor3ntin closed 4 years ago

cor3ntin commented 5 years ago

UTF literals are converted from the source character set to the internal encoding, then to UTF-8. If the source is not interpreted as UTF-8, this leads to both confusion and mojibake, and ultimately to non-portable code.

Instead, u8, u, and U literals should only appear in UTF-encoded source and not be re-encoded in any way other than UTF-X to UTF-Y (which is lossless), so that if a source file is UTF-*, the representation of a u8 literal is bitwise identical in the source and execution encodings.

Proposed wording (http://eel.is/c++draft/lex#string-14): UTF-8 literals, UTF-16 literals, and UTF-32 literals can only appear in a source or header file if all code points defined in ISO/IEC 10646 have a representation in the set of physical source file characters, unless every element of the code unit array formed by interpreting the string literal has a representation in the basic character set.

Because this is a "breaking" change for u and U, it might be preferable to add something to Annex D.

See an illustration of the issue https://godbolt.org/z/QA4vV2
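To make the failure mode concrete without depending on how any particular compiler reads this text, here is a minimal sketch that simulates it by hand. It assumes the compiler treats a UTF-8 source file as Latin-1 (a stand-in for whatever single-byte encoding it actually picks) and then converts the resulting characters to UTF-8 for a u8 literal. The function names `latin1_to_utf8` and `source_bytes` are made up for this illustration.

```cpp
#include <string>

// Treat every input byte as one Latin-1 (ISO/IEC 8859-1) character and
// encode that character as UTF-8. This mimics a compiler that misreads a
// UTF-8 source file as a single-byte encoding (an assumption made for this
// sketch) and then produces UTF-8 for a u8 literal.
std::string latin1_to_utf8(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            // ASCII is unchanged, which is why ASCII-only code never breaks.
            out.push_back(static_cast<char>(b));
        } else {
            // U+0080..U+00FF take two UTF-8 code units.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}

// The two bytes that a UTF-8 source file stores for U+00E9 (e with acute).
// Written with escapes so this sketch does not depend on its own encoding.
std::string source_bytes() { return "\xC3\xA9"; }
```

Calling `latin1_to_utf8(source_bytes())` yields the four bytes C3 83 C2 A9 instead of the original two, which is the doubled encoding usually seen as mojibake. No diagnostic is involved at any point, so the breakage is silent.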

I think doing something along these lines will reduce the number of misuses of the feature, improve education, and overall put us in a better place to deal with Unicode in source files going forward.

We need to do something equivalent for #48. Yes, there will be a different set of features in UTF-8 files and non-UTF-8 files, but that is because we shouldn't try to put Unicode in non-Unicode files.

tahonermann commented 5 years ago

I'm strongly against this proposal for a number of reasons.

First, it doesn't actually solve the problem. The compiler has to follow a protocol of some sort to determine what encoding to use for a given source file. If it is misled (as occurs in two of the linked Godbolt examples), then bad things will happen regardless.

Second, it is a breaking change with little recourse for addressing it. Requiring programmers to convert all of their source code to a UTF encoding would be a large effort. For non-ASCII based platforms, doing so is a non-starter.

Third, UTF literals are useful even in source files that only use characters from the basic source character set. "\u1234" does not mean the same thing as u8"\u1234".
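The difference between the two literals can be sketched as follows. The helper `raw_bytes` and the two wrapper functions are names invented for this example. The only guarantee relied on is that a u8 literal is UTF-8 encoded; what the ordinary literal contains depends on the implementation's execution character set, so the sketch makes no claim about its bytes.

```cpp
#include <cstddef>
#include <string>

// Copy the code units of a string literal, minus the terminating null,
// into a std::string of raw bytes. Works for char and, in C++20, char8_t.
template <typename CharT, std::size_t N>
std::string raw_bytes(const CharT (&lit)[N]) {
    std::string out;
    for (std::size_t i = 0; i + 1 < N; ++i) {
        out.push_back(static_cast<char>(lit[i]));
    }
    return out;
}

// u8 literal: always the UTF-8 encoding of U+1234, i.e. E1 88 B4.
std::string utf8_form() { return raw_bytes(u8"\u1234"); }

// Ordinary literal: encoded in the execution character set, which is
// implementation-defined. It may match utf8_form() or differ from it.
std::string ordinary_form() { return raw_bytes("\u1234"); }
```

Both literals are spelled using only basic source characters, so this file compiles the same way regardless of its source encoding. Only `utf8_form()` has a portable result.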

In my opinion, the primary factors that lead to the types of problems this proposal purports to address are:

  1. Misunderstanding what encoding a given compiler will use for a given source file.
  2. Differences in default behaviors across compilers.
  3. An inability for source files to indicate their encoding to the compiler (other than via a Unicode BOM).

I think we should focus on the third factor listed there. A standard way to specify the encoding of a source file (as is possible with the IBM xlC compiler, and in Python and HTML) could actually lead to greater portability of source files without having to migrate the world all at one time.
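For reference, the one in-band signal mentioned in the list above, a Unicode BOM, is cheap to check for. The following is a minimal sketch of such a check for UTF-8 only; `has_utf8_bom` is a hypothetical helper and does not describe how any particular compiler implements its detection.

```cpp
#include <string_view>

// Return true if the file contents begin with the UTF-8 byte order mark
// (EF BB BF). Hypothetical helper for illustration only.
bool has_utf8_bom(std::string_view contents) {
    return contents.size() >= 3 &&
           static_cast<unsigned char>(contents[0]) == 0xEF &&
           static_cast<unsigned char>(contents[1]) == 0xBB &&
           static_cast<unsigned char>(contents[2]) == 0xBF;
}
```

A missing BOM says nothing about the encoding, which is why a BOM alone cannot replace an explicit declaration of the kind suggested here.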

cor3ntin commented 5 years ago

> First, it doesn't actually solve the problem. The compiler has to follow a protocol of some sort to determine what encoding to use for a given source file. If it is misled (as occurs in two of the linked Godbolt examples), then bad things will happen regardless.

The Godbolt example is reflective of reality. By default GCC will assume UTF-8; MSVC will not. Code ported from GCC to MSVC will therefore silently break.

I fully agree with you that we might need a better protocol.

> Second, it is a breaking change with little recourse for addressing it. Requiring programmers to convert all of their source code to a UTF encoding would be a large effort. For non-ASCII based platforms, doing so is a non-starter.

I'd be happy with deprecating it. A warning would let people know they may have mojibake.

> Third, UTF literals are useful even in source files that only use characters from the basic source character set. "\u1234" does not mean the same thing as u8"\u1234".

Agreed

> In my opinion, the primary factors that lead to the types of problems this proposal purports to address are:

> Misunderstanding what encoding a given compiler will use for a given source file.

Agreed. But with encoding being a property of the source file, there are few solutions besides assuming UTF-8 or adding #pragma utf8 in every single file.

> Differences in default behaviors across compilers.

Agreed, there is little we can do about that.

> An inability for source files to indicate their encoding to the compiler (other than via a Unicode BOM).

Agreed, but UTF-8 is the right default for most people.

cor3ntin commented 5 years ago

MSVC might be open to having /utf-8 as the default someday; I should ask more people about that :smile:

tahonermann commented 5 years ago

> Code ported from GCC to MSVC will therefore silently break.

Only if it contains non-ASCII characters of course.

> I'd be happy with deprecating it

I would not be :)

> But with encoding being a property of the source file, there are few solutions besides assuming UTF-8 or adding #pragma utf8 in every single file.

I agree, but unfortunately we're not in a position to do a clean room design. We have to consider the migration path.

> MSVC might be open to having /utf-8 as the default someday

I would put good money against that happening :)

tahonermann commented 4 years ago

Given our discussion of P1879R0 in Belfast, I'm going to proceed with closing this issue. If a different solution than the one proposed in P1879R0 is identified, then we can reopen this issue.