sg16-unicode / sg16


Requiring wchar_t to represent all members of the execution wide character set does not match existing practice #9

Closed by tahonermann 2 years ago

tahonermann commented 6 years ago

5.13.3 [lex.ccon] p6 states: (http://eel.is/c++draft/lex.ccon#6)

> ... [ Note: The type wchar_t is able to represent all members of the execution wide-character set (see [basic.fundamental]). — end note ]

6.7.1 [basic.fundamental] p5 states: (http://eel.is/c++draft/basic.fundamental#5)

> Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales. ...

However, on Windows, wchar_t is 16-bit and unable to represent all members of the execution wide character set. The standard should be updated to reflect existing practice.

cubbimew commented 6 years ago

Pretty sure the execution wide character set on Windows is the set formerly known as "UCS-2".

steve-downey commented 6 years ago

In reality, not in ages. As I understand it, UCS-2 went away in Windows at least 10 years ago. And if it does still exist, it should not be standardized where people will expect working Unicode.

tahonermann commented 6 years ago

> Pretty sure the execution wide character set on Windows is the set formerly known as "UCS-2".

Microsoft switched from UCS-2 to UTF-16 with the Windows 2000 release. Of course, by then, it was too late to change the size of wchar_t.

MSVC encodes characters outside the BMP using surrogate code points as would be expected. For example, the following code is accepted with Visual Studio 2017 (with the /std:c++latest option):

static_assert(L"\U00010000"[0] == 0xD800);
static_assert(L"\U00010000"[1] == 0xDC00);

I guess an argument could be made that U+10000 is not technically a member of the execution wide character set because it can't be represented in a single wchar_t value. But I think most people would find that argument inconsistent with what is generally considered a character.
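To make the surrogate arithmetic concrete, here is a minimal sketch of the UTF-16 encoding step. The names `Utf16Pair` and `encode_utf16` are hypothetical, and the last check uses a char16_t literal (assuming, as mainstream compilers do, that such literals are UTF-16) so that nothing here depends on the width of wchar_t.

```cpp
#include <cstddef>

// Hypothetical helper type: up to two UTF-16 code units for one code point.
struct Utf16Pair {
    char16_t units[2];
    int count;  // 1 for a BMP code point, 2 for a surrogate pair
};

// Encode a code point as UTF-16. The input is assumed to be a valid
// Unicode scalar value (<= 0x10FFFF and not itself a surrogate).
constexpr Utf16Pair encode_utf16(char32_t cp) {
    if (cp < 0x10000) {
        return {{static_cast<char16_t>(cp), 0}, 1};
    }
    const char32_t v = cp - 0x10000;  // 20 significant bits remain
    return {{static_cast<char16_t>(0xD800 + (v >> 10)),     // high surrogate
             static_cast<char16_t>(0xDC00 + (v & 0x3FF))},  // low surrogate
            2};
}

// U+10000 is the first code point that needs two code units.
static_assert(encode_utf16(0x10000).units[0] == 0xD800, "high surrogate");
static_assert(encode_utf16(0x10000).units[1] == 0xDC00, "low surrogate");
// The same values appear in a char16_t string literal.
static_assert(u"\U00010000"[0] == 0xD800, "literal matches the helper");
```

With a 16-bit wchar_t and a UTF-16 wide encoding, the wide literal in the example above would be expected to contain the same two code units.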

cubbimew commented 6 years ago

Yes, encoding of wide string literals and the recently-added filesystem paths is UTF-16, but wide char literals and the codecvts aren't. I like the idea of standardizing existing practice (even if I don't like the practice), but how different would [lex.ccon]/6 have to become to make it okay for L'\U0001F34C' to have the value 0xd83c?

cubbimew commented 6 years ago

...perhaps "The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character encoding, provided it is representable with a single code unit in that encoding"

dimztimz commented 6 years ago

Windows does tricks to conform to the standard. Whenever you use C or C++ standard library stuff (wprintf, wscanf, setlocale, std::locale, etc.), wchar_t strings are UCS-2 strings, and when you use WinAPI, wchar_t strings are UTF-16 strings.

All supported multibyte locales on Windows with setlocale/std::locale can be mapped to UCS-2. Because wchar_t is not UTF-32, Windows does not support setting UTF-8 via the standard library locales.

tahonermann commented 6 years ago

> Yes, encoding of wide string literals and the recently-added filesystem paths is UTF-16, but wide char literals and the codecvts aren't

I'm not sure I'm following here. I don't think it makes much sense to think of wide character literals as having any particular encoding since they can only produce a single code unit. MSVC behaves a bit oddly (as you noticed) by mapping code points outside the BMP to the first code unit of their encoded representation. For example, MSVC accepts the following:

static_assert(L'\U00010000' == 0xD800);

I wouldn't be surprised if this is just an artifact of the implementation and not intentional behavior.

With regard to codecvt, I presume you are under the impression that codecvt<wchar_t, char, mbstate_t> converts between the execution character set and UCS-2? I haven't tested, but I would be quite surprised if that were the case.

> I like the idea of standardizing existing practice (even if I don't like the practice), but how different would [lex.ccon]/6 have to become to make it okay for L'\U0001F34C' to have the value 0xd83c?

I don't think it needs to be much different. I think we can borrow and slightly modify [lex.ccon]/3 for this purpose (see below).

> ...perhaps "The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character encoding, provided it is representable with a single code unit in that encoding"

I suggest replacing [lex.ccon]/6 with:

"A character literal that begins with the letter L, such as L'z', is a wide-character literal. A wide-character literal has type wchar_t. The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set provided that the c-char value is representable in the execution wide-character set and is representable with a single code unit. If the c-char is not representable in the execution wide-character set or would require multiple code units, then the value is implementation-defined. The value of a wide-character literal containing multiple c-chars is implementation-defined."

Note that updates will still be needed for [basic.fundamental]/5, and possibly other places.
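As a rough illustration of the "representable with a single code unit" condition in the suggested wording, the sketch below uses a hypothetical predicate, `fits_single_wide_unit`. It assumes the wide encoding is UTF-16 when wchar_t is 16 bits wide and UTF-32 when it is wider; neither assumption is guaranteed by the standard.

```cpp
// Hypothetical predicate: would code point `cp` occupy exactly one wchar_t
// code unit? Assumes UTF-16 for a 16-bit wchar_t and UTF-32 otherwise.
constexpr bool fits_single_wide_unit(char32_t cp) {
    return sizeof(wchar_t) >= 4 ? cp <= 0x10FFFF : cp <= 0xFFFF;
}

// 'z' fits in one code unit under either assumption.
static_assert(fits_single_wide_unit(U'z'), "BMP characters always fit");

// U+1F34C fits only when wchar_t is wide enough; otherwise the suggested
// wording would make the value of L'\U0001F34C' implementation-defined.
static_assert(fits_single_wide_unit(0x1F34C) == (sizeof(wchar_t) >= 4),
              "non-BMP characters need a 32-bit wchar_t");
```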

tahonermann commented 6 years ago

> Windows does tricks to conform to the standard. Whenever you use C or C++ standard library stuff (wprintf, wscanf, setlocale, std::locale, etc.), wchar_t strings are UCS-2 strings ...

Can you provide some evidence for that claim? It doesn't match my understanding.

cubbimew commented 6 years ago

MS CRT's state notwithstanding, this raises a point about the specification of the C library's APIs. In order to support wchar_t representing a code unit rather than a code point, mbrtowc and wcrtomb would have to be modified to match mbrtoc16 and (after a post-C11 defect report) c16rtomb.

MSDN agrees: mbrtoc16 lists the -3 return code (and actually works as expected in my tests), mbrtowc doesn't.
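For reference, this is roughly how a caller has to drive mbrtoc16 in order to receive both halves of a surrogate pair. It is a sketch only: `to_char16_units` is a hypothetical helper, and what it produces for non-ASCII input depends on the C library and on the current locale's multibyte encoding.

```cpp
#include <cstddef>
#include <cstring>
#include <cuchar>
#include <cwchar>
#include <string>

// Hypothetical helper: decode a null-terminated multibyte string (in the
// current C locale's encoding) into char16_t code units.
std::u16string to_char16_units(const char* s) {
    std::u16string out;
    std::mbstate_t state{};
    std::size_t left = std::strlen(s) + 1;  // include the terminator
    while (left > 0) {
        char16_t unit = 0;
        const std::size_t r = std::mbrtoc16(&unit, s, left, &state);
        if (r == static_cast<std::size_t>(-1) ||
            r == static_cast<std::size_t>(-2)) {
            break;  // invalid or incomplete sequence: stop converting
        }
        if (r == static_cast<std::size_t>(-3)) {
            // Second code unit of a pair: stored in `unit`, no input consumed.
            out.push_back(unit);
            continue;
        }
        if (r == 0) {
            break;  // the terminating null character was converted
        }
        out.push_back(unit);  // first (or only) code unit of this character
        s += r;
        left -= r;
    }
    return out;
}
```

mbrtowc has no counterpart to the -3 case, which is why a single call cannot hand back a second code unit without a specification change.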

dimztimz commented 6 years ago

It's all messed up on MSDN, this is as close as I can get https://msdn.microsoft.com/en-us/library/x99tb11d.aspx

> The set of available locale names, languages, country/region codes, and code pages includes all those supported by the Windows NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8.

This means the mb*to*w functions won't convert UTF-8.

https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072(v=vs.85).aspx

This one does. They both take char* and output into wchar_t*.

tahonermann commented 6 years ago

> MS CRT's state notwithstanding, this raises a point about the specification of the C library's APIs. In order to support wchar_t representing a code unit rather than a code point, mbrtowc and wcrtomb would have to be modified to match mbrtoc16 and (after a post-C11 defect report) c16rtomb.

I agree that the specification of mbrtowc and wcrtomb (and other wide character related functions) will require updates. However, I think the goal would be to update them to specify implementation-defined behavior if a code point would require multiple code units, rather than updating them to match mbrtoc16 and c16rtomb. I don't think we should try to impose behavioral changes on existing implementations (even if the behavior is broken).

Note that Microsoft currently does not support using setlocale to specify UTF-7 or UTF-8 [1], so provoking one of these functions such that a surrogate pair would be produced would require using a locale with a non-Unicode encoding that has characters that are mapped outside the BMP. It would be interesting to test what actually happens in this case.

[1]: https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale

tahonermann commented 6 years ago

And I now see that I basically restated some things that @dimztimz already stated. Sorry for the redundancy!

dimztimz commented 6 years ago

Trying to redefine wide strings as variable-length encoded strings will most likely fail; it breaks too much. I'd vote against such a feature.

Instead, let's focus on defining char16_t and char32_t right: first the language features (i.e. impose UTF-16 and UTF-32), and then the library features.

tahonermann commented 6 years ago

> Trying to redefine wide strings as variable-length encoded strings will most likely fail; it breaks too much. I'd vote against such a feature.

The goal is to update the standard to reflect actual existing practice, not to change actual behavior. The updates should not require any implementors to change behavior, nor break any existing code.

> Instead, let's focus on defining char16_t and char32_t right: first the language features (i.e. impose UTF-16 and UTF-32), and then the library features.

I don't see a need for a dependency relationship here. Martinho submitted P1041R0 for the pre-Rapperswil mailing to mandate use of UTF-16/UTF-32 for char16_t/char32_t. See issue #6. I agree we'll need to expand library support for char16_t/char32_t.

cubbimew commented 6 years ago

> I agree we'll need to expand library support for char16_t/char32_t.

This was attempted by n2035 in 2006 and only a tiny part of it passed through WG21 in Portland:

char_traits: 12 0
iostream: 1 9
fstream: 3 4
sstream: 2 1
facets (excluding codecvt): 3 4
codecvt: 11 0
regex: 2 7

(votes also listed in the next revision, n2207)

tahonermann commented 6 years ago

> This was attempted by n2035 in 2006 and only a tiny part of it passed through WG21 in Portland:

Thanks for that reference. What was proposed may not be what we would want to propose now. I meant that, generally, we need to expand support for char16_t/char32_t.

tahonermann commented 4 years ago

I've started on a paper to address this, so assigning myself to it.

tahonermann commented 3 years ago

Removed myself as an assignee since Corentin now has a draft paper (D2460R0) to address this issue.

tahonermann commented 2 years ago

I am closing this issue as resolved by the adoption of P2460R2 for C++23 despite remaining issues. The paper as adopted relaxes the restriction that wchar_t be able to hold all members of the character set associated with the wide literal encoding, but does not relax that restriction on the (run-time locale sensitive) execution wide-character set used by the standard library. The adopted change matches existing practice as demonstrated by Microsoft's implementation. Further work to relax restrictions for the standard library awaits a proposal with an acceptable migration plan.
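A small compile-time check can show what the relaxed rule permits in practice. This is a sketch under an assumption not stated in the paper text quoted here: that the wide literal encoding is UTF-16 when wchar_t is 16 bits wide and UTF-32 otherwise. The names `wide_units` and `kExpectedUnits` are hypothetical.

```cpp
#include <cstddef>

// Number of wchar_t code units in a wide string literal, excluding the
// terminating null.
template <std::size_t N>
constexpr std::size_t wide_units(const wchar_t (&)[N]) {
    return N - 1;
}

// Assumed: UTF-16 wide literals for a 16-bit wchar_t, UTF-32 otherwise.
constexpr std::size_t kExpectedUnits = sizeof(wchar_t) == 2 ? 2 : 1;

// U+10000 takes two code units with a 16-bit wchar_t and one with a 32-bit
// wchar_t; either outcome is accepted once wchar_t no longer has to hold
// every character of the wide literal encoding in a single value.
static_assert(wide_units(L"\U00010000") == kExpectedUnits,
              "unexpected wide literal encoding");
```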