tahonermann / text_view

A C++ concepts and range based character encoding and code point enumeration library
MIT License

Drop wchar_t support #11

Closed lichray closed 7 years ago

lichray commented 8 years ago

Conditionally providing this when __STDC_ISO_10646__ is defined is a bad idea. The non-portable nature of its code point value already proved that it's a bad choice of code point type (it's still relevant to some applications, but where it's relevant happens to be where __STDC_ISO_10646__ is not defined, e.g., on FreeBSD and Windows).

tahonermann commented 8 years ago

Support for wchar_t isn't only provided when __STDC_ISO_10646__ is defined. That macro only guards the definition of the iso_10646_wide_character_encoding encoding. The actual encoding for wide string literals is provided by the execution_wide_character_encoding type alias. When __STDC_ISO_10646__ is defined, that alias would presumably refer to iso_10646_wide_character_encoding; otherwise it is still present and aliases some other implementation-defined encoding.

I agree that wide string literals are non-portable, but text_view still provides some support for working with wide strings in a portable way. For example, consider a multi-byte encoding that allows single byte code unit values to appear as the second byte of a multi-byte sequence. Naively splitting a string based on the intended single byte code unit value would incorrectly split the multi-byte sequence. String splitting implemented with the text_view code point iterators avoids this potential problem.
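To make the splitting hazard above concrete, here is a minimal sketch using Shift-JIS, where the byte 0x5C (ASCII backslash) can also appear as the trail byte of a two-byte character such as 0x83 0x5C. It does not use text_view; `is_sjis_lead_byte`, `naive_split`, and `lead_byte_aware_split` are hypothetical helpers written only for this illustration, and the lead-byte ranges are a simplified assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: true if `b` starts a two-byte Shift-JIS sequence
// (simplified lead-byte ranges 0x81-0x9F and 0xE0-0xFC).
inline bool is_sjis_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Naive split: treats every byte equal to `delim` as a separator,
// including trail bytes that merely share the delimiter's value.
inline std::vector<std::string> naive_split(const std::string& s, char delim) {
    std::vector<std::string> parts(1);
    for (char c : s) {
        if (c == delim) parts.emplace_back();
        else parts.back().push_back(c);
    }
    return parts;
}

// Split that steps over whole characters, so a trail byte is never
// mistaken for the delimiter.
inline std::vector<std::string> lead_byte_aware_split(const std::string& s,
                                                      char delim) {
    std::vector<std::string> parts(1);
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (is_sjis_lead_byte(b) && i + 1 < s.size()) {
            parts.back().append(s, i, 2);  // copy lead and trail byte together
            ++i;                           // skip the trail byte
        } else if (s[i] == delim) {
            parts.emplace_back();
        } else {
            parts.back().push_back(s[i]);
        }
    }
    return parts;
}
```

Given the bytes `dir \ 0x83 0x5C \ file`, the naive version splits inside the two-byte character, while the lead-byte-aware version keeps it whole. An iterator that yields code points rather than code units gives the same protection without hand-written lead-byte checks.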

lichray commented 8 years ago

But supporting execution_wide_character_encoding cannot be called supporting wchar_t either. Either you use locale information to fully support it, or you drop it; otherwise I can't see any usefulness coming out of it.

tahonermann commented 8 years ago

execution_wide_character_encoding refers to the encoding that is used at compile-time by the compiler to encode wide string and character literals; the encoding that is controlled by the '-fwide-exec-charset=' gcc option. Naturally, this encoding may differ from encodings specified by run-time locale settings.

lichray commented 8 years ago

That's not natural. Locale is designed to work with wchar_t, see C 7.29.6, and many applications use such a convention. It creates a library composability issue if such a convention is not maintained. So I claim that relying on locale information is the best you can do; relying on other sources of information to perform automatic conversion is as bad as forking the standard.
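For readers unfamiliar with the convention being referred to, the following is a minimal sketch of a locale-driven conversion using `std::mbsrtowcs`, one of the `<wchar.h>` conversion functions. The helper name `to_wide` is hypothetical; the point is only that the multibyte encoding is whatever the current `LC_CTYPE` locale says it is at the time of the call.

```cpp
#include <cassert>
#include <clocale>
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical helper: converts a multibyte string to a wide string using
// the locale in effect when it is called. Returns an empty string if the
// input is not valid in that locale's multibyte encoding.
inline std::wstring to_wide(const std::string& mb) {
    std::wstring out(mb.size(), L'\0');  // at most one wchar_t per input byte
    std::mbstate_t state = std::mbstate_t();
    const char* src = mb.c_str();
    std::size_t n = std::mbsrtowcs(&out[0], &src, out.size(), &state);
    if (n == static_cast<std::size_t>(-1)) return std::wstring();
    out.resize(n);
    return out;
}
```

A program would typically call `std::setlocale(LC_ALL, "")` once at startup so that `to_wide` follows the user's environment; the same bytes can then convert differently, or fail to convert, under a different locale.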

tahonermann commented 8 years ago

I understand. In C and C++ applications, the wchar_t interfaces operate using the encoding specified by locale settings established at run-time. No disagreement. C++ [lex.charset]p3 is clear that the locale settings govern the interpretation of wchar_t code unit sequences at run-time.

However, that doesn't change the fact that there is an additional encoding involved at compile-time: the encoding used to encode wide string and character literals. This encoding is not based on run-time locale settings (for the application; it may be based on the run-time locale settings in effect for the compiler invocation itself). This is the encoding used by the compiler to translate universal character names appearing in wide string and character literals to the code unit sequence it emits for the literal (which may, of course, encode a replacement character). This encoding may correspond to the basic execution wide character set, or may be an extension of it. It is possible for this encoding to be a superset of the basic execution wide character set and a subset of the (run-time selected) execution wide character set. If it isn't compatible with the (run-time selected) execution wide character set, well, chaos ensues, but that is the status quo.

It may be that the encoding I'm referring to here has limited use. In fact, it may be that it is only useful for doing compile-time constexpr or TMP manipulation of wide string and character literals. However, I do believe the support for this is sound.
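As an illustration of the compile-time use case mentioned above, here is a minimal sketch that inspects wide literals in constant expressions, before any run-time locale exists. `wide_length` and `wide_count` are hypothetical helpers, not part of text_view; they operate on code units only.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units in a wide string, usable at compile time.
// Written recursively so it is also valid under C++11 constexpr rules.
constexpr std::size_t wide_length(const wchar_t* s) {
    return *s == L'\0' ? 0 : 1 + wide_length(s + 1);
}

// Number of code units in `s` equal to `c`.
constexpr std::size_t wide_count(const wchar_t* s, wchar_t c) {
    return *s == L'\0' ? 0 : (*s == c ? 1 : 0) + wide_count(s + 1, c);
}

// Both checks are evaluated by the compiler using the encoding it chose
// for wide literals; no locale has been set at this point.
static_assert(wide_length(L"a/b/c") == 5, "five code units");
static_assert(wide_count(L"a/b/c", L'/') == 2, "two separators");

#ifdef __STDC_ISO_10646__
// When the macro is defined, wchar_t values are ISO 10646 code points,
// so the universal character name maps directly to its code point.
static_assert(L'\u00E9' == 0xE9, "U+00E9 encoded as its code point");
#endif
```

The values checked here are fixed when the translation unit is compiled (for gcc, under the influence of `-fwide-exec-charset=`), which is why they can differ from what a locale-driven conversion produces at run-time.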

lichray commented 8 years ago

The purpose of adding universal character names is not to let you put them in "" or L"". It sometimes works, by accident or for historical reasons, but that should not be a practice that we are promoting. Of course many homemade ways to construct applications are sound, and it's also sound to build a library to serve these needs, but I don't think the standard cares (of course you can put these in a paper and ask in LEWG).

Another way to look at it is that you can put the interface at an acceptable level of genericity, and leave some parts implementation-defined ("implementations may support additional blah blah blah"). For instance, libstdc++ has an iconv codecvt in ext/. It's not mandated by the standard, but there might be people using it.

tahonermann commented 8 years ago

My goal is genericity and sound, consistent behavior. I'm definitely open to controversial support being left as optional or implementation-defined, so long as the interface doesn't preclude such support.

tahonermann commented 7 years ago

Closing this issue. I still intend to support wide string encodings.