tahonermann / text_view

A C++ concepts and range based character encoding and code point enumeration library
MIT License

Drop char support #12

Closed lichray closed 7 years ago

lichray commented 8 years ago

Its code point value is not portable. Traditionally, 8-bit narrow encodings use unsigned char as their internal encoding (and that's also what the C narrow collate functions, e.g. isdigit etc., expect). But this is locale-dependent, and I don't think text_view should support that. (I hate to say it, but iostream itself is good enough for basic locale-dependent encoding and decoding.)

tahonermann commented 8 years ago

The encoding of ordinary string and character literals is implementation-defined and therefore non-portable, but that doesn't mean that the interfaces text_view provides aren't portable for working with whatever encoding the implementation uses.

Copying from my comment in issue #11:

"consider a multi-byte encoding that allows single byte code unit values to appear as the second byte of a multi-byte sequence. Naively splitting a string based on the intended single byte code unit value would incorrectly split the multi-byte sequence. String splitting implemented with the text_view code point iterators avoids this potential problem"

An argument against the above is that one should transcode from an external encoding to an internal one prior to doing such manipulations. While I agree that is good design, the reality is that we have a lot of code that isn't written that way, and there are scenarios where that approach doesn't work particularly well. For example, file names. File names don't have an associated encoding (even on Windows, the implied UTF-16 encoding is not enforced), yet it may be necessary to work with file names and paths using an assumed encoding, where round-tripping through an internal encoding is not an option because the round trip may not be value preserving.
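
To make the splitting example concrete, here is a minimal sketch. It assumes the <text_view> header, the std::experimental namespace, the make_text_view factory, and the character get_code_point() accessor shown in the project README; treat it as an illustration rather than the definitive interface.

```cpp
// Iterate decoded code points rather than raw bytes. Because each
// step decodes a whole code point, a code unit that merely looks
// like '/' inside a multi-byte sequence is never treated as one.
#include <text_view>
using namespace std::experimental;

int main() {
    auto tv = make_text_view<execution_character_encoding>("a/b");
    for (auto ch : tv) {                   // ch is a decoded character
        if (ch.get_code_point() == '/') {  // compare code points, not bytes
            // a real splitter would record the split position here
        }
    }
}
```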

lichray commented 8 years ago

Same issue: without using locale information, char or unsigned char can't be well supported. 'a' may be neither ASCII nor UTF-8 (on a mainframe it's EBCDIC, whereas u'a' is required to be UTF-16).

tahonermann commented 8 years ago

execution_character_encoding refers to the encoding used at compile time by the compiler to encode ordinary string and character literals; the encoding controlled by gcc's '-fexec-charset=' option. Naturally, this encoding may differ from encodings specified by run-time locale settings.
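
For illustration, here is a hypothetical program showing how the stored code units change with that option (the output comments assume gcc with the named charsets available via iconv):

```cpp
// The code units stored in an ordinary literal depend on the
// compile-time execution character set.
#include <cstddef>
#include <cstdio>

int main() {
    const char s[] = "\u00e9";  // LATIN SMALL LETTER E WITH ACUTE
    for (std::size_t i = 0; s[i] != '\0'; ++i)
        std::printf("%02X ", static_cast<unsigned char>(s[i]));
    std::printf("\n");
    // -fexec-charset=UTF-8  prints: C3 A9
    // -fexec-charset=LATIN1 prints: E9
}
```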

lichray commented 8 years ago

C++ 2.3/3

The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose value is 0. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific.

So you can only assume that the basic source character set (2.3/1) plus a few additional characters are independent of locale (another way to look at it is that the locale must be compatible with them). The others are locale-specific. This -fexec stuff is not portable.
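
The digit requirement is the notable exception; a quick illustration of what is and isn't guaranteed:

```cpp
// The decimal digits are required to be contiguous and ascending in
// every execution character set, so this holds everywhere:
static_assert('9' - '0' == 9, "digits are contiguous per [lex.charset]");

// No such guarantee exists for letters: in EBCDIC, 'i' is 0x89 but
// 'j' is 0x91, so ASCII-style letter arithmetic ('i' + 1 == 'j')
// silently breaks there.
int main() {}
```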

lichray commented 8 years ago

Of course char has a wild nature within, since we didn't add char8_t (suggested for C++14 in an NB comment, but rejected). With char8_t it would be portable, as literals would have a locale-independent external encoding. On Windows wchar_t can be external, but then you don't need this option; on other systems it simply can't be external.

tahonermann commented 8 years ago

Relevant comment in #11: https://github.com/tahonermann/text_view/issues/11#issuecomment-181698584. Perhaps, with regard to locale relevance, we should stick to discussing this in just one of these issues for now; I suggest #11, since that is where my last comment was added and I don't think anything particularly new has been brought up here.

I would be interested in more information regarding the rejection of the char8_t proposal. Can you direct me to where I can find the NB comment and committee response?

lichray commented 8 years ago

My memory mixed up two things: http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3770.html#GB1 asks for u8 literals to fit into char or for the type to be changed to unsigned char, and in http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2012/n3398.html char8_t was proposed as an alias for unsigned char.

Anyway, the ship has sailed.

lichray commented 8 years ago

For the first one, CWG just required char to be able to represent u8 code units.

tahonermann commented 8 years ago

Visual C++ in Visual Studio 2015 Update 2 CTP now has additional command line options for specifying the (compile-time) execution character set: /execution-charset:<iana-name>|.NNNN and /utf-8 (a synonym for /source-charset:utf-8 /execution-charset:utf-8).

https://blogs.msdn.microsoft.com/vcblog/2016/02/22/new-options-for-managing-character-sets-in-the-microsoft-cc-compiler/

The intention is that text_view's execution_character_encoding type alias reference an encoding compatible with the encoding controlled by this new option. In other words, when the /execution-charset option is specified, it should influence the definition of execution_character_encoding. When that option is not specified, execution_character_encoding should reference an encoding compatible with the active code page for the compiler invocation.

Note that the new /execution-charset option has no effect on the wide execution character encoding (that is always UTF-16LE for Microsoft's compiler), so these options are not relevant for issue #11.
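
For reference, representative invocations with the new options look like this (illustrative; see the linked post for the exact option grammar):

```
:: source and execution charsets both UTF-8 (equivalent to /utf-8)
cl /source-charset:utf-8 /execution-charset:utf-8 main.cpp

:: select the execution charset by code page number (936 = GBK)
cl /execution-charset:.936 main.cpp
```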

lichray commented 8 years ago

Just use u8 literals. This /execution-charset option can, at best, help with porting old applications, AFAICS.

tahonermann commented 8 years ago

Switching to u8 literals would be prohibitively expensive for many applications. One of my goals is to provide support for legacy applications. Note that the /execution-charset option is applicable to such applications.

lichray commented 8 years ago

I doubt there are legacy applications using a charset other than the "C" locale in narrow string literals, since that was not supported by VC.

tahonermann commented 8 years ago

There are tons of such applications. VC has supported characters outside the basic source character set for as long as I can remember. The interpretation of such characters depends on the source encoding. If a source file is UTF-8 with a BOM, VC will decode it as UTF-8. If the source file "looks like" UTF-16, then VC will decode it as UTF-16. Otherwise, the source file is decoded according to the active code page at the time of compilation. The decoded literals are then transcoded to VC's internal encoding (UTF-8), and then transcoded to the (compile-time) execution encoding. Traditionally, options for controlling the compile-time execution encoding were limited. With the recent updates, users now have more control.
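
Summarized, the pipeline described above is:

```
source bytes --(BOM / UTF-16 sniff / active code page)--> internal UTF-8
internal UTF-8 --(/execution-charset or active code page)--> literal code units
```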

lichray commented 8 years ago

I'm not talking about source encoding, which is not relevant here. In VC you can only put UTF-8 characters in narrow string literals, and in that case your UTF-8 view will just work. You cannot use other encodings in VC, so why do we need execution_character_encoding?

tahonermann commented 8 years ago

What makes you believe that VC only supports UTF-8 for narrow string literals? That isn't correct. Internally, characters decoded from the source file are transcoded to UTF-8 before being transcoded to the execution character set, but the execution character set is not UTF-8. The (compile-time) execution character set is traditionally determined by the active code page, but is now under control of the user via the new /execution-charset option.

If VC only supported UTF-8 as the execution character set (which is not the case), then execution_character_encoding would alias the UTF-8 encoding. But, clearly, it would be wrong for execution_character_encoding to alias UTF-8 for all compilers and platforms. This alias provides a way to write portable code that works with implementation defined encodings.

lichray commented 8 years ago

If you mean #pragma setlocale, the post you referred to explains why it doesn't work: https://blogs.msdn.microsoft.com/vcblog/2016/02/22/new-options-for-managing-character-sets-in-the-microsoft-cc-compiler/

tahonermann commented 8 years ago

No, I don't mean #pragma setlocale. The normal operation of the compiler is to use the active code page at compile-time as the execution character set.

lichray commented 8 years ago

OK. I take that back.

But what do you want to do about it if I have a program with GBK narrow string literals?

lichray commented 8 years ago

Since it's not UTF-8, you can't decode it to any other Unicode encoding without forcing every C++ standard library to carry an encoding conversion library (GBK is a subset of GB18030, which is a Unicode encoding; but what about Shift-JIS?). You can't decode it to wchar_t either, since a library should not call setlocale or uselocale; at best you can use newlocale, but the outcome is still bad: that wchar_t is not the same as the wchar_t understood by other libraries.

tahonermann commented 8 years ago

I don't want to do anything about it. I want execution_character_encoding to reflect the encoding that the compiler is actually using. That is all. If the compiler supports multiple options for the (compile-time) execution encoding, then it could supply a separate encoding class for each, or one that has conditional behavior based on the (compile-time) selection. It is perfectly legitimate for the compiler to limit the compile-time execution encoding to the basic execution character set. This is implementation defined behavior and up to the compiler vendor to support what they want to.

None of this forces any C++ standard library to carry an encoding conversion library beyond that which they already do for the compilers they support.

The intent is solely to enable enumerating the code points encoded in string literals in the implementation defined (compile-time) execution encoding. That's it.

tahonermann commented 8 years ago

I want to make it clear: I'm not inventing anything with this support. All compilers have to have a compile-time execution encoding. It may be just the basic execution encoding. If so, great; then execution_character_encoding aliases basic_execution_character_encoding. For compilers that (already) support extended (compile-time) execution character encodings, execution_character_encoding aliases the class for that encoding.

lichray commented 8 years ago

Let's say I use -fexec-charset=gbk in gcc. May libstdc++'s execution_character_encoding alias basic_execution_character_encoding in that case? If that's allowed and libstdc++ does it, the concept is useless, because then I cannot walk over code points. If it's not allowed, there will be strong coupling between what the compiler can support and what the library can support, and compilers may end up generating these classes directly, like lambdas.

After that, I can walk over code points, but the code point type is again implementation-defined; as a user you cannot choose it, so you are not guaranteed to be able to inspect what those characters are. Why would I want to use execution_character_encoding then? How is this different from an ext::iconv_encoding? It looks like execution_character_encoding mandates nothing but the name.

tahonermann commented 8 years ago

The encoding of ordinary and wide string literals is not necessarily the same encoding as the one used at run-time when working with ordinary and wide strings via the run-time null-terminated sequence utilities. There is a requirement that these encodings be compatible (otherwise we get Mojibake at run-time). This is not something that I am inventing. Is there anything about this that you disagree with? I'm only trying to describe present-day reality here.

I am able to inspect the decoded character. The encoding referenced by execution_character_encoding has an associated character set and type that can be used to identify the character (and perhaps transcode it to something else, though I haven't yet defined transcoding interfaces). The underlying code unit sequence has the exact same meaning it would have using the null-terminated sequence utilities at run-time (or at least it had better; otherwise, Mojibake). The encoding and character set are implementation-defined, not unknowable.

The difference is that the encoding referenced by execution_character_encoding is not locale-aware as, presumably, ext::iconv_encoding would be. However, the encoding it references must be compatible with the locale-sensitive encoding used at run-time; otherwise, Mojibake.

lichray commented 8 years ago

Then you may not be able to inspect the code point, as the code point type may be int, containing internal codes for GBK.

Since execution_character_encoding is required to be able to decode the execution character encoding, and libstdc++ cannot bundle libiconv, gcc may have no better choice than generating classes for each encoding iconv can support, decoding them to its official internal codes; that's the cheapest thing to do, but still lots of work. And what would the type of the code point be? It cannot be wchar_t, since in glibc that's UCS-4. To distinguish them, int is chosen, and you as a user can do pretty much nothing with it.

ext::iconv_encoding is not locale-aware. It just takes an extra argument specifying the encoding (here we'd pass the execution character encoding) and decodes it to UTF-32 in char32_t. Since it's not mandated by the standard, libstdc++ can provide it and ask users to link libiconv themselves, and users can further make use of these code points with their Unicode collation library, or ICU.
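
As a sketch, the underlying conversion such a view could perform with plain POSIX iconv might look like this (ext::iconv_encoding itself is hypothetical; this assumes the platform's iconv knows GBK and a little-endian host):

```cpp
// Decode GBK code units into UTF-32 code points, roughly the
// conversion an ext::iconv_encoding-style view would wrap.
#include <iconv.h>
#include <cstdio>
#include <cstring>

int main() {
    iconv_t cd = iconv_open("UTF-32LE", "GBK");
    if (cd == (iconv_t)-1) { std::perror("iconv_open"); return 1; }

    char in[] = "\xc4\xe3\xba\xc3";  // "ni hao" (U+4F60 U+597D) in GBK
    char32_t out[8] = {};
    char *inp = in;
    char *outp = reinterpret_cast<char *>(out);
    std::size_t inleft = std::strlen(in);
    std::size_t outleft = sizeof out;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (std::size_t)-1)
        std::perror("iconv");
    else
        for (std::size_t i = 0; out[i] != U'\0'; ++i)
            std::printf("U+%04X\n", static_cast<unsigned>(out[i]));

    iconv_close(cd);
}
```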

We are better off if this thing is not standardized.

tahonermann commented 8 years ago

The code point type is determined by the character set type that is selected by the encoding class. The encoding, character set, and code point type are all known at compile time, so the code point values are very much inspectable. The code point type is selected to ensure that all code points for the character set can be uniquely represented.
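
Schematically, the layering is as follows (names are illustrative, not text_view's actual classes):

```cpp
// Hypothetical illustration: an encoding selects a character set,
// and the character set fixes a code point type wide enough to
// represent every member uniquely; all of it known at compile time.
struct some_character_set {
    using code_point_type = char32_t;  // one distinct value per character
};

struct some_encoding {
    using character_set_type = some_character_set;
};

using code_point = some_encoding::character_set_type::code_point_type;
static_assert(sizeof(code_point) >= 4, "all code points representable");

int main() {}
```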

execution_character_encoding is not required to be able to decode the run-time execution encoding, only the compile-time execution encoding. Perhaps this is where the confusion lies. I probably erred in using the name execution_character_encoding here.

Perhaps this would help: rename execution_character_encoding to compile_time_execution_character_encoding and replace execution_character_encoding with an encoding class that is locale-aware (and compatible with the existing null-terminated sequence utilities). Does this make things clearer? This encoding would use character as its character type and would have a code point type sufficient to represent the code points of all supported character sets. At run-time, code can then use the character set IDs to determine what character a code point signifies.

lichray commented 8 years ago

I do mean the encoding used at compile time, but the conversion has to be done at run-time, so execution_character_encoding has to be one encoding for one program. For example, if I build a program with -fexec-charset=utf-8, execution_character_encoding needs to support UTF-8. I'm not seeing any confusion here.

lichray commented 8 years ago

What I'm saying is that execution_character_encoding is way too implementation-defined: it adds lots of work for the compiler while still not being useful enough to users.

A locale-aware view can be a pure library feature, useful to users since the code point type is locked down to wchar_t. But I'm not sure whether we really want it, since the streams library does this sufficiently well. I'll leave it to LEWG to answer that.

An iconv-based view may not make it into the standard, but it's also useful to users, as its code point type can be locked down to anything, including char32_t.

But execution_character_encoding's code point type is in the wild. By saying "you can't inspect it", I don't mean that you don't know the type. You know the type; it's GBK_int_t. Does that help anything? There is no isdigit(GBK_int_t); that's what I mean by "I can't inspect it".

tahonermann commented 7 years ago

Closing this issue. I still intend to support narrow string encodings.