Closed: ghost closed this issue 4 years ago
Ah, right, I'd forgotten/ignored non-ASCII collating (for the does-this-need-quoting pieces). I was also making an implicit assumption that the two ends use the same character set (for the message-comparison pieces), which I guess wouldn't be true for cross-compiling. I'm unfamiliar with how EBCDIC and UTF-8 interact; I had thought that EBCDIC also happened to use 7 bits, so the UTF-8 eighth-bit scheme would also work there without changing the underlying character encoding. IIUC you're saying u8'a' is 0x61, regardless. Is that right? So UTF-8 is both (a) a specific mapping between characters and integers and (b) an encoding of those integers into 8-bit octets?
ETA: also, need to figure out why GitHub doesn't email me about issues ...
> IIUC you're saying u8'a' is 0x61, regardless. Is that right? So UTF-8 is both (a) a specific mapping between characters and integers and (b) an encoding of those integers into 8-bit octets?
There is a very deep rabbit hole when we start talking about "source character set" and "source character encoding", which is a very hot topic in SG16 right now. Since I assume the source code of your library is in ASCII or UTF-8, we can sidestep it and simplify many things: let's assume the source code is written in UTF-8 and consumed by the compiler as UTF-8.
Then two things are certain:
#include <iostream>

int main() {
    auto a = 'a';
    std::cout << static_cast<int>(a) << '\n'; // implementation-defined (0x81 in EBCDIC)
    auto b = u8'a';                           // u8 char literals exist only since C++17
    std::cout << static_cast<int>(b) << '\n'; // always 97 (0x61)
}
At least as of right now, there is an understanding that char literals are converted to the "execution character set" regardless of what character set/encoding the source file is written in. On z/OS the default execution character set is EBCDIC, so in our case the compiler will consume the UTF-8 source file as UTF-8 (this may require passing explicit compiler flags in reality), do the phases of translation (which include converting the source code to the "internal compiler encoding used during translation"), see the literal 'a', and convert it from that internal encoding to EBCDIC. The literal u8'a' will be treated differently, but its integer value will always be 97 in the end.
The standardese will most likely change wildly in C++23, but the basic principle will stay the same.
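(To make the "(a) mapping, (b) encoding" distinction from the question above concrete: a minimal sketch, assuming a C++11..17 compiler where u8 string literals have element type char. Unicode maps é to code point U+00E9 = 233; UTF-8 then encodes that single code point as two octets.)

#include <cstdio>

int main() {
    const char s[] = u8"\u00e9"; // "é": one code point, two UTF-8 octets
    for (int i = 0; s[i] != '\0'; ++i)
        std::printf("%#x ", static_cast<unsigned char>(s[i]));
    std::printf("\n"); // prints: 0xc3 0xa9
}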
Ah great, thanks for clarifying -- whenever I google this the results say 'you don't have to care about EBCDIC anymore', if they mention it at all!
I've just pushed a patch for this. Excitingly, C++11 doesn't have UTF-8 char literals, only UTF-8 string literals. I've probably flubbed something.
You are correct, I forgot about that. Yeah, it's a shame that this requires more tricks with literals to get sane code.
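(For illustration, one common shape of such a literal trick, not necessarily what the patch above does: since u8 char literals only arrive in C++17, index a u8 string literal instead. The helper name here is hypothetical, and this only compiles as C++11..17, because u8 string literals switch to char8_t in C++20.)

// Hypothetical C++11 helper: extract the first octet of a u8 string literal.
constexpr char u8char(const char* s) { return s[0]; }

static_assert(u8char(u8"a") == 0x61, "UTF-8 'a' is 0x61 on every platform");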
I've noticed that the code uses narrow character and string literals while the protocol requires UTF-8. This makes the library unusable on z/OS, which uses EBCDIC for narrow literals. In order to actually use UTF-8 literals you need to prefix them with u8. This returns char since C++11 and char8_t since C++20.
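(A minimal sketch of the difference described here, assuming a pre-C++20 compiler so that both literals have element type char; on z/OS the first value printed differs while the second does not.)

#include <cstdio>

int main() {
    const char* narrow = "a";   // execution character set: 0x81 on EBCDIC z/OS, 0x61 on ASCII
    const char* utf8   = u8"a"; // always UTF-8: 0x61 everywhere (char8_t in C++20, so pre-C++20 only)
    std::printf("%#x %#x\n",
                static_cast<unsigned char>(narrow[0]),
                static_cast<unsigned char>(utf8[0]));
}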