Terminology clarifications

ruoso commented 7 years ago

"Native type" vs "Code Unit"

While the Unicode standard does use the term "Code Unit", I would advocate for referring to it as the "native type", meaning "how we decided to represent that encoding natively in the computer". This is important because "code unit" can be confusing in some situations.

For instance, if you're handling UTF-32 with a foreign endianness, the 'native type' will have to have the size and alignment of 'unsigned char', while UTF-32 with native endianness can operate directly on ints.

In both cases, the "code unit" is understood to have the size and alignment of a 32-bit int, but for all intents and purposes, the computer has to deal with octets when handling UTF-32 with foreign endianness.

In fact, if the encoding supports switching at runtime between UTF-32BE and UTF-32LE, the 'native type' has to be the size of an octet, since conversion may be required at runtime.

That is, an int array is not a valid representation for UTF-32 with foreign endianness, and this directly affects how the code can be written. I doubt anyone would consider the "Code Unit" (as the Unicode standard uses the term) to be "char" when dealing with UTF-32.
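To make the octet-level handling concrete, here is a minimal sketch of reading UTF-32BE from a byte buffer regardless of the host's byte order. The helper names `decode_utf32be` and `decode_utf32be_buffer` are hypothetical and are not part of text_view; the sketch also assumes the input has already been validated.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: assemble one code point from four big-endian octets.
// Works the same on little-endian and big-endian hosts because it never
// reinterprets the bytes as an integer in memory.
inline char32_t decode_utf32be(const unsigned char* p) {
    return (static_cast<char32_t>(p[0]) << 24) |
           (static_cast<char32_t>(p[1]) << 16) |
           (static_cast<char32_t>(p[2]) << 8)  |
            static_cast<char32_t>(p[3]);
}

// Hypothetical helper: decode a whole buffer. Trailing bytes that do not
// form a complete group of four are ignored in this sketch.
inline std::vector<char32_t> decode_utf32be_buffer(
        const std::vector<unsigned char>& bytes) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i + 4 <= bytes.size(); i += 4) {
        out.push_back(decode_utf32be(bytes.data() + i));
    }
    return out;
}
```

Here the storage element is `unsigned char`, even though each decoded value is a 32-bit quantity, which is the distinction the paragraphs above are drawing.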

Character

The term 'character' is ambiguous, as it collides with:

the "char" type
the "character", addressed by a codepoint
the "user perceived character", also known as "grapheme cluster"

I think the only sane option is to never use it at all, and to always use "codepoint" when you mean a codepoint and "grapheme" when you mean a grapheme.
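A small sketch of how the three meanings diverge for a single user-perceived character, using "e" followed by U+0301 (combining acute accent) encoded as UTF-8. The function name `count_code_points_utf8` is hypothetical, and the sketch assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in well-formed UTF-8 by counting
// every byte that is not a continuation byte (continuation bytes have the
// bit pattern 10xxxxxx).
inline std::size_t count_code_points_utf8(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For `std::string s = "e\xCC\x81";`, `s.size()` is 3 (three `char` values), `count_code_points_utf8(s)` is 2 (U+0065 and U+0301), and a reader sees one grapheme cluster. Grapheme segmentation itself needs the Unicode segmentation rules and is not attempted here.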

tahonermann commented 7 years ago

The C++ standard is due for some terminology updates. I briefly spoke with Richard Smith about it earlier this year, and I have it on my to-do list to submit a paper proposing some cleanup. However, it isn't particularly high on that list.

Have you read the terminology definitions provided in the text_view readme (https://github.com/tahonermann/text_view#terminology)? I've tried to mostly adhere to Unicode terminology, but have deviated in some cases. For example, I don't distinguish between encoding forms and encoding schemes. If you think any of these definitions is at odds with the Unicode standard, please let me know.

I'm not sure how much you've looked at the UTF-32 encodings provided by text_view. There are four:

utf32_encoding: Uses char32_t as the code unit type, assumes native endian.
utf32be_encoding: Uses char as the code unit type, assumes big-endian.
utf32le_encoding: Uses char as the code unit type, assumes little-endian.
utf32bom_encoding: Uses char as the code unit type, assumes big-endian unless a BOM is present.

I think code unit is an appropriate term for each of these. "native type" wouldn't carry any meaning for the latter three.
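As an illustration of the BOM-dependent case in the list above, here is a minimal sketch of choosing a byte order from the leading octets, defaulting to big-endian when no BOM is present. The names `Utf32ByteOrder`, `BomResult`, and `detect_utf32_bom` are hypothetical; this is not text_view's implementation, only one way the described behavior could look.

```cpp
#include <cstddef>

// Hypothetical types for this sketch only.
enum class Utf32ByteOrder { big, little };

struct BomResult {
    Utf32ByteOrder order;    // byte order to use for the rest of the input
    std::size_t bom_length;  // number of leading octets occupied by the BOM
};

// Inspect the first four octets. U+FEFF serializes as 00 00 FE FF in
// UTF-32BE and as FF FE 00 00 in UTF-32LE. Without a BOM, fall back to
// big-endian, matching the default described in the list above.
inline BomResult detect_utf32_bom(const unsigned char* p, std::size_t n) {
    if (n >= 4) {
        if (p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF) {
            return {Utf32ByteOrder::big, 4};
        }
        if (p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00) {
            return {Utf32ByteOrder::little, 4};
        }
    }
    return {Utf32ByteOrder::big, 0};
}
```

Because the byte order is only known after inspecting the data, the input has to be examined octet by octet, so the element type is `char`-sized.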

I have struggled with the term character due to its historical significance in C++ being somewhat at odds with Unicode and modern terminology. I like having consistency between the terms "character set" and "character". I wouldn't want to keep "character set" and change "character" to "grapheme". And I think switching to "grapheme set" would be too much of a leap from historical usage; "character set" is pretty firmly entrenched.

Long term, I think the right approach is to update the C++ standard to use modern terminology. There isn't anything that can reasonably be done about the name of the 'char' type. If text_view ever gets standardized, the committee will have the opportunity to rename as it sees fit.

tahonermann commented 7 years ago

Closing this issue. I think the existing terminology being used is appropriate given historical and modern usage. As mentioned in the previous comment, should text_view be standardized, the committee will review and rename as it deems appropriate.

tahonermann / text_view

Terminology clarifications #24