tahonermann / text_view

A C++ concepts and range based character encoding and code point enumeration library
MIT License

Codepoint does not depend on character set #25

Closed ruoso closed 7 years ago

ruoso commented 7 years ago

A codepoint in the Unicode standard is an absolute value that does not depend on the encoding at all. You should be able to compare codepoints as they are read from texts in different encodings.

In fact, this association between codepoint and character set is what is preventing you from naturally having transliterations "just work". If the codepoint were as specified in the standard, doing the transliteration would literally be a matter of traversing an input iterator and setting the values in an output iterator.
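As a rough illustration of the "traverse an input iterator, write to an output iterator" idea, here is a minimal sketch that reads codepoints out of UTF-8 and writes them to any output iterator (the result is effectively UTF-32). The names `decode_utf8` and `utf8_to_code_points` are hypothetical and not part of text_view, and the decoder assumes well-formed input and does no validation.

```cpp
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper, not part of text_view: reads one codepoint from
// well-formed UTF-8 and advances the iterator. No validation is performed.
template <class InputIt>
char32_t decode_utf8(InputIt& it) {
    unsigned char lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80) return lead;  // single-byte (ASCII) sequence

    // Number of continuation bytes implied by the lead byte.
    int trailing = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;

    // Keep only the payload bits of the lead byte (5, 4, or 3 bits).
    char32_t cp = lead & (0x3F >> trailing);

    // Each continuation byte contributes its low 6 bits.
    while (trailing-- > 0) {
        cp = (cp << 6) | (static_cast<unsigned char>(*it++) & 0x3F);
    }
    return cp;
}

// Hypothetical helper: copies every codepoint in [first, last) to out.
template <class InputIt, class OutputIt>
OutputIt utf8_to_code_points(InputIt first, InputIt last, OutputIt out) {
    while (first != last) {
        *out++ = decode_utf8(first);
    }
    return out;
}
```

Called with `std::back_inserter` on a `std::vector<char32_t>`, this collects the codepoints of a UTF-8 `std::string`. A production decoder would also have to reject overlong forms, surrogates, and truncated sequences.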

ruoso commented 7 years ago

Ok, I think I get what you were getting at... http://www.unicode.org/reports/tr17/#CodedCharacterSet uses "Coded Character Set" to distinguish what it calls "legacy encodings" from the Unicode characters, and notes that the "Unicode encodings" have "exactly the same repertoire and mapping" as the one described by the Unicode standard.

However, this distinction is useful mainly from a theoretical point of view, in the sense that it would allow you to reason about legacy encodings without converting the natively stored value into Unicode codepoints.

In practice, having full support for it would require creating parallel character metadata databases to allow performing all the appropriate operations, such as "lower-case" or "upper-case", when the Unicode character database already handles all those cases.

The only precondition is that even when dealing with "legacy encodings" you should always read a Unicode codepoint out of the legacy encoding, instead of having to interoperate between different character databases.

Basically, the only database you would need is the mapping from the data in the legacy encoding to the Unicode codepoint, after which you would be able to apply all the same reasoning you can with the Unicode encodings.
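To make the "mapping database" idea concrete, here is a small sketch for one legacy encoding, ISO-8859-15. It differs from ISO-8859-1 (whose byte values equal the first 256 Unicode codepoints) in only eight positions, so the whole mapping fits in a switch. The function name `iso8859_15_to_unicode` is an assumption made for this example, not an existing API.

```cpp
// Illustrative only: maps one ISO-8859-15 byte to its Unicode codepoint.
// The function name is hypothetical.
char32_t iso8859_15_to_unicode(unsigned char byte) {
    switch (byte) {
        case 0xA4: return 0x20AC;  // EURO SIGN
        case 0xA6: return 0x0160;  // LATIN CAPITAL LETTER S WITH CARON
        case 0xA8: return 0x0161;  // LATIN SMALL LETTER S WITH CARON
        case 0xB4: return 0x017D;  // LATIN CAPITAL LETTER Z WITH CARON
        case 0xB8: return 0x017E;  // LATIN SMALL LETTER Z WITH CARON
        case 0xBC: return 0x0152;  // LATIN CAPITAL LIGATURE OE
        case 0xBD: return 0x0153;  // LATIN SMALL LIGATURE OE
        case 0xBE: return 0x0178;  // LATIN CAPITAL LETTER Y WITH DIAERESIS
        default:   return byte;    // identical to ISO-8859-1 elsewhere
    }
}
```

Once a byte has gone through such a mapping, the result can be compared directly with a codepoint decoded from UTF-8 or UTF-16, which is the workflow described above.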

tahonermann commented 7 years ago

A code point is an integral value that denotes an abstract character in some character set. As such, a code point is meaningless by itself. Within text_view, the association between a code point and the abstract character it denotes is maintained by the character class. Note that all of the Unicode encodings use character as their character type. So, transcoding between, for example, UTF-8 and UTF-32, can be done today just using std::copy(). However, much better performance could be attained with a more specialized interface.
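The following is a simplified sketch of what it means for a character type to carry its character set. It is not text_view's actual definition: the `character` template below and the tag types `unicode_charset` and `latin9_charset` are assumptions made for illustration.

```cpp
#include <cstdint>

// Hypothetical tag types naming two character sets.
struct unicode_charset {};
struct latin9_charset {};  // the character set behind ISO-8859-15

// A character is a code point plus, in its type, the character set
// that gives the code point its meaning.
template <class CharacterSet>
struct character {
    using character_set_type = CharacterSet;
    std::uint32_t code_point;
};

// Equality is only defined between characters of the same character set.
// Comparing character<unicode_charset> with character<latin9_charset>
// fails to compile, because template argument deduction finds no single CS.
template <class CS>
bool operator==(character<CS> a, character<CS> b) {
    return a.code_point == b.code_point;
}
```

Under this sketch, the value `0xA4` tagged with `latin9_charset` denotes the euro sign, while `0xA4` tagged with `unicode_charset` denotes U+00A4 CURRENCY SIGN. The same integer names two different abstract characters, which is why the bare value is not compared across character sets.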

Support for legacy encodings is a primary objective for text_view. The intent is that the text_view interfaces support the encodings historically used for ordinary and wide strings as well as the Unicode encodings.

Text_view does not currently support character properties (Unicode or otherwise). I would like to see that support added. The approach I've had in mind is to create character property interfaces specific to Unicode (probably in a std::[text::]unicode namespace). For ordinary and wide strings, the standard already provides the C interfaces for querying character properties and performing operations like those you mentioned. I have thought about adding generic interfaces for any character. The implementations of those would presumably transcode to Unicode as necessary and utilize the Unicode database, but an implementation could then elect to specialize the interface for specific character sets to optimize performance. I'm not sure there is a significant need for this though. Regardless, I need to get transcoding interfaces designed first!
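A generic property interface with an optional per-character-set fast path could look roughly like the sketch below. Everything in it is hypothetical: `tagged_character`, `unicode_set`, `ascii_set`, `is_lowercase`, and in particular `ucd_is_lowercase`, which is a stub standing in for a real Unicode Character Database query and only knows the ASCII letters.

```cpp
// Stub standing in for a Unicode Character Database lookup. A real
// implementation would consult the UCD; this one only handles a-z.
inline bool ucd_is_lowercase(char32_t cp) {
    return cp >= U'a' && cp <= U'z';
}

// Hypothetical character set tags. Each knows how to map its own
// code points to Unicode; both mappings here are the identity.
struct unicode_set { static char32_t to_unicode(char32_t cp) { return cp; } };
struct ascii_set   { static char32_t to_unicode(char32_t cp) { return cp; } };

template <class CharacterSet>
struct tagged_character {
    char32_t code_point;
};

// Generic path: transcode to Unicode, then query the Unicode database.
template <class CharacterSet>
bool is_lowercase(tagged_character<CharacterSet> c) {
    return ucd_is_lowercase(CharacterSet::to_unicode(c.code_point));
}

// Character-set-specific overload: ASCII can answer with a range check
// and skip both the transcoding step and the database lookup.
inline bool is_lowercase(tagged_character<ascii_set> c) {
    return c.code_point >= 0x61 && c.code_point <= 0x7A;
}
```

Overload resolution prefers the non-template `ascii_set` overload when it matches, so callers write `is_lowercase(c)` in both cases and get the specialized path for free.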

tahonermann commented 7 years ago

Closing this issue. As described in the previous comment, this appears to be a terminology issue. The association between a code point value and an abstract character is maintained by the character set type. Each character holds a code point value and specifies its associated character set.