Closed ghost closed 4 years ago
As far as I know, the Unicode Standard never defines what a character is. The closest term to a character is grapheme cluster.
It is true that it does not define what a character is. It does define "abstract character", for example. However, the vast majority of the uses of the word "character" in the Unicode Standard refer to concepts similar to code points, not to grapheme clusters. I think the only times the Unicode Standard uses the word to mean something more like a grapheme cluster than a code point, it does so with scare quotes ('“character”', or 'end-user “character”'), and those uses are restricted to discussions contrasting code point-like concepts with grapheme-like concepts.
I would just change the specification to say "as a sequence of code points".
I would just change the specification to say "as a sequence of code points".
But why would we then define that the One True Way to iterate Unicode text is by code points? We would be repeating the same mistake basic_string made.
Remember that algorithms usually use just std::begin and std::end. We would need a different class to use algorithms with non-code-point iterators.
This is a different discussion, though. (And it's one that we will need to have when we start seriously working on this). "code point" is the intended meaning of the proposal as is, so this change would fix the ambiguity of terms without changing the intended meaning.
Even if the paper would use the term "code point", the name of the class is ambiguous.
And iterating code points is not that common actually. For file and network I/O you need the number of code units. For rendering you need the number of grapheme clusters.
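To make the difference between these counts concrete, here is a minimal sketch for UTF-8 input. The function names are hypothetical and the code assumes well-formed UTF-8. Counting grapheme clusters is left out because it needs Unicode property data.

```cpp
#include <cstddef>
#include <string_view>

// Number of UTF-8 code units: one per byte of storage. This is the number
// that file and network I/O care about.
inline std::size_t count_code_units(std::string_view utf8) {
    return utf8.size();
}

// Number of code points, assuming well-formed UTF-8: every code point
// contributes exactly one byte that is not a continuation byte (10xxxxxx).
inline std::size_t count_code_points(std::string_view utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For "e" followed by U+0301 COMBINING ACUTE ACCENT (bytes 65 CC 81), this gives 3 code units and 2 code points, while a user would see a single grapheme cluster.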
Even if the paper would use the term "code point", the name of the class is ambiguous.
Ok, that's fair enough.
Small correction, though: the number of grapheme clusters is actually irrelevant for rendering. The number of glyphs may be useful, but that doesn't really map to grapheme clusters and is information that cannot be retrieved without a font at hand. Note that grapheme clusters aren't even useful to separate a string before doing glyph lookup, as glyphs may span grapheme cluster boundaries.
Grapheme clusters are useful for user interaction, like selection, moving with the left/right arrows, or using the Delete key (but not Backspace! For that you are more likely to want code points).
Yet text_view provides only a single pair of begin and end functions. I think we should standardize code_point_sequence_view because it has an unambiguous name. After that we can standardize grapheme_cluster_sequence_view and higher-level stuff.
text_view has been due for some updates for a long time now; I just haven't had time to get to it.
We've acknowledged that there are use cases for all of code point, (extended) grapheme cluster, word, sentence, etc... enumeration. I'm quite sure we'll end up providing views for code points and (extended) grapheme clusters. What names will be proposed for those is TBD.
I think there is a lot of support for types named std::text and std::text_view. We've had numerous discussions in SG16 and at committee meetings regarding what the value type of a std::text or std::text_view should be. @Lyberta, it sounds like you might prefer avoiding these names in favor of more explicit ones. I fear that may turn users off from using them though. If a user first has to understand the distinction of code points and grapheme clusters, we'll lose some support.
If a user first has to understand the distinction of code points and grapheme clusters, we'll lose some support.
But if the user doesn't understand the distinction, we will be left with buggy code. The current design makes it easy to use the API incorrectly.
But if the user doesn't understand the distinction, we will be left with buggy code. The current design makes it easy to use the API incorrectly.
I agree. I think the current consensus is that we'll want to provide grapheme clusters as the default "character" that users work with, but provide access to the code point (and code unit) sequence in other ways. @tzlaine's Boost.Text work prototypes this approach.
I'd say text and text_view shouldn't have begin and end functions. They should have to_code_units, to_code_points, to_grapheme_clusters that return classes with unambiguous names and purposes.
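A rough sketch of what such an interface could look like, assuming UTF-16 storage. The class name utf16_text_view is hypothetical, and to_code_points returns a decoded copy only to keep the example short; a real design would presumably return a lazy view type. to_grapheme_clusters is not shown because it requires Unicode property data.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sketch: a view over UTF-16 text that deliberately has no
// begin()/end(); callers must name the layer they want to iterate.
class utf16_text_view {
public:
    explicit utf16_text_view(std::u16string_view units) : units_(units) {}

    // Code unit layer: the stored sequence itself.
    std::u16string_view to_code_units() const { return units_; }

    // Code point layer: surrogate pairs are combined into one code point.
    // Unpaired surrogates are replaced with U+FFFD.
    std::u32string to_code_points() const {
        std::u32string out;
        for (std::size_t i = 0; i < units_.size(); ++i) {
            const char32_t unit = units_[i];
            const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
            const bool has_low = i + 1 < units_.size() &&
                                 units_[i + 1] >= 0xDC00 && units_[i + 1] <= 0xDFFF;
            if (is_high && has_low) {
                const char32_t low = units_[i + 1];
                out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
                ++i;  // the low surrogate has been consumed
            } else if (unit >= 0xD800 && unit <= 0xDFFF) {
                out.push_back(0xFFFD);
            } else {
                out.push_back(unit);
            }
        }
        return out;
    }

private:
    std::u16string_view units_;
};
```

With this shape, a range-based for loop directly over the text object does not compile, so the caller always has to pick a layer explicitly.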
I'd say text and text_view shouldn't have begin and end functions. They should have to_code_units, to_code_points, to_grapheme_clusters that return classes with unambiguous names and purposes.
I appreciate the clarity of intent in providing such an interface, but in practice, I think it would make for a cumbersome type to use. Programmers that aren't experts in Unicode don't want to have to worry about these distinctions; and in fact, worrying about these could be a distraction from what they are actually trying to get done. I also worry about what a text_iterator is in this model. If all of the encoding layers are on equal ground, can I safely and efficiently convert iterators across them? If I have a code point iterator, how do I call a function that was written to take a grapheme cluster iterator? The answer can't be, don't do that, pass a text_view object around instead because how do I construct a text_view object from code unit or code point iterators with appropriate assurances that a grapheme cluster hasn't been split?
I believe that to reach most programmers, we need to provide simple types that, for most purposes, just do the right thing by default, but expose the underlying data as needed for experts. Think of a need to search some text for a particular "character". Let's say the character to match is a member of the basic source character set, 'X' for example. If the programmer has to be aware that 'X' can have combining code points and that the grapheme cluster interfaces must therefore be used unless matching a base character with combining character(s) is desired, then we've already lost. We need to ensure that the result of something like find(t, 'X') is 1) useful, and 2) the right choice for most purposes. I think the answer is that such a function should 1) match grapheme clusters (not code points, certainly not code units), and 2) return a character_reference (or similar) type from which a grapheme cluster iterator can be obtained (and then lowered to a code point or code unit iterator if needed).
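The 'X' case can be illustrated with a small sketch. Everything here is hypothetical: the function names are made up, and is_combining_mark is a crude stand-in that only knows the Combining Diacritical Marks block (U+0300 to U+036F). Real grapheme cluster segmentation follows UAX #29 and needs Unicode property data.

```cpp
#include <cstddef>
#include <string_view>

// Crude stand-in for grapheme cluster rules (assumption for this sketch):
// only U+0300..U+036F is treated as extending the previous code point.
inline bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Code point search: reports a hit even when `needle` is only the base
// of a longer cluster such as X + U+0301.
inline std::size_t find_code_point(std::u32string_view text, char32_t needle) {
    return text.find(needle);
}

// Cluster-aware search: a hit counts only if `needle` is not followed by
// a combining mark, i.e. it forms a cluster on its own under the
// simplified rule above.
inline std::size_t find_standalone(std::u32string_view text, char32_t needle) {
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] != needle) {
            continue;
        }
        const bool extended = i + 1 < text.size() && is_combining_mark(text[i + 1]);
        if (!extended) {
            return i;
        }
    }
    return std::u32string_view::npos;
}
```

For the text "X" + U+0301 + " and X", the code point search reports index 0 (the accented X), while the cluster-aware search reports index 7, the plain X at the end.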
Within SG16, consensus has been moving towards making text and text_view grapheme cluster based (perhaps with appropriate abstractions for legacy encoding support), but with access to the code unit and code point sequence exposed and the ability to manipulate the code unit sequence (preferably by lowering a grapheme cluster or code point iterator to a code unit iterator so that boundaries are maintained properly for all encoding levels).
Programmers that aren't experts in Unicode don't want to have to worry about these distinctions
Yes, iterating grapheme clusters would be the least surprising behavior to novice programmers. This would be a rare occurrence of a string type not being broken.
If all of the encoding layers are on equal ground, can I safely and efficiently convert iterators across them? If I have a code point iterator, how do I call a function that was written to take a grapheme cluster iterator?
Converting to a lower level is trivial, while converting to a higher level is not. We will need helper functions to do this.
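As an illustration of why lowering is the easy direction, here is a minimal sketch of a code point iterator over well-formed UTF-8 that exposes its code unit position through base(). The class name and the base() member are assumptions for this example, not part of any proposal. Going the other way, from an arbitrary code unit position to a code point iterator, would first require checking that the position is a code point boundary.

```cpp
#include <cstddef>
#include <iterator>

// Hypothetical iterator that decodes well-formed UTF-8 on the fly.
// Tagged as an input iterator because operator* returns a value rather
// than a reference.
class utf8_code_point_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    explicit utf8_code_point_iterator(const char* p = nullptr) : p_(p) {}

    // Lowering: the code unit position where the current code point starts.
    const char* base() const { return p_; }

    char32_t operator*() const {
        const unsigned char lead = static_cast<unsigned char>(*p_);
        const std::size_t len = sequence_length(lead);
        // Keep the payload bits of the lead byte, then append 6 bits per
        // continuation byte.
        char32_t cp = (len == 1) ? lead : (lead & (0x7F >> len));
        for (std::size_t i = 1; i < len; ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(p_[i]) & 0x3F);
        }
        return cp;
    }

    utf8_code_point_iterator& operator++() {
        p_ += sequence_length(static_cast<unsigned char>(*p_));
        return *this;
    }
    utf8_code_point_iterator operator++(int) {
        utf8_code_point_iterator old = *this;
        ++*this;
        return old;
    }

    friend bool operator==(const utf8_code_point_iterator& a,
                           const utf8_code_point_iterator& b) {
        return a.p_ == b.p_;
    }
    friend bool operator!=(const utf8_code_point_iterator& a,
                           const utf8_code_point_iterator& b) {
        return a.p_ != b.p_;
    }

private:
    // Sequence length in code units, derived from the lead byte.
    static std::size_t sequence_length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead >> 5) == 0x06) return 2;
        if ((lead >> 4) == 0x0E) return 3;
        return 4;
    }

    const char* p_;
};
```

A pair of these iterators works with generic algorithms such as std::distance, and it.base() hands the position back to code that operates on code units.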
If I have a code point iterator, how do I call a function that was written to take a grapheme cluster iterator? The answer can't be, don't do that, pass a text_view object around instead because how do I construct a text_view object from code unit or code point iterators with appropriate assurances that a grapheme cluster hasn't been split?
Consider std algorithms such as std::for_each, std::distance, std::rotate, etc. They need a pair of iterators (or, in the future, a range); they don't need to know what text is or what its layers are. We will need a type for each layer so that the code can stay generic.
If the programmer has to be aware that 'X' can have combining code points and that the grapheme cluster interfaces must therefore be used unless matching a base character with combining character(s) is desired, then we've already lost.
No, the programmer lost. I don't want to hide bugs until a later time. Look at what raw pointers and basic_string have done: an endless number of bugs that cost an insane amount of money and manpower to maintain. Again, yes, if text_view is just a novice-friendly name for grapheme_cluster_sequence_view then I'm fine.
I've implemented CodePointSequence for my purposes, and after looking at Boost.Text and the text_view paper I think this design is the most promising:
    // Code unit layer: owns (or views) the bytes and handles endianness.
    template <TextEncoding ET, std::endian Endianness = std::endian::native, typename Allocator = std::allocator<std::byte>>
    class code_unit_sequence;

    template <TextEncoding ET, std::endian Endianness = std::endian::native>
    class code_unit_sequence_view;

    // Concepts for the code unit layer (Concepts TS syntax).
    template <typename T>
    concept bool CodeUnitSequence();

    template <typename T>
    concept bool CodeUnitSequenceView();

    // Code point layer: built on top of any type modelling the concepts above.
    template <CodeUnitSequence Container, TextEncoding ET = default_encoding_type_t<Container>>
    class code_point_sequence;

    template <CodeUnitSequenceView VT, TextEncoding ET = default_encoding_type_t<VT>>
    class code_point_sequence_view;
I think having separate big-endian and little-endian encodings is not useful. Endianness matters only at the byte level so there should be class templates that handle it. code_unit_sequence and code_unit_sequence_view are such classes. basic_string can be fine too although it would be limited to native endianness (or extremely complex char_traits that nobody will bother with). The rest of the code should be completely agnostic to endianness.
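A small sketch of what handling endianness only at the byte level could mean for UTF-16. The byte_order enum is a stand-in for std::endian so the example also builds as C++17, unsigned char is used instead of std::byte for brevity, and the function name is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for std::endian (assumption for this sketch).
enum class byte_order { little, big };

// Assemble UTF-16 code units from raw bytes in the given byte order.
// A trailing odd byte is ignored. Everything above this function sees
// only code unit values and never needs to know the byte order.
inline std::vector<char16_t> load_utf16_code_units(const unsigned char* bytes,
                                                   std::size_t size,
                                                   byte_order order) {
    std::vector<char16_t> units;
    units.reserve(size / 2);
    for (std::size_t i = 0; i + 1 < size; i += 2) {
        const unsigned first = bytes[i];
        const unsigned second = bytes[i + 1];
        const unsigned value = (order == byte_order::big)
                                   ? (first << 8) | second
                                   : (second << 8) | first;
        units.push_back(static_cast<char16_t>(value));
    }
    return units;
}
```

The bytes 20 AC 00 41 read as big-endian and the bytes AC 20 41 00 read as little-endian both yield the same code units, U+20AC and U+0041, so code point decoding on top does not need an endianness parameter.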
code_point_sequence builds on top of the CodeUnitSequence concept. It will provide bidirectional iterators that return a proxy type, which will reallocate the underlying buffer in case the number of code units at the referenced code point position changes.
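The buffer adjustment such a proxy has to perform on assignment can be sketched for UTF-8 stored in a std::string. The function names are hypothetical, and the code assumes well-formed input, a valid scalar value, and an offset that points at the first code unit of a code point.

```cpp
#include <cstddef>
#include <string>

// Encode one code point as UTF-8 (assumes a valid Unicode scalar value).
inline std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Length in code units of the sequence that starts with `lead`.
inline std::size_t sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead >> 5) == 0x06) return 2;
    if ((lead >> 4) == 0x0E) return 3;
    return 4;
}

// What assigning through a code point proxy amounts to: replace the old
// code unit sequence with the new one. std::string::replace grows or
// shrinks the buffer when the two lengths differ.
inline void assign_code_point(std::string& utf8, std::size_t offset, char32_t cp) {
    const std::size_t old_len =
        sequence_length(static_cast<unsigned char>(utf8[offset]));
    utf8.replace(offset, old_len, encode_utf8(cp));
}
```

Replacing U+00E9 (2 code units) with U+20AC (3 code units) shifts every later code unit by one, which is why iterators into such a sequence cannot simply be raw pointers that stay valid across writes.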
code_point_sequence_view is trivial.
Yes, iterating grapheme clusters would be the least surprising behavior to novice programmers. This would be a rare occurrence of a string type not being broken.
Funnily enough, Swift does this, and their string type is currently broken because of it https://bugs.swift.org/browse/SR-375.
I think that a string type that has .characters.count is fundamentally broken.
Also I wanted to say that I would like to implement code_unit_sequence and code_point_sequence and produce a formal paper. I just want a blessing.
I think that a string type that has .characters.count is fundamentally broken.
Well, that's just the most trivial way of demonstrating how it's broken. Iterating over Swift strings also produces similarly broken results.
Also I wanted to say that I would like to implement code_unit_sequence and code_point_sequence and produce a formal paper. I just want a blessing.
Come join us on Slack https://cpplang.slack.com/messages/sg16-unicode, or the mailing list http://www.open-std.org/mailman/listinfo/unicode, or even join our next teleconference http://www.open-std.org/pipermail/unicode/2018-June/000037.html
I just want a blessing.
@Lyberta No blessing is necessary of course. We (SG16) have been wrestling with this question for a while now, but haven't made any decisions one way or another. The next pre-meeting mailing is quite some time away and I do plan on scheduling time to discuss code points vs grapheme clusters at our meetings in the not too distant future. So, I'll echo what Martinho said; join us on Slack, the mailing list, and our telecons (invite info is on the mailing list and I can send you an invite on request if you like). You'll get more immediate feedback and be better able to contribute to our direction than by writing a paper (at least in the short term). I do think we'll want to write a paper on this subject at some point, but I think it would be great if it were collaboratively developed within SG16 with a goal of presenting an agreed upon approach with pros/cons to the rest of the committee.
I'm closing this issue as non-actionable since there does not appear to be consensus for a particular direction. The concerns raised will need to be addressed as part of https://github.com/sg16-unicode/sg16/issues/31. Anyone wishing to propose a specific solution is encouraged to open a new issue or to submit a paper.
I've read the text_view proposal and I think it uses very ambiguous terminology such as:
As far as I know, the Unicode Standard never defines what a character is. The closest term to a character is grapheme cluster.
Second, there are many ways to iterate over Unicode data such as:
Yet text_view provides only a single pair of begin and end functions. I think we should standardize code_point_sequence_view because it has an unambiguous name. After that we can standardize grapheme_cluster_sequence_view and higher-level stuff.