sg16-unicode / sg16

scalar_value_sequence[_view] #16

Closed ghost closed 4 years ago

ghost commented 6 years ago

I've read the text_view proposal and I think it uses very ambiguous terminology such as:

Objects of basic_text_view class template specialization type provide a view of an underlying code unit sequence as a sequence of characters.

As far as I know, the Unicode Standard never defines what a character is. The closest term for a character is grapheme cluster.

Second, there are many ways to iterate over Unicode data such as:

Yet text_view provides only a single pair of begin and end functions. I think we should standardize code_point_sequence_view because it has an unambiguous name. After that we can standardize grapheme_cluster_sequence_view and higher-level stuff.

rmartinho commented 6 years ago

As far as I know, the Unicode Standard never defines what a character is. The closest term for a character is grapheme cluster.

It is true that it does not define what a character is. It defines "abstract character", for example. However, the vast majority of the uses of the word "character" in the Unicode Standard refer to concepts similar to code points, not similar to grapheme clusters. I think the only time the Unicode Standard uses the word to mean something more like a grapheme cluster than a code point, it does so with scare quotes ('“character”', or 'end-user “character”') and those uses are restricted to discussions contrasting code point-like concepts with grapheme-like concepts.

I would just change the specification to say "as a sequence of code points".

ghost commented 6 years ago

I would just change the specification to say "as a sequence of code points".

But why would we then define that the One True Way to iterate Unicode text is code points? We are repeating the same mistake basic_string did.

Remember that algorithms usually use just std::begin and std::end. We would need a different class to use algorithms with non-code-point iterators.

rmartinho commented 6 years ago

This is a different discussion, though. (And it's one that we will need to have when we start seriously working on this). "code point" is the intended meaning of the proposal as is, so this change would fix the ambiguity of terms without changing the intended meaning.

ghost commented 6 years ago

Even if the paper used the term "code point", the name of the class is ambiguous.

And iterating code points is not that common actually. For file and network I/O you need the number of code units. For rendering you need the number of grapheme clusters.

rmartinho commented 6 years ago

Even if the paper used the term "code point", the name of the class is ambiguous.

Ok, that's fair enough.

Small correction, though: the number of grapheme clusters is actually irrelevant for rendering. The number of glyphs may be useful, but that doesn't really map to grapheme clusters and is information that cannot be retrieved without a font at hand. Note that grapheme clusters aren't even useful to separate a string before doing glyph lookup, as glyphs may span grapheme cluster boundaries.

Grapheme clusters are useful for user interaction, like selection, using the left/right arrows, or using the Delete key (but not Backspace! for that you are more likely to want code points)
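
To make the layers in this exchange concrete, here is a minimal sketch (not from the thread) that counts code units and code points for the same text. It assumes the input is valid UTF-8; `count_code_points` and `e_acute` are hypothetical names chosen for illustration.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: counts code points in a string assumed to be valid UTF-8.
// Every code point starts with exactly one byte that is not a continuation
// byte (continuation bytes have the bit pattern 10xxxxxx).
constexpr std::size_t count_code_points(std::string_view utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT, encoded as UTF-8.
constexpr std::string_view e_acute = "e\xCC\x81";
static_assert(e_acute.size() == 3, "three code units");
static_assert(count_code_points(e_acute) == 2, "two code points");
```

A user would perceive this text as one grapheme cluster, but computing that count requires the segmentation rules and property data of the Unicode Standard, so it is not attempted here.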

tahonermann commented 6 years ago

Yet text_view provides only a single pair of begin and end functions. I think we should standardize code_point_sequence_view because it has an unambiguous name. After that we can standardize grapheme_cluster_sequence_view and higher-level stuff.

text_view has been due for some updates for a long time now; I just haven't had time to get to it.

We've acknowledged that there are use cases for all of code point, (extended) grapheme cluster, word, sentence, etc... enumeration. I'm quite sure we'll end up providing views for code points and (extended) grapheme clusters. What names will be proposed for those is TBD.

I think there is a lot of support for types named std::text and std::text_view. We've had numerous discussions in SG16 and at committee meetings regarding what the value type of a std::text or std::text_view should be. @Lyberta, it sounds like you might prefer avoiding these names in favor of more explicit ones. I fear that may turn users off from using them though. If a user first has to understand the distinction of code points and grapheme clusters, we'll lose some support.

ghost commented 6 years ago

If a user first has to understand the distinction of code points and grapheme clusters, we'll lose some support.

But if the user doesn't understand the distinction, we will be left with buggy code. Current design makes it easy to use the API incorrectly.

tahonermann commented 6 years ago

But if the user doesn't understand the distinction, we will be left with buggy code. Current design makes it easy to use the API incorrectly.

I agree. I think the current consensus is that we'll want to provide grapheme clusters as the default "character" that users work with, but provide access to the code point (and code unit) sequence in other ways. @tzlaine's Boost.Text work prototypes this approach.

ghost commented 6 years ago

I'd say text and text_view shouldn't have begin and end functions. They should have to_code_units, to_code_points, to_grapheme_clusters that return classes with unambiguous names and purposes.
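
One possible reading of this suggestion, as a rough sketch only: the class name `text_view_sketch` is hypothetical, the input is assumed to be valid UTF-8, and the code point accessor decodes eagerly into a vector for brevity where a real design would presumably return a lazy view. A `to_grapheme_clusters` accessor is left out because it would need Unicode segmentation data.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical sketch: a text view with no begin()/end() of its own.
// Callers must name the layer they want to iterate.
class text_view_sketch {
public:
    explicit text_view_sketch(std::string_view utf8) : utf8_(utf8) {}

    // Code unit layer: the underlying bytes, unchanged.
    std::string_view to_code_units() const { return utf8_; }

    // Code point layer: decoded from UTF-8 that is assumed to be valid.
    std::vector<char32_t> to_code_points() const {
        std::vector<char32_t> out;
        std::size_t i = 0;
        while (i < utf8_.size()) {
            unsigned char lead = static_cast<unsigned char>(utf8_[i]);
            // Sequence length is determined by the lead byte.
            std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
            assert(i + len <= utf8_.size());  // precondition: no truncated sequence
            // Keep the payload bits of the lead byte, then append 6 bits per
            // continuation byte.
            char32_t cp = static_cast<char32_t>(len == 1 ? lead : lead & (0xFF >> (len + 1)));
            for (std::size_t k = 1; k < len; ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(utf8_[i + k]) & 0x3F);
            out.push_back(cp);
            i += len;
        }
        return out;
    }

private:
    std::string_view utf8_;
};
```

With this shape, `for (auto x : tv)` does not compile, so a caller has to write `tv.to_code_units()` or `tv.to_code_points()` and thereby state which layer is meant.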

tahonermann commented 6 years ago

I'd say text and text_view shouldn't have begin and end functions. They should have to_code_units, to_code_points, to_grapheme_clusters that return classes with unambiguous names and purposes.

I appreciate the clarity of intent in providing such an interface, but in practice, I think it would make for a cumbersome type to use. Programmers that aren't experts in Unicode don't want to have to worry about these distinctions; and in fact, worrying about these could be a distraction from what they are actually trying to get done. I also worry about what a text_iterator is in this model. If all of the encoding layers are on equal ground, can I safely and efficiently convert iterators across them? If I have a code point iterator, how do I call a function that was written to take a grapheme cluster iterator? The answer can't be "don't do that, pass a text_view object around instead", because how do I construct a text_view object from code unit or code point iterators with appropriate assurances that a grapheme cluster hasn't been split?

I believe that, to reach most programmers, we need to provide simple types that, for most purposes, just do the right thing by default, but expose the underlying data as needed for experts. Think of a need to search some text for a particular "character". Let's say the character to match is a member of the basic source character set, 'X' for example. If the programmer has to be aware that 'X' can have combining code points and that the grapheme cluster interfaces must therefore be used unless matching a base character with combining character(s) is desired, then we've already lost. We need to ensure that the result of something like find(t, 'X') is 1) useful, and 2) the right choice for most purposes. I think the answer is that such a function should 1) match grapheme clusters (not code points, certainly not code units), and 2) return a character_reference (or similar) type from which a grapheme cluster iterator can be obtained (and then lowered to a code point or code unit iterator if needed).
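
The difference between the two kinds of search can be sketched as follows. This is an illustration under a heavy assumption, not the proposed interface: only code points in U+0300..U+036F are treated as extending a cluster, whereas real segmentation follows UAX #29. All function names here are hypothetical, and plain offsets stand in for the character_reference type mentioned above.

```cpp
#include <cstddef>
#include <string_view>

// Drastic simplification (assumption): only U+0300..U+036F (Combining
// Diacritical Marks) are treated as extending the previous cluster.
// Real segmentation follows UAX #29 and needs Unicode property data.
constexpr bool extends_cluster(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

constexpr std::size_t not_found = static_cast<std::size_t>(-1);

// Code point level search: reports the first position holding `needle`,
// even if combining marks follow it.
constexpr std::size_t find_code_point(std::u32string_view text, char32_t needle) {
    for (std::size_t i = 0; i < text.size(); ++i)
        if (text[i] == needle) return i;
    return not_found;
}

// Cluster level search: reports a position only when `needle` forms a
// whole cluster on its own, i.e. no combining mark follows it.
constexpr std::size_t find_cluster(std::u32string_view text, char32_t needle) {
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] != needle) continue;
        bool extended = i + 1 < text.size() && extends_cluster(text[i + 1]);
        if (!extended) return i;
    }
    return not_found;
}

// 'X' with a combining acute accent, a space, then a plain 'X'.
constexpr std::u32string_view sample = U"X\u0301 X";
static_assert(find_code_point(sample, U'X') == 0, "matches the accented X's base");
static_assert(find_cluster(sample, U'X') == 3, "matches only the plain X");
```

The code point search lands on the base of the accented character, which is rarely what a caller searching for 'X' wants; the cluster search skips it.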

Within SG16, consensus has been moving towards making text and text_view grapheme cluster based (perhaps with appropriate abstractions for legacy encoding support), but with access to the code unit and code point sequence exposed and the ability to manipulate the code unit sequence (preferably by lowering a grapheme cluster or code point iterator to a code unit iterator so that boundaries are maintained properly for all encoding levels).

ghost commented 6 years ago

Programmers that aren't experts in Unicode don't want to have to worry about these distinctions

Yes, iterating grapheme clusters would be the least surprising behavior to novice programmers. This would be a rare occurrence of a string type not being broken.

If all of the encoding layers are on equal ground, can I safely and efficiently convert iterators across them? If I have a code point iterator, how do I call a function that was written to take a grapheme cluster iterator?

Converting to a lower level is trivial, while converting to a higher level is not. We will need helper functions to do this.
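
The asymmetry can be sketched for the code unit and code point layers, assuming valid UTF-8. The type `code_point_position` and both function names are hypothetical; raising to a grapheme cluster boundary would additionally need Unicode segmentation data and is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical code point position: stored as the offset of its first code unit.
struct code_point_position {
    std::size_t first_code_unit;
};

// Lowering is trivial: the position already names a code unit.
inline std::size_t lower_to_code_unit(code_point_position p) {
    return p.first_code_unit;
}

// Raising needs work: an arbitrary code unit offset may fall inside a
// multi-byte sequence, so step back over continuation bytes (10xxxxxx)
// until the lead byte of the enclosing code point is reached.
inline code_point_position raise_to_code_point(std::string_view utf8, std::size_t offset) {
    assert(offset <= utf8.size());  // precondition: offset is within the text
    while (offset > 0 && offset < utf8.size() &&
           (static_cast<unsigned char>(utf8[offset]) & 0xC0) == 0x80)
        --offset;
    return code_point_position{offset};
}
```

For the UTF-8 text `"a\xE2\x82\xACz"` ('a', U+20AC, 'z'), raising offsets 1, 2, and 3 all yield the position starting at offset 1, while lowering that position returns 1 with no inspection of the text at all.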

If I have a code point iterator, how do I call a function that was written to take a grapheme cluster iterator? The answer can't be "don't do that, pass a text_view object around instead", because how do I construct a text_view object from code unit or code point iterators with appropriate assurances that a grapheme cluster hasn't been split?

Consider std algorithms such as std::for_each, std::distance, std::rotate, etc. They need a pair of iterators (or, in the future, a range); they don't need to know what text is or what its layers are. We will need a type for each layer so that the code can stay generic.
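
As an illustration of a per-layer type that generic code can consume, here is a rough sketch of a code point view over UTF-8. The name `code_point_view_sketch` is hypothetical and the input is assumed to be valid. Because dereferencing yields a decoded value and not a reference, the iterator is only an input iterator: it works with std::distance and std::for_each, but not with std::rotate, which needs mutable forward iterators.

```cpp
#include <cstddef>
#include <iterator>
#include <string_view>

// Hypothetical view: exposes UTF-8 (assumed valid) as a range of code points.
class code_point_view_sketch {
public:
    class iterator {
    public:
        // Input iterator only: operator* returns a decoded value, not a reference.
        using iterator_category = std::input_iterator_tag;
        using value_type = char32_t;
        using difference_type = std::ptrdiff_t;
        using pointer = const char32_t*;
        using reference = char32_t;

        iterator(const char* pos, const char* end) : pos_(pos), end_(end) {}

        char32_t operator*() const {
            std::size_t len = length();
            unsigned char lead = static_cast<unsigned char>(*pos_);
            char32_t cp = static_cast<char32_t>(len == 1 ? lead : lead & (0xFF >> (len + 1)));
            for (std::size_t k = 1; k < len; ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(pos_[k]) & 0x3F);
            return cp;
        }
        iterator& operator++() { pos_ += length(); return *this; }
        iterator operator++(int) { iterator old = *this; ++*this; return old; }
        friend bool operator==(const iterator& a, const iterator& b) { return a.pos_ == b.pos_; }
        friend bool operator!=(const iterator& a, const iterator& b) { return !(a == b); }

    private:
        // Sequence length from the lead byte, clamped to the bytes remaining.
        std::size_t length() const {
            unsigned char lead = static_cast<unsigned char>(*pos_);
            std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
            std::size_t left = static_cast<std::size_t>(end_ - pos_);
            return len < left ? len : left;
        }
        const char* pos_;
        const char* end_;
    };

    explicit code_point_view_sketch(std::string_view utf8) : utf8_(utf8) {}
    iterator begin() const { return iterator(utf8_.data(), utf8_.data() + utf8_.size()); }
    iterator end() const {
        return iterator(utf8_.data() + utf8_.size(), utf8_.data() + utf8_.size());
    }

private:
    std::string_view utf8_;
};
```

Given `code_point_view_sketch v("e\xCC\x81");`, the call `std::distance(v.begin(), v.end())` counts code points without the algorithm knowing anything about UTF-8.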

If the programmer has to be aware that 'X' can have combining code points and that the grapheme cluster interfaces must therefore be used unless matching a base character with combining character(s) is desired, then we've already lost.

No, the programmer lost. I don't want to hide bugs until a later time. Look at what raw pointers and basic_string have done: an endless number of bugs that cost an insane amount of money and manpower to maintain. Again, yes, if text_view is just a novice-friendly name for grapheme_cluster_sequence_view then I'm fine.

I've implemented CodePointSequence for my purposes, and after looking at Boost.Text and the text_view paper I think this design is the most promising:

template <TextEncoding ET, std::endian Endianness = std::endian::native, typename Allocator = std::allocator<std::byte>>
class code_unit_sequence;

template <TextEncoding ET, std::endian Endianness = std::endian::native>
class code_unit_sequence_view;

template <typename T>
concept bool CodeUnitSequence();
template <typename T>
concept bool CodeUnitSequenceView();

template <CodeUnitSequence Container, TextEncoding ET = default_encoding_type_t<Container>>
class code_point_sequence;

template <CodeUnitSequenceView VT, TextEncoding ET = default_encoding_type_t<VT>>
class code_point_sequence_view;

I think having separate big-endian and little-endian encodings is not useful. Endianness matters only at the byte level so there should be class templates that handle it. code_unit_sequence and code_unit_sequence_view are such classes. basic_string can be fine too although it would be limited to native endianness (or extremely complex char_traits that nobody will bother with). The rest of the code should be completely agnostic to endianness.
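
The idea that byte order is resolved once, at the byte level, can be sketched like this. It is not the proposed code_unit_sequence: `byte_order` is a hypothetical stand-in for std::endian (so the sketch does not depend on `<bit>`), and `read_utf16_code_units` is an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for std::endian.
enum class byte_order { little, big };

// Assemble UTF-16 code units from raw bytes. Byte order is resolved here and
// nowhere else; code above this layer only ever sees char16_t values.
inline std::vector<char16_t> read_utf16_code_units(const std::vector<std::uint8_t>& bytes,
                                                   byte_order order) {
    assert(bytes.size() % 2 == 0);  // precondition: whole code units only
    std::vector<char16_t> units;
    units.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        std::uint8_t first = bytes[i];
        std::uint8_t second = bytes[i + 1];
        char16_t unit = order == byte_order::big
            ? static_cast<char16_t>((first << 8) | second)
            : static_cast<char16_t>((second << 8) | first);
        units.push_back(unit);
    }
    return units;
}
```

The bytes `{0x20, 0xAC}` read as big-endian and `{0xAC, 0x20}` read as little-endian both produce the single code unit 0x20AC, so a code point layer built on top needs no endianness parameter.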

code_point_sequence builds on top of the CodeUnitSequence concept. It will provide bidirectional iterators that return a proxy type, which will reallocate the underlying buffer if the number of code units at the referenced code point position changes.

code_point_sequence_view is trivial.

rmartinho commented 6 years ago

Yes, iterating grapheme clusters would be the least surprising behavior to novice programmers. This would be a rare occurrence of a string type not being broken.

Funnily enough, Swift does this, and their string type is currently broken because of it https://bugs.swift.org/browse/SR-375.

ghost commented 6 years ago

I think that a string type that has .characters.count is fundamentally broken.

Also I wanted to say that I would like to implement code_unit_sequence and code_point_sequence and produce a formal paper. I just want a blessing.

rmartinho commented 6 years ago

I think that a string type that has .characters.count is fundamentally broken.

Well, that's just the most trivial way of demonstrating how it's broken. Iterating over Swift strings also produces similarly broken results.

Also I wanted to say that I would like to implement code_unit_sequence and code_point_sequence and produce a formal paper. I just want a blessing.

Come join us on Slack https://cpplang.slack.com/messages/sg16-unicode, or the mailing list http://www.open-std.org/mailman/listinfo/unicode, or even join our next teleconference http://www.open-std.org/pipermail/unicode/2018-June/000037.html

tahonermann commented 6 years ago

I just want a blessing.

@Lyberta No blessing is necessary of course. We (SG16) have been wrestling with this question for a while now, but haven't made any decisions one way or another. The next pre-meeting mailing is quite some time away and I do plan on scheduling time to discuss code points vs grapheme clusters at our meetings in the not too distant future. So, I'll echo what Martinho said; join us on Slack, the mailing list, and our telecons (invite info is on the mailing list and I can send you an invite on request if you like). You'll get more immediate feedback and be better able to contribute to our direction than by writing a paper (at least in the short term). I do think we'll want to write a paper on this subject at some point, but I think it would be great if it were collaboratively developed within SG16 with a goal of presenting an agreed upon approach with pros/cons to the rest of the committee.

tahonermann commented 4 years ago

I'm closing this issue as non-actionable since there does not appear to be consensus for a particular direction. The concerns raised will need to be addressed as part of https://github.com/sg16-unicode/sg16/issues/31. Anyone wishing to propose a specific solution is encouraged to open a new issue or to submit a paper.