leb-kuchen opened this issue 1 month ago
This belongs in x/text. It's too wrapped up in Unicode details to belong in the low-level strings package.
The rules are actually not that difficult. It is just this table I've implemented.
https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules
currRune (× | ÷) nextRune
Unicode features like case conversion also have caveats and are already in std. I think graphemes are useful enough to belong in unicode or strings.
Can you point to some code that would use these new functions? Bear in mind https://go.dev/doc/faq#x_in_std. Thanks.
> The rules are actually not that difficult. It is just this table I've implemented. https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules
> currRune (× | ÷) nextRune
> Unicode standards like case also have caveats and are in std. I think graphemes are useful enough to be unicode or strings
That table alone exceeds the conceptual complexity of the rest of the strings package, which must not depend on the Unicode character class tables. The right place for this change (if anywhere) is the unicode package. I'll retitle the issue to redirect the most obvious criticism.
I wrote github.com/apparentlymart/go-textseg/textseg to fill this gap for some of my own projects. Unfortunately, because my repository contains some Unicode-licensed content, pkg.go.dev won't render the docs :man_shrugging:, but the list of importing modules might be interesting to help answer the question of what kind of code might make use of this.
A big chunk of the early work there was dealing with the fact that the relevant character tables were not already exported from anywhere as unicode.RangeTable values. I did include unicode.RangeTable values generated from the source data, but since I was generating things anyway I ended up implementing the actual tokenizer as a Ragel-language transform of the raw data rather than using the range tables, since that allowed handling the UTF-8 recognition and grapheme boundary recognition all at once in a single state machine.
I also have a suite of tests that were mechanically generated from the test data provided by Unicode, in case that's useful for cross-checking a new implementation.
If this were in the standard library or in x/text then I would likely cease development of my module. However, developing that module has not been a major workload since usually it's just a matter of obtaining the latest version of the tables from Unicode and re-running the generators. There was one major version of Unicode that significantly changed the algorithm, but since then only the tables have changed.
A lot of software that is difficult to get right in Go today could be enabled by this. It really feels like an oversight given the great support for runes: many developers reach for runes when grapheme clusters are actually what they need, and thus write buggy software, even in enterprise-level projects.
Proposal Details
I propose to add the functions Graphemes and GraphemesReversed to the package unicode (originally proposed for the packages strings and bytes; edited by adonovan, Oct 8). #14820 proposes to add such functionality to x/text, but iterators have made these functions easier to write and use, so I think they belong in std.
First, a few tables are needed, but these are worth their space.
These mostly consist of other tables, so the lookup can be optimized. See https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Break_Property_Values. These classes could also be used in regexp; for example, this regex can be used to match a grapheme:
\p{gcb=CR}\p{gcb=LF}|\p{gcb=Control}|\p{gcb=Prepend}*(((\p{gcb=L}*(\p{gcb=V}+|\p{gcb=LV}\p{gcb=V}*|\p{gcb=LVT})\p{gcb=T}*)|\p{gcb=L}+|\p{gcb=T}+)|\p{gcb=RI}\p{gcb=RI}|\p{Extended_Pictographic}(\p{gcb=Extend}*\p{gcb=ZWJ}\p{Extended_Pictographic})*|[^\p{gcb=Control}\p{gcb=CR}\p{gcb=LF}])[\p{gcb=Extend}\p{gcb=ZWJ}\p{gcb=SpacingMark}]*|\p{any}
These are some constants and a helper function.
And then you can concatenate the files and generate these tables:
I created a sample implementation that focuses on readability and understandability, so I left out caching and ASCII optimizations.