golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.66k stars 17.62k forks source link

proposal: unicode: Graphemes and GraphemesReversed #69804

Open leb-kuchen opened 2 weeks ago

leb-kuchen commented 2 weeks ago

Proposal Details

I propose to add the functions Graphemes and GraphemesReversed to the packages strings and bytes unicode [Edited by adonovan, Oct 8].

14820 proposes to add such functionality to /x/text, but iterators have made this function easier to write and use, so I think these function belong in std.

First, a few tables are needed, but these are worth their space.

These mostly consist of other tables, so it can be optimized. https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Break_Property_Values Also these can be used in regexp, for example this regex can used to match a grapheme: \p{gcb=CR}\p{gcb=LF}|\p{gcb=Control}|\p{gcb=Prepend}*(((\p{gcb=L}*(\p{gcb=V}+|\p{gcb=LV}\p{gcb=V}*|\p{gcb=LVT})\p{gcb=T}*)|\p{gcb=L}+|\p{gcb=T}+)|\p{gcb=RI}\p{gcb=RI}|\p{Extended_Pictographic}(\p{gcb=Extend}*\p{gcb=ZWJ}\p{Extended_Pictographic})*|[^\p{gcb=Control}\p{gcb=CR}\p{gcb=LF}])[\p{gcb=Extend}\p{gcb=ZWJ}\p{gcb=SpacingMark}]*|\p{any}.

// https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt
var PREPEND = &unicode.RangeTable{}
var CONTROL = &unicode.RangeTable{}
var EXTEND = &unicode.RangeTable{}
var SPACING_MARK = &unicode.RangeTable{}
var REGIONAL_INDICATOR = &unicode.RangeTable{}
var L = &unicode.RangeTable{}
var V = &unicode.RangeTable{}
var T = &unicode.RangeTable{}
var LV = &unicode.RangeTable{}
var LVT = &unicode.RangeTable{}
// https://www.unicode.org/Public/16.0.0/ucd/emoji/emoji-data.txt
var EXTENDED_PICTOGRAPHIC = &unicode.RangeTable{}
// https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt
var INCB_LINKER = &unicode.RangeTable{}
var INCB_CONSONANT = &unicode.RangeTable{}
var INCB_EXTEND = &unicode.RangeTable{}

These are some constants and a helper function.

func RunesReversed(s string) iter.Seq2[int, rune] {
    return func(yield func(int, rune) bool) {
        for i := len(s); i > 0; {
            r, size := utf8.DecodeLastRuneInString(s[0:i])
            i -= size
            if !yield(i, r) {
                return
            }
        }
    }
}

const LF = '\n'
const CR = '\r'
const ZWJ = '\u200d'

And then you can concatenate the files and generate these tables:

var re = regexp.MustCompile(`(?m)^(?<startRange>[[:xdigit:]]+)\.{0,2}(?<endRange>[[:xdigit:]]+)?\s*;\s*(?<property>\w+)(?:\s*;s*(?<subProperty>\w+))?`)

func gen() {
    matches := re.FindAllStringSubmatch(GBP_TXT, -1)
    for _, match := range matches {
        startRange, err1 := strconv.ParseUint(match[1], 16, 32)
        endRange, err2 := strconv.ParseUint(match[2], 16, 32)
        if err1 != nil {
            panic("should not be")
        }
        if err2 != nil {
            endRange = startRange
        }
        newRange := make([]rune, 0)
        for r := startRange; r <= endRange; r++ {
            newRange = append(newRange, rune(r))
        }
        var rangeTable *unicode.RangeTable

        switch match[3] {
        case "InCb":
            switch match[4] {
            case "Consonant", "Extend", "Linker":
            default:
                continue
            }
        case "Prepend":
            rangeTable = PREPEND
        case "Control":
            rangeTable = CONTROL
        case "Extend":
            rangeTable = EXTEND
        case "Regional_Indicator":
            rangeTable = REGIONAL_INDICATOR
        case "SpacingMark":
            rangeTable = SPACING_MARK
        case "L":
            rangeTable = L
        case "V":
            rangeTable = V
        case "T":
            rangeTable = T
        case "LV":
            rangeTable = LV
        case "LVT":
            rangeTable = LVT
        case "Extended_Pictographic":
            rangeTable = EXTENDED_PICTOGRAPHIC
        default:
            continue
        }
        newRangeTable := rangetable.New(newRange...)
        *rangeTable = *rangetable.Merge(rangeTable, newRangeTable)
    }
}

I created a sample implementation that focuses on easy readability and understandability, so I left out caching and ASCII optimizations.

func Graphemes(s string) iter.Seq2[int, string] {
    return func(yield func(int, string) bool) {
        if s == "" {
            return
        }
        currIdx := 0
        lowIdx := 0
        currSize := 0
    loop:
        for ; ; currIdx += currSize {
            currRune, size := utf8.DecodeRuneInString(s[currIdx:])
            currSize = size
            nextIdx := currIdx + currSize
            if nextIdx >= len(s) {
                goto ret
            }
            nextRune, _ := utf8.DecodeRuneInString(s[nextIdx:])
            if currRune == CR && nextRune == LF {
                continue
            }
            if unicode.Is(CONTROL, currRune) || currRune == CR || currRune == LF {
                goto isBreak
            }
            if unicode.Is(CONTROL, nextRune) || nextRune == CR || nextRune == LF {
                goto isBreak
            }
            switch {
            case unicode.In(currRune, L) && unicode.In(nextRune, L, V, LV, LVT):
                continue
            case unicode.In(currRune, L, V) && unicode.In(nextRune, V, T):
                continue
            case unicode.In(currRune, LVT, T) && unicode.In(nextRune, T):
                continue
            }
            if unicode.Is(EXTEND, nextRune) || nextRune == ZWJ || unicode.Is(SPACING_MARK, nextRune) {
                continue
            }
            if unicode.Is(PREPEND, currRune) {
                continue
            }
            if unicode.Is(INCB_CONSONANT, nextRune) {
                inCbLinkerCount := 0
                for _, prevRune := range RunesReversed(s[:nextIdx]) {
                    if unicode.Is(INCB_LINKER, prevRune) {
                        inCbLinkerCount += 1
                        continue
                    }
                    if unicode.Is(INCB_EXTEND, prevRune) {
                        continue
                    }
                    if unicode.Is(INCB_CONSONANT, prevRune) && inCbLinkerCount > 0 {
                        continue loop
                    }
                    break
                }
            }

            if currRune == ZWJ && unicode.Is(EXTENDED_PICTOGRAPHIC, nextRune) {
                for _, prevRune := range RunesReversed(s[:currIdx]) {
                    if unicode.Is(EXTENDED_PICTOGRAPHIC, prevRune) {
                        continue loop
                    }
                    if unicode.Is(EXTEND, prevRune) {
                        continue
                    }
                    break
                }
            }
            if unicode.Is(REGIONAL_INDICATOR, currRune) && unicode.Is(REGIONAL_INDICATOR, nextRune) {
                riCount := 1
                for _, prevRune := range RunesReversed(s[:currIdx]) {
                    if unicode.Is(REGIONAL_INDICATOR, prevRune) {
                        riCount += 1
                        continue
                    }
                    break
                }
                if riCount%2 != 0 {
                    continue
                }
            }
            goto isBreak
        isBreak:
            if !yield(lowIdx, s[lowIdx:nextIdx]) {
                return
            }
            lowIdx = nextIdx
        }
    ret:
        yield(lowIdx, s[lowIdx:])

    }
}
func GraphemesReversed(s string) iter.Seq2[int, string] { /* The same, but backwards  */}
gabyhelp commented 2 weeks ago

Related Issues and Documentation

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

robpike commented 2 weeks ago

This belongs in x/text. It's too wrapped up in Unicode details to belong in the low-level strings package.

leb-kuchen commented 2 weeks ago

The rules are actually not that difficult. It is just this table I've implemented. https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules currRune (×| ÷) nextRune Unicode standards like case also have caveats and are in std. I think graphemes are useful enough to be unicode or strings

ianlancetaylor commented 2 weeks ago

Can you point to some code that would use these new functions? Bear in mind https://go.dev/doc/faq#x_in_std. Thanks.

adonovan commented 2 weeks ago

The rules are actually not that difficult. It is just this table I've implemented. https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules currRune (×| ÷) nextRune Unicode standards like case also have caveats and are in std. I think graphemes are useful enough to be unicode or strings

That table alone exceeds the conceptual complexity of the rest of the strings package, which must not depend on the Unicode character class tables. The right place for this change (if anywhere) is the unicode package. I'll retitle the issue to redirect the most obvious criticism.

apparentlymart commented 1 week ago

I wrote github.com/apparentlymart/go-textseg/textseg to fill this gap for some of my own projects. Unfortunately because my repository contains some Unicode-licensed content pkg.go.dev won't render the docs :man_shrugging:, but the list of importing modules might be interesting to help answer the question of what kind of code might make use of this.

A big chunk of the early work there was dealing with the fact that the relevant character tables were not already exported from anywhere as unicode.RangeTable. I did include unicode.RangeTable values generated from the source data, but since I was generating things anyway I ended up implementing the actual tokenizer in terms of a Ragel-language transform of the raw data, rather than using the range tables, since that allowed dealing with the UTF-8 recognition and grapheme boundary recognition all at once in a single state machine.

I also have a suite of tests that were mechanically generated from the test data provided by Unicode, in case that's useful for cross-checking a new implementation.

If this were in the standard library or in x/text then I would likely cease development of my module. However, developing that module has not been a major workload since usually it's just a matter of obtaining the latest version of the tables from Unicode and re-running the generators. There was one major version of Unicode that significantly changed the algorithm, but since then only the tables have changed.