golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

unicode/utf16: Add example on how to use utf16.DecodeRune #65498

Open soypat opened 7 months ago

soypat commented 7 months ago

Go version

go version go1.21.4 linux/amd64

Output of go env in your module/workspace:

N/A

What did you do?

Opened https://pkg.go.dev/unicode/utf16#DecodeRune

What did you see happen?

No examples on the page.

What did you expect to see?

An example on how to use DecodeRune to decode a []uint16 without performing heap allocations, similar to how utf8.DecodeRune works.
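
For illustration, a minimal sketch of what such an example might look like. The helper name `appendUTF8` is hypothetical and not part of unicode/utf16. It relies only on `utf16.IsSurrogate`, `utf16.DecodeRune` and `utf8.AppendRune`, and replaces unpaired surrogates with U+FFFD. It assumes the caller passes a destination slice with enough capacity, in which case the slice never has to grow:

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// appendUTF8 is a hypothetical helper (not part of unicode/utf16). It decodes
// the UTF-16 code units in src and appends their UTF-8 encoding to dst.
// Unpaired surrogates are replaced with U+FFFD.
func appendUTF8(dst []byte, src []uint16) []byte {
	for i := 0; i < len(src); i++ {
		r := rune(src[i])
		if utf16.IsSurrogate(r) {
			// Surrogates are only valid as a high/low pair.
			// DecodeRune returns U+FFFD when the pair is not valid.
			r = utf8.RuneError
			if i+1 < len(src) {
				if d := utf16.DecodeRune(rune(src[i]), rune(src[i+1])); d != utf8.RuneError {
					r = d
					i++ // the low surrogate was consumed as well
				}
			}
		}
		dst = utf8.AppendRune(dst, r)
	}
	return dst
}

func main() {
	src := []uint16{0x0048, 0x0069, 0xD83D, 0xDE00} // "Hi" followed by U+1F600
	var buf [16]byte                                // fixed-size backing array
	out := appendUTF8(buf[:0], src)
	fmt.Printf("%q\n", out)
}
```

Whether the backing array actually stays off the heap depends on the compiler's escape analysis, so this is a sketch of the approach and not a guarantee.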

soypat commented 7 months ago

From my understanding of the package, this is what I came up with to convert between UTF-8 and UTF-16 in both directions. I had to dig into the utf16 package internals and copy some of them to write this code, specifically for the first function.

Leaving them here since they would be nice additions to show how to use the utf16 package with its more commonly used counterpart, utf8.


```go
import (
    "encoding/binary"
    "errors"
    "unicode/utf16"
    "unicode/utf8"
)

// encodeUTF16to8 converts UTF-16 bytes in srcUTF16 (in byte order order16)
// to UTF-8 in dstUTF8 and returns the number of bytes written to dstUTF8.
func encodeUTF16to8(dstUTF8, srcUTF16 []byte, order16 binary.ByteOrder) (int, error) {
    // UTF16 values.
    const (
        // 0xd800-0xdc00 encodes the high 10 bits of a pair.
        // 0xdc00-0xe000 encodes the low 10 bits of a pair.
        // the value is those 20 bits plus 0x10000.
        surr1 = 0xd800
        surr2 = 0xdc00
        surr3 = 0xe000
    )
    n := 0
    for len(srcUTF16) > 0 {
        slen := len(srcUTF16)
        if slen < 2 {
            // A trailing odd byte cannot form a 16-bit code unit.
            return n, errors.New("invalid utf16")
        }
        r1 := rune(order16.Uint16(srcUTF16))
        var r2 rune
        if slen >= 4 {
            r2 = rune(order16.Uint16(srcUTF16[2:]))
        }
        var ar rune
        switch {
        case r1 < surr1, surr3 <= r1:
            // normal rune
            ar = r1
            srcUTF16 = srcUTF16[2:]
        case surr1 <= r1 && r1 < surr2 && slen >= 4 &&
            surr2 <= r2 && r2 < surr3:
            // valid surrogate sequence
            ar = utf16.DecodeRune(r1, r2)
            srcUTF16 = srcUTF16[4:]
        default:
            // invalid surrogate sequence
            return n, errors.New("invalid utf16")
        }
        // Encode the rune into UTF-8.
        if utf8.RuneLen(ar) > len(dstUTF8[n:]) {
            return n, errors.New("insufficient utf8 buffer")
        }
        n += utf8.EncodeRune(dstUTF8[n:], ar)
    }
    return n, nil
}
```

```go
// encodeUTF8to16 converts the UTF-8 bytes in src8 to UTF-16 bytes in dst16
// (in byte order order16) and returns the number of bytes written to dst16.
func encodeUTF8to16(dst16, src8 []byte, order16 binary.ByteOrder) (int, error) {
    n := 0
    for len(src8) > 0 {
        r1, size := utf8.DecodeRune(src8)
        src8 = src8[size:]
        switch {
        case r1 >= 0x10000:
            // Surrogate pair case: runes above the BMP need two code units.
            // (utf16.IsSurrogate reports surrogate code points, which
            // utf8.DecodeRune never returns, so it cannot be used here.)
            if len(dst16[n:]) < 4 {
                return n, errors.New("insufficient utf16 buffer")
            }
            hi, lo := utf16.EncodeRune(r1)
            order16.PutUint16(dst16[n:], uint16(hi))
            order16.PutUint16(dst16[n+2:], uint16(lo))
            n += 4
        default:
            // General case: the rune fits in a single code unit.
            if len(dst16[n:]) < 2 {
                return n, errors.New("insufficient utf16 buffer")
            }
            order16.PutUint16(dst16[n:], uint16(r1))
            n += 2
        }
    }
    return n, nil
}
```

robpike commented 7 months ago

While I appreciate your desire to avoid heap allocation, all the uses of unicode/utf16 do the obvious conversion from []rune returned by this package into a string. It's easy and fast and very little code. If there's a bottleneck there, I'd like to see it in real life.

The unicode/utf8 package does not have cross-conversions like the one you suggest, although to be fair it doesn't really need them as the language supports that encoding directly.

In short, there seems little need for the routines you propose to add to the library.

I do believe that examples would be nice, but they should demonstrate the idiomatic conversion that everyone seems to use and not the complex code you show here.
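
As a rough sketch of that idiomatic conversion: `utf16.Decode` returns a `[]rune`, which converts directly to a string, and `utf16.Encode` goes the other way. The wrapper names `utf16ToString` and `stringToUTF16` are hypothetical and exist only for illustration:

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16ToString is a hypothetical wrapper. It converts UTF-16 code units
// to a Go string by way of the []rune that utf16.Decode returns.
func utf16ToString(u []uint16) string {
	return string(utf16.Decode(u))
}

// stringToUTF16 is a hypothetical wrapper. It converts a Go string
// to UTF-16 code units with utf16.Encode.
func stringToUTF16(s string) []uint16 {
	return utf16.Encode([]rune(s))
}

func main() {
	u := stringToUTF16("héllo, 世界")
	fmt.Println(len(u), utf16ToString(u))
}
```

Both directions allocate an intermediate `[]rune`, which is the cost discussed in this thread.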

soypat commented 7 months ago

Just to clarify my poorly worded comment: I meant add the routines as an example so that they appear in pkg.go.dev.

As it turns out, there is no bottleneck in, say, "real" Go code. As you say, Go's garbage collector is state of the art and doing the obvious conversion would most likely work fine. The issue lies in allocating with TinyGo on a microcontroller, where RAM is very limited and memory can easily become fragmented and eventually crash your program.

I understand TinyGo is not Go and that a more elegant fix would be to create a more robust GC in TinyGo, but that is a daunting task.

All this said, I'm not proposing to add these UTF-16/UTF-8 conversion routines to the package, only to include them as usage examples. There is, however, one part of the utf16 internals I'd very much like exposed. I've created a proposal here: https://github.com/golang/go/issues/65511

Edit: I've noticed that adding the routine proposed in #65511 would simplify one of the conversion functions greatly:

```go
func encodeUTF16to8(dstUTF8, srcUTF16 []byte, order16 binary.ByteOrder) (int, error) {
    n := 0
    for len(srcUTF16) > 1 {
        r, size := utf16.DecodeBytes(srcUTF16, order16)
        if r == utf8.RuneError {
            return n, errors.New("invalid utf16 sequence")
        }
        srcUTF16 = srcUTF16[size:]
        n += utf8.EncodeRune(dstUTF8[n:], r)
    }
    return n, nil
}
```