golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

proposal: utf8: RuneStartLen to get the length of the rune from the first byte #68716

Open aymanbagabas opened 3 months ago

aymanbagabas commented 3 months ago

Proposal Details

I find myself needing a function that reports how many bytes a UTF-8 encoded rune occupies, given only its first byte, when iterating over the bytes of a string. Following RFC 3629, we could implement something like utf8.RuneStartLen(b byte) int.

Zig and Rust both provide this functionality. Go could offer something similar.

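// RuneStartLen reports the length in bytes of the encoded rune that begins
// with b, judged from b alone. It returns a value from 1 to 4, or -1 if b
// cannot be the first byte of a UTF-8 sequence.
//
// It classifies b by its leading bits only, so 0xC0, 0xC1 and 0xF5-0xF7,
// which RFC 3629 never permits, are still reported as 2 or 4.
func RuneStartLen(b byte) int {
    if b <= 0b0111_1111 { // 0x00-0x7F
        return 1
    } else if b >= 0b1111_1000 { // 0xF8-0xFF: never a first byte
        return -1
    } else if b >= 0b1111_0000 { // 0xF0-0xF7
        return 4
    } else if b >= 0b1110_0000 { // 0xE0-0xEF
        return 3
    } else if b >= 0b1100_0000 { // 0xC0-0xDF
        return 2
    }
    return -1 // 0x80-0xBF: continuation byte
}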
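As a rough illustration of the use case described above, the sketch below walks a string byte by byte and slices out each encoded rune using only the length implied by its first byte. `runeStartLen` is a local, hypothetical stand-in for the proposed function (the `unicode/utf8` package is assumed not to export one), and `splitRunes` is likewise an illustrative helper, not part of the proposal. Neither validates continuation bytes.

```go
package main

import "fmt"

// runeStartLen is a hypothetical local stand-in for the proposed
// utf8.RuneStartLen. It classifies b by its leading bits only.
func runeStartLen(b byte) int {
	switch {
	case b <= 0x7F: // 0xxxxxxx: ASCII
		return 1
	case b >= 0xF8: // 11111xxx: never a first byte
		return -1
	case b >= 0xF0: // 11110xxx
		return 4
	case b >= 0xE0: // 1110xxxx
		return 3
	case b >= 0xC0: // 110xxxxx
		return 2
	}
	return -1 // 10xxxxxx: continuation byte
}

// splitRunes (illustrative helper) slices s into encoded runes using only
// the length implied by each first byte. It reports false when it meets a
// byte that cannot start a rune or a sequence truncated by the end of s.
// Continuation bytes are not checked.
func splitRunes(s string) ([]string, bool) {
	var out []string
	for i := 0; i < len(s); {
		n := runeStartLen(s[i])
		if n < 0 || i+n > len(s) {
			return out, false
		}
		out = append(out, s[i:i+n])
		i += n
	}
	return out, true
}

func main() {
	// One rune of each encoded length: 1, 2, 3 and 4 bytes.
	parts, ok := splitRunes("aé€😀")
	fmt.Println(len(parts), ok) // prints "4 true"
}
```

For fully validated decoding, `utf8.DecodeRuneInString` remains the safer choice; the sketch only shows where a first-byte length lookup would slot into a byte-oriented loop.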

adonovan commented 3 months ago

This is a reasonable function, but it is rarely needed except by clients that are doing something unusually sophisticated, and it's a trivial consequence of the four constants that appear in the compact pictorial summary of UTF-8 found in any document on the subject--especially if you simplify each else if cond1 && cond2 to else if cond2. (Each first condition is trivially true as a consequence of the control flow.)