mattn / go-runewidth

wcwidth for golang
MIT License
608 stars 92 forks source link

Width is 1 when it should be 2 #59

Open rivo opened 2 years ago

rivo commented 2 years ago

I stumbled over a character that, when output to the console directly, takes up two characters. But StringWidth() gives me 1. This is because the first rune of this character has a width of 1 and that's what's being used, see here. I know I wrote this code and I'm sure that you cannot simply add up the widths of individual runes ("🏳️‍🌈" would then have a width of 4 which is obviously wrong) and using the first rune's width worked fine so far. But it turns out that it fails in some cases.

I'm not familiar with Indian characters but it seems to me that the second rune is a modifier that turns the character from a width of 1 into a width of 2. Are you aware of any logic that we could add to go-runewidth that makes this right?

Here's example code that illustrates the issue:

package main

import (
    "fmt"

    runewidth "github.com/mattn/go-runewidth"
)

func main() {
    s := "खा"
    fmt.Println("0123456789")
    fmt.Println(s + "<")
    fmt.Printf("String width: %d\n", runewidth.StringWidth(s))
    var i int
    for _, r := range s {
        fmt.Printf("Rune %s  (%d) width: %d\n", string(r), i, runewidth.RuneWidth(r))
        i++
    }
}

Output (on macOS with iTerm2):

image
dricottone commented 2 years ago

I don't think I've ever used StringWidth. I was introduced to this library by way of tcell and it's demos. You can see here that another strategy is to use RuneWidth in a loop.

And incidentally that code looks similar to the implementation of StringWidth, but without a lot of (probably important) edge case handling.

rivo commented 2 years ago

Yeah, we've been through this before. Just looking at individual runes is not enough to determine the width of a string. Characters can consist of multiple runes ("🏳️‍🌈" is composed of 4 runes and 14 bytes). StringWidth currently uses only the width of the first rune, which is a rule I implemented because I didn't know better. Now we know that this is not correct in some cases. But to be honest, I don't even know where to look for the correct rule. The width of characters on terminal screens is not part of any official specification.

I guess I'll have to dig into wcwidth.c which appears to be used by most terminals.

ericwq commented 2 years ago

runwidth can handle east asia character, but emoj/flag is very different. should be handled differently. https://unicode.org/emoji/charts/emoji-list.html

rivo commented 2 years ago

It seems that runewidth does not handle East Asian characters correctly, as you can see in the initial comment of this thread. I tried to look at other implementations of this but wasn't able to find a good answer as to how other packages solve this issue. The question remains, how these (and emoji) characters should be handled when composed of multiple runes.

mattn commented 2 years ago

खा 0x0916 is not Ambiguous (A). It is Neutral (N).

https://unicode.org/Public/13.0.0/ucd/EastAsianWidth.txt

image

rivo commented 2 years ago

What does this mean for the issue at hand? Can you elaborate?

mattn commented 2 years ago

If the field is W, runewidth.StringWidth return 2. Table of runewidth is generated from EastAsianWidth.txt. So runewidith works correctly, I think.

rivo commented 2 years ago

If the field is W, runewidth.StringWidth return 2.

I don't know what you mean by that. As I mentioned above, it currently returns 1 for the character at hand.

So are you saying we should add up the rune widths of all runes this character is composed of? Then how do we deal with emojis?

(I know I keep repeating myself here but maybe I'm not expressing myself clear enough.)

mattn commented 2 years ago

Ah, sorry. I did misreading. I didn't understand खा have two runes.

ericwq commented 2 years ago

Here is my fix. It's not perfect, but it works for me, tested with combining character, chinese, emoji, flags.

// In the loop, national flag's width got 1+1=2.
func runesWidth(runes []rune) (width int) {
    // quick pass for iso8859-1
    if len(runes) == 1 && runes[0] < 0x00fe {
        return 1
    }

    cond := runewidth.NewCondition()
    cond.StrictEmojiNeutral = false
    cond.EastAsianWidth = true

    // loop for multi rune
    width = 0
    for i := 0; i < len(runes); i++ {
        width += cond.RuneWidth(runes[i])
    }

    return width
}

There is a precondition: You must use uniseg to split it into multi runes.

graphemes := uniseg.NewGraphemes(str)
for graphemes.Next() {
    input = graphemes.Runes()
}
rivo commented 2 years ago

I just tried this and it gives me:

StringWidth("🏳️‍🌈") == 6

where it should be 2.

I think we're all underestimating this topic. If there's one thing I learned, it's that Unicode handling is complicated. Especially for emojis, there are various ways of constructing them, and simply adding up code point widths (or using the width of the first code point, like I suggested earlier) will not work.

I guess I have to do the grunt work and get into all the details.

rivo commented 2 years ago

Ok, so I went down the rabbit hole of calculating the monospace width of strings, like this package attempts. Unfortunately, the whole premise of doing this code point by code point (or, in Golang terms, rune by rune) will never lead to satisfactory results. It works ok for non-English Characters, e.g. 世界 or खा. You can simply add up the width of all individual runes.

But that completely fails for emojis. If you look at Unicode Technical Standard #51, you'll find that emojis can be encoded in tons of different ways. Zero-Width Joiner is just one small part of it. There are modifiers that force text presentation (=width of 1), modifiers that force normal characters like the digit "1" into emoji presentation (=width of 2), there are regional indicators (flags), complex emoji sequences, just to name a few.

To handle these correctly, we need to look at grapheme clusters as a whole. We already attempted this in this package, but it requires a lot more analysis of these clusters than what's happening now. I don't actually know how the original wcwidth handles emojis but similar packages like the Ruby version, janlelis/unicode-display_width, makes that extension optional, i.e. there is special handling for emojis. I also found that many terminals will render some emojis with the wrong width. So I guess wcwidth (presumably used by most of them) still has some flaws.

By now, I've implemented my own version of wcwidth in rivo/uniseg. uniseg already has all the tools in place for grapheme cluster parsing, code point classification etc. You can check out the algorithm here.

As for this package and this issue, I think there is value in being compatible with the original wcwidth functionality, as this will allow us to write Go programs that work on most terminals (assuming most terminals use wcwidth). If someone has the time, please research how wcwidth handles emojis. In my opinion, this would be the optimal solution.

For a quick fix, though, I would suggest the following:

Within a grapheme cluster, add up the widths of all individual runes. But if the first rune is an emoji (Extended Pictographic or Regional Indicator), always return a total width of 2.

This is not perfect but it will work for most use cases.

If you want to do emoji handling in a better way, another quick fix would be:

Within a grapheme cluster, add up the widths of all individual runes. But if the first rune is an emoji (Extended Pictographic or Regional Indicator), use the width returned by rivo/uniseg.

We're already using rivo/uniseg to determine the grapheme clusters so this will come for free. (This will still not handle conjoining Hangul characters correctly but maybe that's ok.)

If you want to get rid of rivo/uniseg again, sure, you can go back to adding up all rune widths again like it was in the beginning of this package. But many emojis will have the wrong width then. It's not something I would suggest but this is not my package, I'm not the one to decide.

rivo commented 2 years ago

I updated PR #63 to reflect this. It's the "quick fix" solution I described above.

ericwq commented 2 years ago

glad to hear that. i will give it a try if i find some time.

mattn commented 2 years ago

Thanks. I'll look into it in later.

ericwq commented 1 year ago

@rivo It works perfectly. Thanks for your contribution!