mattn / go-runewidth

wcwidth for golang
MIT License
609 stars 94 forks source link

RuneWidth does not equal StringWidth #48

Closed markus-oberhumer closed 1 year ago

markus-oberhumer commented 3 years ago

I stumbled over this while working on #47.

It seems that RuneWidth is not always equal to the StringWidth of a single rune.

This is quite unexpected, TBH.

Please see https://github.com/markus-oberhumer/mattn--go-runewidth/commit/5da511d36b1ea1ad913590b7b27357e5fffd3512 for a test case.

markus-oberhumer commented 3 years ago

CC @rivo for feedback if this is intended

rivo commented 3 years ago

I'm not sure if iterating through all runes numerically will give you valid unicode code points. Maybe you can post an example of a valid code point that fails your test. Then we could look at what the problem is.

markus-oberhumer commented 3 years ago

Please try

git clone https://github.com/markus-oberhumer/mattn--go-runewidth.git go-runewidth-test
cd go-runewidth-test
git checkout test-RuneWidthEqualsString1Width
go test 

=> failures in the range [55296, 57343] aka [0xd800, 0xdfff]

I'm aware that these are invalid, but I think we need consistency between RuneWidth and StringWidth here.

rivo commented 3 years ago

[0xd800, 0xdfff] are Unicode surrogate ranges. There are no characters in this range. I don't even know what the "width" of those code points is supposed to be. I don't understand the reason for your requirement.

markus-oberhumer commented 3 years ago

Basically I'm saying that RuneWidth(0xd800) and RuneWidth(utf8.RuneError) do not agree as the former is in the table nonprint while the latter is not.

And because string(rune(0xd800)) is internally converted to string(rune(utf8.RuneError)) we get inconsistencies.

rivo commented 3 years ago

Well, as you point out, the problem here is the conversion. If Go places different runes into the string than what you put in, you end up comparing apples to oranges. I implemented the current version of StringWidth() and it does what it's told: It gives you the width of the string, which consists in this case of 0xfffd. But what you're passing to RuneWidth() is 0xd800 so it's not the same. I would say it's pretty accurate that the test fails because the string and the rune are different. You may want to exclude all test cases where the string conversion results in utf8.RuneError.

On a side note, what one could think about is to add the same handling of invalid characters to RuneWidth(). But that's @mattn's decision so I can't comment on that.

markus-oberhumer commented 1 year ago

I think this issue can be closed.