mattn / go-runewidth

wcwidth for golang
MIT License
609 stars 94 forks source link

Fixed StringWidth() implementation by using proper Unicode grapheme cluster segmentation. Fixes #28 #29

Closed rivo closed 3 years ago

rivo commented 5 years ago

This introduces an implementation of StringWidth() using Unicode grapheme clusters which should be the correct way to split a string into its individual characters. The built-in assumption is that if we have combined runes (emojis, flags etc.), their width is the width of the first non-zero-width rune. Many of these combined runes were previously not handled correctly by this package.

Please note:

mattn commented 5 years ago

Thanks your contribution. I'm thinking there are terminals which does not support ZWJ yet. So I prefer to disable ZWJ.

mattn commented 5 years ago

How to add enable/disable ZWJ?

rivo commented 5 years ago

Using rivo/uniseg means we're doing proper grouping of runes now. There are many more rules than just the handling of a zero-width joiner, see here. As an example, flags such as ๐Ÿ‡ฏ๐Ÿ‡ต consist of two runes but don't contain zero-width joiners, they follow other rules.

If you want to support terminals that don't render these characters, the question is how do they render them? If the difference is that some terminals render characters rune-by-rune, then a flag like UnicodeSegmentation would be more appropriate.

But singling out the zero-width joiner rule (which is only one of 13 rules) is probably not very useful.

Let me know what you think about this.

mattn commented 5 years ago

Most of terminals which does not support ZWJ displays Unicode characters as what we can see. For example, ๐Ÿ‘ช is displayed as ๐Ÿ‘จโ€๐Ÿ‘ฉโ€๐Ÿ‘ฆ.

rivo commented 5 years ago

So you're saying that these terminals support all kinds of Unicode characters (e.g. flags, Hangul, spacing/prepend marks) but ZWJ is the one they don't support?

Do you have an example for such a terminal?

mattn commented 5 years ago

For example, gnome-terminal displays ๐Ÿ‘ช as ๐Ÿ‘จโ€๐Ÿ‘ฉโ€๐Ÿ‘ฆ.

mattn commented 5 years ago

Also ZeroWidthJoiner is already used in some projects.

https://github.com/4avka/tview/blob/d7a9d6c70ab205d578fc64f02f4e681ccb5b5180/util.go#L56

rivo commented 5 years ago

What about แ„…แ…ฌแ†ซ, ๐Ÿ‡ฉ๐Ÿ‡ช, oฬˆ, or ๐Ÿณ๏ธโ€๐ŸŒˆ? How do these display in a gnome terminal?

(tview is my project. I will remove the ZWJ handling. I only used it becausego-runewidth required setting it.)

mattn commented 5 years ago

I'll check it in later. But probably

แ„…แ…ฌแ†ซ is 2 cells ๐Ÿ‡ฉ๐Ÿ‡ช is 2 cells oฬˆ is 1 or 2 cells ๐Ÿณ๏ธโ€๐ŸŒˆ is 4 cells

rivo commented 5 years ago

Ok. Let me know when you got it.

image

Actually, here on GitHub, your answer suggests gnome terminal uses ZWJ properly. Both emojis look identical here.

mattn commented 5 years ago

What I look is:

image

rivo commented 5 years ago

Interesting. Well, let me know what the result of these other characters is.

If we need to be able to turn off individual Unicode breaking rules, then that functionality needs to be implemented in uniseg as well. Actually, in my opinion, it would even be better to set these flags not here in go-runewidth but in uniseg because these are grapheme breaking flags and that's the package that does it. It would then automatically affect your package, too.

I'll wait for the widths of the characters you wanted to look up and we'll see then how to proceed.

alecrabbit commented 5 years ago

What about แ„…แ…ฌแ†ซ, ๐Ÿ‡ฉ๐Ÿ‡ช, oฬˆ, or ๐Ÿณ๏ธโ€๐ŸŒˆ? How do these display in a gnome terminal?

If it helps here how these chars look in my terminal(Ubuntu 18.04/Gnome)

// Character StringWidth uniseg.Graphemes image

php-wcwidth's algo gets incorrect width for แ„…แ…ฌแ†ซ -> 4, in terminal it displayed with 2

mattn commented 5 years ago

@rivo Can you revert removing ZeroWidthJoiner?

mattn commented 4 years ago

Any thought?

rivo commented 4 years ago

Apologies for the late reply. And thanks to @alecrabbit for providing the screenshot. Well, reintroducing the ZeroWidthJoiner (ZWJ) flag will not solve the issue with the family icon (๐Ÿ‘ช) because it doesn't even contain a ZWJ code point (your own package, prior to this PR, gives 2 as a width, no matter how the flag is set). As mentioned before, the issue is larger than just the zero-width joiner topic. I.e. reverting will not solve much.

Also, the rivo/uniseg package has no way to disable individual rules so I don't even know how I would implement a ZWJ deactivation. Especially because it could have side effects on other rules.

If you insist on keeping that flag in there, I think you'll need to ignore this pull request.

I think it's generally difficult or even impossible to make this package work on all platforms. One would have to keep a database of how the different platforms render different characters. Since new emojis are introduced all the time, that's whole bunch of work. And it seems that not even popular packages get this right.

Here's iTerm on macOS 10.14:

image

Same with Apple's Terminal:

image

In short, I have no solution. There will always be someone who will complain. Unless someone is willing to put the effort into building a huge database of terminals+OSs and how they render characters, there will be mistakes.

But personally, I think this PR gets us closer to the correct way to handle this than the current version.

mattn commented 4 years ago

Could you please fix conflict?

rivo commented 4 years ago

Conflicts are resolved. However, some tests still fail. Specifically "TestStringWidth" (see my last bullet point in my opening comment above) and "TestZeroWidthJoiner". Regarding "TestZeroWidthJoiner", the problem appears not to be the zero-width joiner but the fact that the white flag ๐Ÿณ๏ธ (U+1F3F3) is assigned a width of 1 although its actual width is 2.

My terminal (iTerm2 on macOS) seems to make the same mistake:

image

And so is Apple's native Terminal application:

image

So this seems to be a problem across applications. In go-runewidth, U+1F3F3 is not contained in the "doublewidth" table:

https://github.com/mattn/go-runewidth/blob/588506649b41f3e5b80a725a9bea447106b4f0d3/runewidth_table.go#L50

I believe it should be in there. If you change the end of this line to

{0x1F3F3, 0x1F3F4}

the test will pass. (I can do it in this PR if you want.)

mattn commented 4 years ago

the test will pass. (I can do it in this PR if you want.)

Yes, please

rivo commented 4 years ago

the test will pass. (I can do it in this PR if you want.)

Yes, please

I made that change and also fixed the test case failures resulting from it.

I also fixed the Truncate() function which had problems (see here for details).

Please note that as mentioned way above, the TestStringWidth test case still fails:

=== RUN   TestStringWidth
    TestStringWidth: runewidth_test.go:246: StringWidth("โ– ใˆฑใฎไธ–็•Œโ‘ ") = 10, want 12 (EA)
    TestStringWidth: runewidth_test.go:246: StringWidth("ใ‚นใ‚ฟใƒผโ˜†") = 7, want 8 (EA)
    TestStringWidth: runewidth_test.go:246: StringWidth("ใคใฎใ โ˜†HIRO") = 11, want 12 (EA)

I'm not familiar with the EastAsianWidth flag so I hope you'll be able to fix this. If you need my input on this, please let me know.

mattn commented 4 years ago

Thanks. Could you please fix conflicts?

rivo commented 4 years ago

Could you please tell me what changes were made in the main branch? I cannot fix the conflicts if I don't understand what changes were made in parallel.

rivo commented 4 years ago

Thanks. This still leaves the failing test regarding the EastAsianWidth flag. Please see my message above.

rivo commented 4 years ago

Your merge conflict resolutions removed the necessary changes that I made, which is why your test failed. I made them again. Tests pass now.

mattn commented 4 years ago

Test is still failing.

https://travis-ci.org/github/mattn/go-runewidth/jobs/687468815

rivo commented 4 years ago

I don't know why this is. I fixed these tests. It works for me, see here:

image

gnojus commented 4 years ago

Travis-ci is running go generate, which results in regeneration of runewidth_table.go. This changes back one value on line 52 that was changed in this PR and the test fails.

dolmen commented 4 years ago

With this change this package isn't about rune (as defined in the Go spec) anymore but about graphemes. A package name change would be appropriate.

codecov-commenter commented 4 years ago

Codecov Report

Merging #29 into master will not change coverage. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##            master       #29   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            2         2           
  Lines          160       151    -9     
=========================================
- Hits           160       151    -9     
Impacted Files Coverage ฮ”
runewidth.go 100.00% <100.00%> (รธ)

Continue to review full report at Codecov.

Legend - Click here to learn more ฮ” = absolute <relative> (impact), รธ = not affected, ? = missing data Powered by Codecov. Last update 14e809f...f1f639b. Read the comment docs.

rivo commented 4 years ago

Thanks @nojus297 for this hint. It took me quite a long time to figure it out. It's a surprisingly difficult issue and it looks like there is not satisfying solution at the moment. So here we go:

go-runewidth uses the doublewidth table to check if a rune has a width of 2. That table is generated by extracting all runes from the East Asian Width table that have a property "W" which means "wide".

The white flag ๐Ÿณ๏ธ (which is the basis for the rainbow flag ๐Ÿณ๏ธโ€๐ŸŒˆ) is not contained in that table. It's probably new. The way I understand it, the East Asian Width table only includes emojis for historical reasons (East Asian code points were used when emojis were introduced).

Unicode Annex #11 "East Asian Width" says (Section 5, "Recommendations"):

[UTS51] emoji presentation sequences behave as though they were East Asian Wide, regardless of their assigned East_Asian_Width property value.

I believe go-runewidth should actually check the "emoji presentation table" and return a width of 2 if a rune is contained. However, there are two problems with this:

  1. go-runewidth does not maintain such a table. It only uses the Unicode "Extended_Pictographic" table but that table also contains emojis that have a width of 1.
  2. The white flag is not even contained in the Unicode "Emoji_Presentation" table. I don't know if this is a bug or if the recommendation I quoted above is expected to fail for some cases.

In any case, for now, I have changed the test such that for the rainbow flag, a width of 1 is expected. This appears to be in line with the official Unicode information I could find. And as mentioned further above, that's how the terminals also treat it. It's not ideal because obviously, the actual flag is wider than one character. But a solution would probably require the Unicode consortium to make changes (or at least clarify).

All checks are passing now.

rivo commented 3 years ago

Any thoughts?

mattn commented 3 years ago

As I mentioned above, I don't want to remove ZeroWidthJoiner. So any part using uniseg should be separeted with flag ZeroWidthJoiner.

rivo commented 3 years ago

As I mentioned, my project tview is the only one using ZeroWidthJoiner. (I searched for it.) That's because I was the one who requested it way back. No one else is using it.

As soon as this PR is merged, I will remove ZeroWidthJoiner from tview and then nobody will use it anymore.

mattn commented 3 years ago

As I mentioned, my project tview is the only one using ZeroWidthJoiner. (I searched for it.) That's because I was the one who requested it way back. No one else is using it.

Ah, I see. Thanks your explaining. Looks good to me. I'll check this soon.

mattn commented 3 years ago

Thank you!

rivo commented 3 years ago

Thank you!