Closed rivo closed 3 years ago
Thanks your contribution. I'm thinking there are terminals which does not support ZWJ yet. So I prefer to disable ZWJ.
How to add enable/disable ZWJ?
Using rivo/uniseg
means we're doing proper grouping of runes now. There are many more rules than just the handling of a zero-width joiner, see here. As an example, flags such as ๐ฏ๐ต consist of two runes but don't contain zero-width joiners, they follow other rules.
If you want to support terminals that don't render these characters, the question is how do they render them? If the difference is that some terminals render characters rune-by-rune, then a flag like UnicodeSegmentation
would be more appropriate.
But singling out the zero-width joiner rule (which is only one of 13 rules) is probably not very useful.
Let me know what you think about this.
Most of terminals which does not support ZWJ displays Unicode characters as what we can see. For example, ๐ช
is displayed as ๐จโ๐ฉโ๐ฆ
.
So you're saying that these terminals support all kinds of Unicode characters (e.g. flags, Hangul, spacing/prepend marks) but ZWJ is the one they don't support?
Do you have an example for such a terminal?
For example, gnome-terminal displays ๐ช
as ๐จโ๐ฉโ๐ฆ
.
Also ZeroWidthJoiner is already used in some projects.
https://github.com/4avka/tview/blob/d7a9d6c70ab205d578fc64f02f4e681ccb5b5180/util.go#L56
What about แ
แ
ฌแซ
, ๐ฉ๐ช
, oฬ
, or ๐ณ๏ธโ๐
? How do these display in a gnome terminal?
(tview
is my project. I will remove the ZWJ handling. I only used it becausego-runewidth
required setting it.)
I'll check it in later. But probably
แ แ ฌแซ is 2 cells ๐ฉ๐ช is 2 cells oฬ is 1 or 2 cells ๐ณ๏ธโ๐ is 4 cells
Ok. Let me know when you got it.
Actually, here on GitHub, your answer suggests gnome terminal uses ZWJ properly. Both emojis look identical here.
What I look is:
Interesting. Well, let me know what the result of these other characters is.
If we need to be able to turn off individual Unicode breaking rules, then that functionality needs to be implemented in uniseg
as well. Actually, in my opinion, it would even be better to set these flags not here in go-runewidth
but in uniseg
because these are grapheme breaking flags and that's the package that does it. It would then automatically affect your package, too.
I'll wait for the widths of the characters you wanted to look up and we'll see then how to proceed.
What about แ แ ฌแซ, ๐ฉ๐ช, oฬ, or ๐ณ๏ธโ๐? How do these display in a gnome terminal?
If it helps here how these chars look in my terminal(Ubuntu 18.04/Gnome)
// Character StringWidth uniseg.Graphemes
php-wcwidth's
algo gets incorrect width forแ แ ฌแซ
-> 4, in terminal it displayed with 2
@rivo Can you revert removing ZeroWidthJoiner?
Any thought?
Apologies for the late reply. And thanks to @alecrabbit for providing the screenshot. Well, reintroducing the ZeroWidthJoiner
(ZWJ) flag will not solve the issue with the family icon (๐ช) because it doesn't even contain a ZWJ code point (your own package, prior to this PR, gives 2 as a width, no matter how the flag is set). As mentioned before, the issue is larger than just the zero-width joiner topic. I.e. reverting will not solve much.
Also, the rivo/uniseg
package has no way to disable individual rules so I don't even know how I would implement a ZWJ deactivation. Especially because it could have side effects on other rules.
If you insist on keeping that flag in there, I think you'll need to ignore this pull request.
I think it's generally difficult or even impossible to make this package work on all platforms. One would have to keep a database of how the different platforms render different characters. Since new emojis are introduced all the time, that's whole bunch of work. And it seems that not even popular packages get this right.
Here's iTerm on macOS 10.14:
Same with Apple's Terminal:
In short, I have no solution. There will always be someone who will complain. Unless someone is willing to put the effort into building a huge database of terminals+OSs and how they render characters, there will be mistakes.
But personally, I think this PR gets us closer to the correct way to handle this than the current version.
Could you please fix conflict?
Conflicts are resolved. However, some tests still fail. Specifically "TestStringWidth" (see my last bullet point in my opening comment above) and "TestZeroWidthJoiner". Regarding "TestZeroWidthJoiner", the problem appears not to be the zero-width joiner but the fact that the white flag ๐ณ๏ธ (U+1F3F3) is assigned a width of 1 although its actual width is 2.
My terminal (iTerm2 on macOS) seems to make the same mistake:
And so is Apple's native Terminal application:
So this seems to be a problem across applications. In go-runewidth
, U+1F3F3 is not contained in the "doublewidth" table:
I believe it should be in there. If you change the end of this line to
{0x1F3F3, 0x1F3F4}
the test will pass. (I can do it in this PR if you want.)
the test will pass. (I can do it in this PR if you want.)
Yes, please
the test will pass. (I can do it in this PR if you want.)
Yes, please
I made that change and also fixed the test case failures resulting from it.
I also fixed the Truncate()
function which had problems (see here for details).
Please note that as mentioned way above, the TestStringWidth
test case still fails:
=== RUN TestStringWidth
TestStringWidth: runewidth_test.go:246: StringWidth("โ ใฑใฎไธ็โ ") = 10, want 12 (EA)
TestStringWidth: runewidth_test.go:246: StringWidth("ในใฟใผโ") = 7, want 8 (EA)
TestStringWidth: runewidth_test.go:246: StringWidth("ใคใฎใ โHIRO") = 11, want 12 (EA)
I'm not familiar with the EastAsianWidth
flag so I hope you'll be able to fix this. If you need my input on this, please let me know.
Thanks. Could you please fix conflicts?
Could you please tell me what changes were made in the main branch? I cannot fix the conflicts if I don't understand what changes were made in parallel.
Thanks. This still leaves the failing test regarding the EastAsianWidth
flag. Please see my message above.
Your merge conflict resolutions removed the necessary changes that I made, which is why your test failed. I made them again. Tests pass now.
Test is still failing.
https://travis-ci.org/github/mattn/go-runewidth/jobs/687468815
I don't know why this is. I fixed these tests. It works for me, see here:
Travis-ci
is running go generate
, which results in regeneration of runewidth_table.go
. This changes back one value on line 52 that was changed in this PR and the test fails.
With this change this package isn't about rune
(as defined in the Go spec) anymore but about graphemes. A package name change would be appropriate.
Merging #29 into master will not change coverage. The diff coverage is
100.00%
.
@@ Coverage Diff @@
## master #29 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 2 2
Lines 160 151 -9
=========================================
- Hits 160 151 -9
Impacted Files | Coverage ฮ | |
---|---|---|
runewidth.go | 100.00% <100.00%> (รธ) |
Continue to review full report at Codecov.
Legend - Click here to learn more
ฮ = absolute <relative> (impact)
,รธ = not affected
,? = missing data
Powered by Codecov. Last update 14e809f...f1f639b. Read the comment docs.
Thanks @nojus297 for this hint. It took me quite a long time to figure it out. It's a surprisingly difficult issue and it looks like there is not satisfying solution at the moment. So here we go:
go-runewidth
uses the doublewidth
table to check if a rune has a width of 2. That table is generated by extracting all runes from the East Asian Width table that have a property "W" which means "wide".
The white flag ๐ณ๏ธ (which is the basis for the rainbow flag ๐ณ๏ธโ๐) is not contained in that table. It's probably new. The way I understand it, the East Asian Width table only includes emojis for historical reasons (East Asian code points were used when emojis were introduced).
Unicode Annex #11 "East Asian Width" says (Section 5, "Recommendations"):
[UTS51] emoji presentation sequences behave as though they were East Asian Wide, regardless of their assigned East_Asian_Width property value.
I believe go-runewidth
should actually check the "emoji presentation table" and return a width of 2 if a rune is contained. However, there are two problems with this:
go-runewidth
does not maintain such a table. It only uses the Unicode "Extended_Pictographic" table but that table also contains emojis that have a width of 1.In any case, for now, I have changed the test such that for the rainbow flag, a width of 1 is expected. This appears to be in line with the official Unicode information I could find. And as mentioned further above, that's how the terminals also treat it. It's not ideal because obviously, the actual flag is wider than one character. But a solution would probably require the Unicode consortium to make changes (or at least clarify).
All checks are passing now.
Any thoughts?
As I mentioned above, I don't want to remove ZeroWidthJoiner. So any part using uniseg should be separeted with flag ZeroWidthJoiner.
As I mentioned, my project tview
is the only one using ZeroWidthJoiner
. (I searched for it.) That's because I was the one who requested it way back. No one else is using it.
As soon as this PR is merged, I will remove ZeroWidthJoiner
from tview
and then nobody will use it anymore.
As I mentioned, my project tview is the only one using ZeroWidthJoiner. (I searched for it.) That's because I was the one who requested it way back. No one else is using it.
Ah, I see. Thanks your explaining. Looks good to me. I'll check this soon.
Thank you!
Thank you!
This introduces an implementation of StringWidth() using Unicode grapheme clusters which should be the correct way to split a string into its individual characters. The built-in assumption is that if we have combined runes (emojis, flags etc.), their width is the width of the first non-zero-width rune. Many of these combined runes were previously not handled correctly by this package.
Please note:
rivo/uniseg
over.)TestStringWidth
test case but only the part whereEastAsianWidth = true
. I'm not very familiar with this flag so I don't know how to fix that. You may want to review this.