Regional Indicators (Flags) and Grapheme Clusters

rivo commented 5 years ago

Here's a short example that illustrates an issue with flags (or "regional indicators"):

fmt.Println(runewidth.StringWidth("🇩🇪")) // Should be "2", outputs "4".

The flag consists of two code points which are processed separately by runewidth. But most modern systems will combine them into one flag emoji.
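To make the two code points visible, here is a small sketch that uses only the Go standard library. The helper names `codePoints` and `isRegionalIndicator` are made up for this illustration and are not part of runewidth or uniseg.

```go
package main

import "fmt"

// codePoints splits a string into its individual Unicode code points.
// (Hypothetical helper, named for this example only.)
func codePoints(s string) []rune {
	return []rune(s)
}

// isRegionalIndicator reports whether r lies in the Regional Indicator
// Symbol range U+1F1E6..U+1F1FF.
func isRegionalIndicator(r rune) bool {
	return r >= 0x1F1E6 && r <= 0x1F1FF
}

func main() {
	flag := "🇩🇪"
	cps := codePoints(flag)
	fmt.Println("code points:", len(cps))
	for _, r := range cps {
		fmt.Printf("%U regional indicator: %v\n", r, isRegionalIndicator(r))
	}
}
```

A width function that looks at each of these code points in isolation has no way to know that a terminal will draw the pair as a single flag.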

This is part of a larger topic which I describe in more detail here: gdamore/tcell#264. It doesn't just affect flags but also characters in e.g. Arabic and Korean where there are more sophisticated rules than "combining characters" and zero-width joiners (which you added with #20).

I don't know exactly how you calculate the widths of characters. I'm also not sure how you would solve flags, as well as some of the other rules described in the Unicode specification, but it would sure be nice, as printing these flags currently gives me trouble in tview. There have been multiple issues asking for better support for different languages and emojis, so it seems that quite a few people use the terminal with these characters.

(Maybe my new package uniseg can help you here.)

rivo commented 5 years ago

Here's my own implementation of the "string width" function which takes grapheme clusters into account:

https://github.com/rivo/tview/blob/8d5eba0c2f51d8ae971c5a470e354bbc2aae6777/util.go#L419

It's based on the assumption that the width of a grapheme cluster is the width of the first non-zero-width rune. That's just my guess but it works fine for a bunch of examples I tried manually.
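A minimal sketch of that assumption, not the linked tview code. It assumes the string has already been split into grapheme clusters (for example by uniseg), and `toyRuneWidth` is a hypothetical stand-in for a real per-rune width table that only knows a few ranges.

```go
package main

import "fmt"

// toyRuneWidth is a hypothetical, heavily simplified per-rune width lookup.
// A real implementation would consult the full Unicode width tables.
func toyRuneWidth(r rune) int {
	switch {
	case r == 0x200D, // zero-width joiner
		r >= 0x0300 && r <= 0x036F, // combining diacritical marks
		r >= 0xFE00 && r <= 0xFE0F: // variation selectors
		return 0
	case r >= 0x1F1E6 && r <= 0x1F1FF: // regional indicators
		return 2
	default:
		return 1
	}
}

// sumWidth adds up the width of every rune, ignoring cluster boundaries.
func sumWidth(runes []rune, width func(rune) int) int {
	total := 0
	for _, r := range runes {
		total += width(r)
	}
	return total
}

// clusterWidth returns the width of the first non-zero-width rune in a
// single grapheme cluster, or 0 if every rune has zero width.
func clusterWidth(cluster []rune, width func(rune) int) int {
	for _, r := range cluster {
		if w := width(r); w > 0 {
			return w
		}
	}
	return 0
}

func main() {
	flag := []rune("🇩🇪") // one grapheme cluster, two regional indicators
	fmt.Println("per-rune sum:", sumWidth(flag, toyRuneWidth))
	fmt.Println("per-cluster: ", clusterWidth(flag, toyRuneWidth))
}
```

With this toy table the per-rune sum for the flag is 4 and the per-cluster width is 2, which mirrors the difference described at the top of this issue.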

Maybe you want to use this implementation in your package. I think it would definitely improve the calculation of a string's width. You could then also get rid of the special zero-width-joiner handling, as it's all implicit in the uniseg package.

mattn commented 5 years ago

Could you please send me a PR?

alecrabbit commented 5 years ago

Hi! I'm not sure if this issue is related, but I assume it is. My terminal renders the characters {"←", "↖", "↑", "↗", "→", "↘", "↓", "↙"} with width 1 and everything works as it should. However, runewidth.StringWidth(char) returns [1 2 1 2 1 2 1 2] respectively, and that breaks the output.

// Character StringWidth uniseg.Graphemes
    ←         1           [2190]
    ↖         2           [2196]
    ↑         1           [2191]
    ↗         2           [2197]
    →         1           [2192]
    ↘         2           [2198]
    ↓         1           [2193]
    ↙         2           [2199]

The same happens for:

// Character StringWidth uniseg.Graphemes
    ■         1           [25a0]
    □         1           [25a1]
    ▪         2           [25aa]
    ▫         2           [25ab]

I hope this additional info will help.

My PHP package php-wcwidth (which is practically a dumb clone of Python's jquast/wcwidth) gets the widths of these characters right.

mattn commented 5 years ago

Thank you. Could you please show me a screenshot?

This is a screenshot taken in my environment.

alecrabbit commented 5 years ago

this one?

alecrabbit commented 5 years ago

same but larger

alecrabbit commented 5 years ago

and from terminal

mattn commented 5 years ago

What is your $LANG?

alecrabbit commented 5 years ago

LANG=en_US.UTF-8

mattn commented 5 years ago

@joshuarubin Is 0x2194 correctly listed as an emoji?

alecrabbit commented 5 years ago

@mattn here's what I found out: these do not have emoji variants

but these do:

alecrabbit commented 5 years ago

My terminal can print them both. However, I'm unable to figure out how to print the emoji variant from my code: printing from code gives ↔, and copy-pasting also gives ↔.

alecrabbit commented 5 years ago

It seems that 2194 has to be followed by fe0f to print the emoji, so the sequence is 2194 fe0f.

Update, from DerivedGeneralCategory.txt:

FE00..FE0F    ; Mn #  [16] VARIATION SELECTOR-1..VARIATION SELECTOR-16
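A small standard-library sketch of that sequence, in case it helps with printing the emoji form from code. The helper names `isVariationSelector` and `isNonspacingMark` are made up for this example, and whether the second string is actually drawn as an emoji depends on the terminal and font.

```go
package main

import (
	"fmt"
	"unicode"
)

// isVariationSelector reports whether r is in U+FE00..U+FE0F
// (VARIATION SELECTOR-1 through VARIATION SELECTOR-16).
func isVariationSelector(r rune) bool {
	return r >= 0xFE00 && r <= 0xFE0F
}

// isNonspacingMark reports whether r has general category Mn,
// the category listed for FE00..FE0F in DerivedGeneralCategory.txt.
func isNonspacingMark(r rune) bool {
	return unicode.Is(unicode.Mn, r)
}

func main() {
	plain := "\u2194"       // LEFT RIGHT ARROW on its own
	emoji := "\u2194\uFE0F" // same arrow followed by VARIATION SELECTOR-16

	fmt.Println(plain, emoji)
	for _, r := range emoji {
		fmt.Printf("%U Mn=%v selector=%v\n", r, isNonspacingMark(r), isVariationSelector(r))
	}
}
```

Since the selector is a nonspacing mark, a width calculation that works per grapheme cluster would see the arrow and the selector as one unit rather than two independent runes.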

mattn / go-runewidth

Regional Indicators (Flags) and Grapheme Clusters #28