unicode-rs / unicode-width

Displayed width of Unicode characters and strings according to UAX#11 rules.
https://unicode-rs.github.io/unicode-width

Emoji width #4

Closed gwenn closed 4 years ago

gwenn commented 8 years ago

I am not sure but the displayed width of emoji seems to be at least 2:

"❤️"
"12"
let w = unicode_width::UnicodeWidthStr::width("\u{2764}\u{fe0f}");
assert_eq!(2, w); // (left: `2`, right: `1`)

kwantam commented 8 years ago

The Unicode standard defines which characters should be considered wide and which should not. To my knowledge, emoji are not considered wide characters by the standard. Note also that width refers to number of columns when displayed in monospaced font; any character can appear wider when displayed in a proportional font.

(Anecdotally, the heart symbol above occupies one column in my Unicode-aware terminal.)

casey commented 7 years ago

From reading this, I believe that as of Unicode 9, emoji are now wide characters.

It also seems that as of unicode-width 0.1.4, emoji are considered to be wide characters, so this can be closed.

PS Thanks for writing this library!

gwenn commented 7 years ago

unicode-width 0.1.4 returns 1...

casey commented 7 years ago

Ah, it looks like unicode-width 0.1.4 reports that a ❤️ is one column wide, and a 😗 is two columns wide. I didn't specifically test the heart character, just emoji.

ogham commented 7 years ago

I thought I was experiencing this, but it turns out that my terminal was just getting the widths wrong and I was seeing it the wrong way!

typesanitizer commented 6 years ago

Apart from this, there is a problem with compound emojis. The current implementation just splits things up into characters and adds all the widths. That may not be correct in the presence of compound emojis like 👩‍🔬 = 👩 + ZWJ + 🔬 , as all the individual emojis have width 2.
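To make the per-character summing concrete, here is a minimal std-only sketch. `toy_width` is a hypothetical stand-in for a real width table (it is not unicode-width's data); it is chosen only so that the two emoji count as 2 columns each and ZWJ counts as 0.

```rust
/// Hypothetical stand-in for a per-character width table.
/// NOT the table used by unicode-width; for illustration only.
fn toy_width(c: char) -> usize {
    match c {
        // ZWJ and the two presentation selectors take no column on their own.
        '\u{200D}' | '\u{FE0E}' | '\u{FE0F}' => 0,
        // Rough block that covers the pictographic emoji used in this thread.
        '\u{1F300}'..='\u{1FAFF}' => 2,
        _ => 1,
    }
}

/// Sum the width of every `char`, ignoring how the chars combine.
fn summed_width(s: &str) -> usize {
    s.chars().map(toy_width).sum()
}

fn main() {
    // WOMAN + ZERO WIDTH JOINER + MICROSCOPE
    let scientist = "\u{1F469}\u{200D}\u{1F52C}";
    for c in scientist.chars() {
        println!("U+{:04X} -> {}", c as u32, toy_width(c));
    }
    println!("summed width: {}", summed_width(scientist));
}
```

Under these assumptions the sum is 2 + 0 + 2, whereas a renderer that draws the sequence as a single glyph would most likely use 2 columns.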

Manishearth commented 6 years ago

I don't think handling that is what this crate is about -- this crate implements a spec, a spec which doesn't attempt to deal with emoji.

typesanitizer commented 6 years ago

The docs say "we provide the width in columns". For characters in X, Y, Z categories, we do A, B, C. AIUI Emoji don't really fall into those categories, so I'd naively expect the result to be whatever makes the most sense (if there is one such result). Depending on the user's system -- whether the compound emoji can be rendered properly or not (in which case, it shows up as two separate emoji) -- the computed width will be different. The crate picks the width you'd get when it shows as split up, which is a reasonable choice.

However, since there are two reasonable answers here, I think if the precise scope and limitations of the crate were made clearer, then the behavior for compound emoji wouldn't be an issue. I'm happy to open a PR to add this clarification if you agree.

Manishearth commented 6 years ago

"if there is one such result"

There kinda isn't, the concept of "width" you're asking for is a matter of font, as well as the context (many terminals will not use emoji presentation, which means those will display as two)

The crate does already mention that it follows the UAX #11 rules. Feel free to add to the readme that this may not match actual rendered column width.

canndrew commented 5 years ago

I'd been using this crate on the assumption that UnicodeWidthStr::width would give the actual displayed width in columns. It's a shame that that assumption doesn't hold :/

Is there a non-trivial subset of strings for which the displayed column width is exactly specified and we can rely on it being accurate for any standards-compliant terminal? If so, can we add another method to UnicodeWidthStr which returns an Option<usize>? That way my terminal GUI library can know when it might have lost track of the cursor position.
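A rough sketch of what such an `Option<usize>`-returning helper could look like follows. Everything here is hypothetical: `checked_width`, `is_ambiguous`, and `toy_width` are names invented for illustration, and the list of "ambiguous" characters is not exhaustive. Nothing in it guarantees agreement with any particular terminal.

```rust
/// Characters whose presence makes the rendered width depend on the
/// font or terminal. Illustrative only; a real list would be longer.
fn is_ambiguous(c: char) -> bool {
    matches!(
        c,
        '\u{200D}'                      // zero width joiner
            | '\u{FE0E}' | '\u{FE0F}'   // variation selectors 15 and 16
            | '\u{1F3FB}'..='\u{1F3FF}' // emoji skin tone modifiers
    )
}

/// Returns `Some(width)` only when no ambiguous character is present
/// and `width_of` knows a width for every character.
fn checked_width(s: &str, width_of: impl Fn(char) -> Option<usize>) -> Option<usize> {
    let mut total = 0;
    for c in s.chars() {
        if is_ambiguous(c) {
            return None;
        }
        total += width_of(c)?;
    }
    Some(total)
}

/// Hypothetical per-character width: control characters have no width,
/// everything else is assumed to take one column.
fn toy_width(c: char) -> Option<usize> {
    if c.is_control() { None } else { Some(1) }
}

fn main() {
    println!("{:?}", checked_width("abc", toy_width));
    println!("{:?}", checked_width("\u{2764}\u{FE0F}", toy_width));
}
```

In real use, the `width_of` argument could presumably be `unicode_width::UnicodeWidthChar::width`, which also returns `Option<usize>`; the toy function is used here only to keep the sketch free of dependencies.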

keidax commented 5 years ago

In regards to UAX #11, the recommendations state:

"UTS51 emoji presentation sequences behave as though they were East Asian Wide, regardless of their assigned East_Asian_Width property value."

and as best as I can tell from this definition, "\u{2764}\u{fe0f}" would be a valid emoji presentation sequence.

In other words, it seems like the most "correct" behavior for a character with a text presentation by default, like U+2764, would be

assert_eq!(1, UnicodeWidthStr::width("\u{2764}"));
assert_eq!(1, UnicodeWidthStr::width("\u{2764}\u{fe0e}"));
assert_eq!(2, UnicodeWidthStr::width("\u{2764}\u{fe0f}"));

And for a character with an emoji presentation by default:

assert_eq!(2, UnicodeWidthStr::width("\u{26a1}"));
assert_eq!(1, UnicodeWidthStr::width("\u{26a1}\u{fe0e}"));
assert_eq!(2, UnicodeWidthStr::width("\u{26a1}\u{fe0f}"));

Of course, the rendering of this also seems to vary by OS and browser: ❤ ❤︎ ❤️ ⚡ ⚡︎ ⚡️
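The six assertions above can be satisfied by a small wrapper around any per-character width function. The following is a minimal sketch, not the crate's implementation: `width_with_selectors` and `toy_base_width` are hypothetical names, and for simplicity it treats any character followed by U+FE0F as an emoji presentation sequence, whereas UTS #51 only defines such sequences for specific base characters.

```rust
const VS15: char = '\u{FE0E}'; // text presentation selector
const VS16: char = '\u{FE0F}'; // emoji presentation selector

/// Width of `s` where a following variation selector overrides the
/// base character's own width: VS16 forces 2 columns, VS15 forces 1.
fn width_with_selectors(s: &str, base_width: impl Fn(char) -> usize) -> usize {
    let mut total = 0;
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        match chars.peek() {
            Some(&VS16) => {
                chars.next(); // consume the selector
                total += 2;
            }
            Some(&VS15) => {
                chars.next(); // consume the selector
                total += 1;
            }
            _ => total += base_width(c),
        }
    }
    total
}

/// Hypothetical base widths for the two characters discussed above:
/// U+26A1 defaults to emoji presentation (wide), U+2764 to text (narrow).
fn toy_base_width(c: char) -> usize {
    match c {
        '\u{26A1}' => 2,
        '\u{FE0E}' | '\u{FE0F}' => 0,
        _ => 1,
    }
}

fn main() {
    for s in ["\u{2764}", "\u{2764}\u{FE0E}", "\u{2764}\u{FE0F}",
              "\u{26A1}", "\u{26A1}\u{FE0E}", "\u{26A1}\u{FE0F}"] {
        println!("{:?} -> {}", s, width_with_selectors(s, toy_base_width));
    }
}
```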

wez commented 5 years ago

I don't really know much about this space, but here's my attempt at dealing with this in a terminal emulator.

/// Returns the number of cells visually occupied by a sequence
/// of graphemes
pub fn unicode_column_width(s: &str) -> usize {
    use unicode_segmentation::UnicodeSegmentation;
    s.graphemes(true).map(grapheme_column_width).sum()
}

/// Returns the number of cells visually occupied by a grapheme.
/// The input string must be a single grapheme.
pub fn grapheme_column_width(s: &str) -> usize {
    // Due to this issue:
    // https://github.com/unicode-rs/unicode-width/issues/4
    // we cannot simply use the unicode-width crate to compute
    // the desired value.
    // Let's check for emoji-ness for ourselves first
    use unicode_width::UnicodeWidthStr;
    use xi_unicode::EmojiExt;
    for c in s.chars() {
        if c.is_emoji_modifier_base() || c.is_emoji_modifier() {
            // treat modifier sequences as double wide
            return 2;
        }
    }
    UnicodeWidthStr::width(s)
}

worldmind commented 4 years ago

Not sure, but I suppose the example from this article is related to this issue:

use unicode_width::UnicodeWidthStr;

fn main() {
    println!("{}", "🤦🏼‍♂️".width());
}

It returns 5, but the article's author thinks it should be 2.
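One way to see where a result of 5 can come from is to list the scalar values in the string. The sketch below assumes the emoji in the post is the sequence U+1F926 U+1F3FC U+200D U+2642 U+FE0F, and the per-code-point widths in the `assumed` array are an assumption consistent with the reported result, not output taken from the crate.

```rust
/// Collect the Unicode scalar values of a string.
fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}

fn main() {
    // FACE PALM + skin tone modifier + ZWJ + MALE SIGN + VS16
    let facepalm = "\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}";
    for cp in code_points(facepalm) {
        println!("U+{:04X}", cp);
    }

    // Assumed widths when each code point is measured on its own:
    // two wide emoji, a zero-width joiner, a narrow symbol, a selector.
    let assumed: [usize; 5] = [2, 2, 0, 1, 0];
    println!("per-code-point sum: {}", assumed.iter().sum::<usize>());
}
```

A renderer that supports the full ZWJ sequence draws a single glyph, which is why the article's author expects 2.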

Manishearth commented 4 years ago

Right, this crate is dealing with a different notion of width.

christianparpart commented 4 years ago

@keidax is actually right. I came here not as a Rust dev but as a VTE dev, because I had forgotten where in the huge mass of Unicode (emoji) specs I had read that emoji presentation is always considered East Asian Wide (2 columns in monospaced fonts). So thanks for also providing the links, @keidax.

Sadly many VTEs and even client apps are still getting this wrong, but it seems to shift slightly (Kitty for example gets a lot of it right).

Jules-Bertholet commented 7 months ago

#41 added support for U+FE0F. (Emoji ZWJ sequences and skin tone modifiers remain unsupported, however.)