mathiasbynens / emoji-regex

A regular expression to match all Emoji-only symbols as per the Unicode Standard.
https://mths.be/emoji-regex
MIT License

Avoid matching emoji followed by text variation selector (U+FE0E) #61

Open dbrgn opened 5 years ago

dbrgn commented 5 years ago

First of all, thanks for this project! It's very useful.

It appears that the regex also matches code points that are followed by a text variation selector (U+FE0E).

The exclamation mark (U+2757) is an emoji with default emoji presentation. It should be matched both without a variation selector and with an emoji variation selector (U+FE0F).

However, it should not be matched when followed by a text variation selector (U+FE0E).

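// Import style depends on the installed emoji-regex version and TS config;
// older setups may need `import emojiRegex = require('emoji-regex');` instead.
import emojiRegex from 'emoji-regex';

// exec() returns RegExpExecArray | null, so `string[]` fails under strictNullChecks.
let m: RegExpExecArray | null;

// Case 1: bare code point, no variation selector.
console.info('no variation');
const r1 = emojiRegex();
while ((m = r1.exec('\u2757')) !== null) {
    console.log('match', m, 'lastIndex', r1.lastIndex);
}

// Case 2: followed by the text variation selector (U+FE0E).
console.info('text variation');
const r2 = emojiRegex();
while ((m = r2.exec('\u2757\ufe0e')) !== null) {
    console.log('match', m, 'lastIndex', r2.lastIndex);
}

// Case 3: followed by the emoji variation selector (U+FE0F).
console.info('emoji variation');
const r3 = emojiRegex();
while ((m = r3.exec('\u2757\ufe0f')) !== null) {
    console.log('match', m, 'lastIndex', r3.lastIndex);
}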

This will match the emoji 3 times, each time with length 1.

My expectation would be that the version without a variation selector is matched with length 1, that the version with the emoji variation selector is matched with length 2, and that the version with the text variation selector is not matched at all.
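
For illustration, the expectation described above can be approximated by post-filtering matches in user code, without changing the library. This is only a sketch: `standInEmojiRegex` and `matchWithSelectors` are hypothetical names, and the stand-in pattern (a single code point with `Emoji_Presentation = Yes`) is an assumption that is far less complete than the real `emojiRegex()` output.

```typescript
// Hypothetical stand-in for emojiRegex(): matches one code point with
// Emoji_Presentation = Yes. The real pattern covers far more sequences.
const standInEmojiRegex = (): RegExp =>
    new RegExp('\\p{Emoji_Presentation}', 'gu');

const TEXT_VS = '\uFE0E';  // VARIATION SELECTOR-15 (text presentation)
const EMOJI_VS = '\uFE0F'; // VARIATION SELECTOR-16 (emoji presentation)

// Hypothetical helper: drops matches followed by U+FE0E and extends
// matches followed by U+FE0F to include the selector.
function matchWithSelectors(input: string, regex: RegExp): string[] {
    const results: string[] = [];
    let m: RegExpExecArray | null;
    while ((m = regex.exec(input)) !== null) {
        // Code unit right after the match ('' at the end of the input).
        const next = input.charAt(regex.lastIndex);
        if (next === TEXT_VS) {
            continue; // text presentation requested: skip this match
        }
        results.push(next === EMOJI_VS ? m[0] + next : m[0]);
    }
    return results;
}

console.log(matchWithSelectors('\u2757', standInEmojiRegex()).map(s => s.length));       // [ 1 ]
console.log(matchWithSelectors('\u2757\uFE0E', standInEmojiRegex()).map(s => s.length)); // []
console.log(matchWithSelectors('\u2757\uFE0F', standInEmojiRegex()).map(s => s.length)); // [ 2 ]
```

The same helper could be given the real `emojiRegex()` instance instead of the stand-in, assuming it is created with the global flag as in the snippet above.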

ragurney commented 4 years ago

Any update on this?

mathiasbynens commented 3 years ago

Good catch. We could add a negative lookahead for \uFE0E (that is, (?!\uFE0E)) to avoid matching in the text variation case. I think that's all there is to do here, for the following reasons.
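
As a rough sketch of the negative-lookahead idea, the snippet below appends `(?!\uFE0E)` to a simplified stand-in pattern. The stand-in (one code point with `Emoji_Presentation = Yes`) and the names `emojiNotTextStyle` and `collectMatches` are assumptions for illustration only; the actual generated pattern in the library is much larger, and where exactly the lookahead would go in it is not shown here.

```typescript
// Assumed stand-in pattern: one Emoji_Presentation code point that is NOT
// followed by VARIATION SELECTOR-15 (U+FE0E).
const emojiNotTextStyle = (): RegExp =>
    new RegExp('\\p{Emoji_Presentation}(?!\\uFE0E)', 'gu');

// Hypothetical helper: collects every match of the stand-in pattern.
function collectMatches(input: string): string[] {
    const regex = emojiNotTextStyle();
    const found: string[] = [];
    let m: RegExpExecArray | null;
    while ((m = regex.exec(input)) !== null) {
        found.push(m[0]);
    }
    return found;
}

console.log(collectMatches('\u2757').length);       // 1
console.log(collectMatches('\u2757\uFE0E').length); // 0 (lookahead rejects it)
console.log(collectMatches('\u2757\uFE0F').length); // 1 (selector not included)
```

Note that the emoji variation case still yields a match of length 1 here, since the lookahead only rules out U+FE0E and does not consume U+FE0F.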

For the emoji variation selector case, I'd expect it to match just the emoji itself in case that's already an RGI_Emoji string, and only to match the emoji + the emoji variation selector as a whole in cases where the emoji is unqualified by itself (i.e. where the variation selector is not redundant). This matches the spec. https://unicode.org/reports/tr51/#def_basic_emoji_set says:

ED-20. basic emoji set — The set of emoji characters and emoji presentation sequences listed in the emoji-sequences.txt file [emoji-data] under the type_field Basic_Emoji.

  • This is the set of emoji intended for general-purpose input.
  • This set excludes all instances of an emoji component, which are not intended for independent, direct input.
  • This set otherwise includes all instances of an emoji character with the property value Emoji_Presentation = Yes and all instances of a valid emoji presentation sequence whose base character has the property value Emoji_Presentation = No.

The sequence U+2757 U+FE0F is a valid presentation sequence per emoji-variation-sequences.txt, but since U+2757 by itself already has Emoji_Presentation = Yes (not No) it’s not included in RGI_Emoji.

TL;DR U+2757 is RGI_Emoji, but U+2757 U+FE0F is not.
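
The reasoning above can be checked roughly with JavaScript's built-in Unicode property escapes. This is a minimal sketch under stated assumptions: `looksLikeBasicEmoji` is a hypothetical helper that approximates only the third bullet of ED-20. It does not exclude emoji components and does not consult emoji-sequences.txt or emoji-variation-sequences.txt, so it over-accepts inputs such as a digit followed by U+FE0F.

```typescript
const hasEmojiProperty = new RegExp('^\\p{Emoji}$', 'u');
const hasEmojiPresentation = new RegExp('^\\p{Emoji_Presentation}$', 'u');
const EMOJI_VS16 = '\uFE0F'; // VARIATION SELECTOR-16 (emoji presentation)

// Hypothetical approximation of ED-20, third bullet only:
//  - a single code point with Emoji_Presentation = Yes, or
//  - a base with Emoji = Yes and Emoji_Presentation = No, followed by U+FE0F.
function looksLikeBasicEmoji(s: string): boolean {
    if (hasEmojiPresentation.test(s)) {
        return true;
    }
    if (s.length >= 2 && s.charAt(s.length - 1) === EMOJI_VS16) {
        const base = s.slice(0, -1);
        return hasEmojiProperty.test(base) && !hasEmojiPresentation.test(base);
    }
    return false;
}

console.log(looksLikeBasicEmoji('\u2757'));       // true  (Emoji_Presentation = Yes)
console.log(looksLikeBasicEmoji('\u2757\uFE0F')); // false (selector is redundant)
console.log(looksLikeBasicEmoji('\u263A'));       // false (Emoji_Presentation = No)
console.log(looksLikeBasicEmoji('\u263A\uFE0F')); // true  (presentation sequence)
```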

References to the relevant data files follow.


https://unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt

2757          ; Emoji                # E0.6   [1] (❗)       exclamation mark

https://unicode.org/Public/13.0.0/ucd/emoji/emoji-variation-sequences.txt

2757 FE0E  ; text style;  # (5.2) HEAVY EXCLAMATION MARK SYMBOL
2757 FE0F  ; emoji style; # (5.2) HEAVY EXCLAMATION MARK SYMBOL

https://unicode.org/Public/emoji/13.1/emoji-sequences.txt

2757          ; Basic_Emoji                  ; red exclamation mark                                           # E0.6   [1] (❗)

mirabilos commented 1 year ago

Yes, fix this please! This is appalling, I hate seeing my favourite characters (like U+263A) get Emoji presentation by default suddenly because more than half the software I see doesn’t implement variant selection properly. This is a bug with massive impact, and anything suffixed with U+FE0E must not be rendered as Emoji!