Open dbrgn opened 5 years ago
Any update on this?
Good catch. We could add a negative lookahead for `\uFE0E` (that is, `(?!\uFE0E)`) to avoid matching in the text variation case. I think that's all there is to do here, for the following reasons.
For the emoji variation selector case, I'd expect the regex to match just the emoji itself when that's already an `RGI_Emoji` string, and to match the emoji + the emoji variation selector as a whole only when the emoji is unqualified by itself (i.e. where the variation selector is not redundant). This matches the spec. https://unicode.org/reports/tr51/#def_basic_emoji_set says:
ED-20. basic emoji set — The set of emoji characters and emoji presentation sequences listed in the `emoji-sequences.txt` file [emoji-data] under the type_field `Basic_Emoji`.
- This is the set of emoji intended for general-purpose input.
- This set excludes all instances of an emoji component, which are not intended for independent, direct input.
- This set otherwise includes all instances of an emoji character with the property value `Emoji_Presentation = Yes` and all instances of a valid emoji presentation sequence whose base character has the property value `Emoji_Presentation = No`.
The sequence U+2757 U+FE0F is a valid presentation sequence per `emoji-variation-sequences.txt`, but since U+2757 by itself already has `Emoji_Presentation = Yes` (not `No`), the sequence is not included in `RGI_Emoji`.
TL;DR: U+2757 is `RGI_Emoji`, but U+2757 U+FE0F is not.
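The ED-20 rule quoted above can be sketched in a few lines of Python. This is only an illustration: the `EMOJI_PRESENTATION` table is a hypothetical, hand-copied two-entry subset of the Unicode data (U+2757 from this thread, plus U+263A as an example of a character with `Emoji_Presentation = No`), the function names are made up for this sketch, and the exclusion of emoji components is ignored.

```python
# Hypothetical, hand-copied subset of the Unicode emoji data;
# a real implementation would parse emoji-data.txt instead.
EMOJI_PRESENTATION = {
    0x2757: True,   # red exclamation mark: Emoji_Presentation = Yes
    0x263A: False,  # smiling face: Emoji_Presentation = No
}
VS16 = 0xFE0F  # emoji variation selector


def basic_emoji_form(base):
    """Return the code point sequence that ED-20 places in the basic emoji set."""
    if EMOJI_PRESENTATION[base]:
        # Already emoji by default: the bare character is the listed form.
        return (base,)
    # Text by default: only the emoji presentation sequence is listed.
    return (base, VS16)


def is_basic_emoji(seq):
    """Check whether seq is the listed form of a known base character."""
    if not seq or seq[0] not in EMOJI_PRESENTATION:
        return False
    return tuple(seq) == basic_emoji_form(seq[0])


print(is_basic_emoji((0x2757,)))        # bare U+2757
print(is_basic_emoji((0x2757, VS16)))   # U+2757 U+FE0F
print(is_basic_emoji((0x263A, VS16)))   # U+263A U+FE0F
```

Under these assumptions, the bare U+2757 and the sequence U+263A U+FE0F are accepted, while U+2757 U+FE0F is rejected because its variation selector is redundant.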
References to the relevant data files follow.
https://unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt
2757 ; Emoji # E0.6 [1] (❗) exclamation mark
https://unicode.org/Public/13.0.0/ucd/emoji/emoji-variation-sequences.txt
2757 FE0E ; text style; # (5.2) HEAVY EXCLAMATION MARK SYMBOL
2757 FE0F ; emoji style; # (5.2) HEAVY EXCLAMATION MARK SYMBOL
https://unicode.org/Public/emoji/13.1/emoji-sequences.txt
2757 ; Basic_Emoji ; red exclamation mark # E0.6 [1] (❗)
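As a rough sketch of the negative lookahead proposed in this comment, the following Python snippet uses a toy pattern that covers only U+2757. It is not the library's actual regex, and the names `TOY_EMOJI` and `find_emoji` are hypothetical. It follows the reading given above: with an emoji variation selector only the base character is matched, and with a text variation selector nothing is matched.

```python
import re

# Toy pattern for U+2757 only; a real emoji regex has many more alternatives.
# The negative lookahead rejects a match followed by the text variation selector.
TOY_EMOJI = re.compile(r"\u2757(?!\uFE0E)")


def find_emoji(text):
    """Return (start index, match length) for every match in text."""
    return [(m.start(), len(m.group())) for m in TOY_EMOJI.finditer(text)]


# Bare U+2757, then U+2757 U+FE0F, then U+2757 U+FE0E, separated by spaces.
sample = "\u2757 \u2757\uFE0F \u2757\uFE0E"
print(find_emoji(sample))  # [(0, 1), (2, 1)]
```

The first two occurrences match with length 1 and the third, which carries U+FE0E, is skipped.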
Yes, fix this please! This is appalling, I hate seeing my favourite characters (like U+263A) get Emoji presentation by default suddenly because more than half the software I see doesn’t implement variant selection properly. This is a bug with massive impact, and anything suffixed with U+FE0E must not be rendered as Emoji!
First of all, thanks for this project! It's very useful.
It appears that the regex even matches code points that are followed by a text variation selector (U+FE0E).
The exclamation mark is an emoji with emoji-default presentation. It should be matched both without a variation selector and with an emoji variation selector (U+FE0F).
However, it should not be matched when followed by a text variation selector (U+FE0E).
This will match the emoji 3 times, each time with length 1.
My expectation would be that the version without a variation selector is matched with length 1, that the version with the emoji variation selector is matched with length 2, and that the version with the text variation selector is not matched at all.
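The expectation described here can be written down as a small Python sketch. The pattern is a hypothetical stand-in that handles only U+2757, not the library's regex, and `match_lengths` is a name invented for this example. Note that it encodes this report's expectation (length 2 with U+FE0F), which differs from the maintainer's reading that U+2757 U+FE0F should match only the base character.

```python
import re

# Hypothetical pattern: U+2757, optionally followed by the emoji variation
# selector, and never followed by the text variation selector.
EXPECTED_PATTERN = re.compile(r"\u2757\uFE0F?(?!\uFE0E)")


def match_lengths(text):
    """Return the length of every match found in text."""
    return [len(m.group()) for m in EXPECTED_PATTERN.finditer(text)]


print(match_lengths("\u2757"))        # [1]  no variation selector
print(match_lengths("\u2757\uFE0F"))  # [2]  emoji variation selector
print(match_lengths("\u2757\uFE0E"))  # []   text variation selector
```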