Emoji with U+fe0f match only first character

jcubic commented 5 years ago

Another emoji that doesn't match properly:

☝️ Index Pointing Up U+261d U+fe0f

It matches only the first character: ☝

tonton-pixel commented 5 years ago

I actually wrote a similar module myself and, after extensive testing, I think I found why the reported match is incorrect.

IMHO, there is a slight flaw in the code used to generate the emoji regular expression:

module.exports = () => {
    // https://mathiasbynens.be/notes/es-unicode-property-escapes#emoji
    return /<% emojiSequence %>|\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;
};

From left to right:

U+261D U+FE0F is not matched by any emoji sequence parsed from emoji-sequences.txt or emoji-zwj-sequences.txt (the only sequences involving U+261D are the five skin tone variations U+261D U+1F3FB to U+261D U+1F3FF).
U+261D U+FE0F is not matched by \p{Emoji_Modifier_Base}\p{Emoji_Modifier} for the same reason: \p{Emoji_Modifier} as defined in emoji-data.txt can only be one of U+1F3FB to U+1F3FF.
Since \p{Emoji_Modifier} is optional, U+261D U+FE0F is then tested against \p{Emoji_Modifier_Base} alone, and a match is indeed found, but only for the first code point U+261D. Since a regular expression engine is eager, it tries the alternatives from left to right and stops as soon as it finds a valid match, so the rest of the expression is never tested, namely the last part \p{Emoji}\uFE0F, which is the alternative that would have produced the right match. (\p{Emoji_Presentation} wouldn't have been a proper candidate either, since it represents characters which have an emoji presentation by default and doesn't include U+261D.)

So, I think \p{Emoji_Modifier} should not be optional in \p{Emoji_Modifier_Base}\p{Emoji_Modifier}?.
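The behaviour described above can be reproduced in isolation. The following is only a minimal sketch: it leaves out the <% emojiSequence %> alternative (an assumption made for brevity, relying on the statement above that none of the parsed sequences matches U+261D U+FE0F), and the variable and helper names are illustrative, not taken from the module.

```javascript
// Simplified patterns: the <% emojiSequence %> alternative is omitted (assumption).
// First pattern: \p{Emoji_Modifier} is optional, as in the current template.
const optionalModifier = /\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;
// Second pattern: \p{Emoji_Modifier} is required, as suggested above.
const requiredModifier = /\p{Emoji_Modifier_Base}\p{Emoji_Modifier}|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;

const indexPointingUp = '\u261D\uFE0F'; // U+261D U+FE0F

// Hypothetical helper: render each match as a list of its code points.
const codePoints = (matches) =>
  (matches || []).map((m) =>
    [...m].map((c) => 'U+' + c.codePointAt(0).toString(16).toUpperCase()).join(' ')
  );

const withOptional = codePoints(indexPointingUp.match(optionalModifier));
const withRequired = codePoints(indexPointingUp.match(requiredModifier));

console.log(withOptional); // [ 'U+261D' ]
console.log(withRequired); // [ 'U+261D U+FE0F' ]
```

With the optional modifier, the first alternative succeeds on U+261D alone and the trailing U+FE0F is left unmatched; with the modifier required, that alternative fails and \p{Emoji}\uFE0F gets its turn.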

Actually, the whole expression could be dropped entirely, since it is already covered by the injected <% emojiSequence %>, which contains all the sequences of type Emoji_Modifier_Sequence, and these are strictly equivalent.

So, it should be instead:

module.exports = () => {
    return /<% emojiSequence %>|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;
};

mathiasbynens commented 5 years ago

@tonton-pixel You’re absolutely right! The clearest way to express this is by using a not-yet(?)-standard RegExp feature, as described here: https://github.com/tc39/proposal-regexp-unicode-sequence-properties#matching-all-emoji-including-emoji-sequences

In other words, emojiSequence expands to what’s described here: https://github.com/tc39/proposal-regexp-unicode-sequence-properties#matching-emoji-sequences

const reEmojiSequence = /\p{Emoji_Flag_Sequence}|\p{Emoji_Tag_Sequence}|\p{Emoji_ZWJ_Sequence}|\p{Emoji_Keycap_Sequence}|\p{Emoji_Modifier_Sequence}/u;

mathiasbynens commented 5 years ago

Actually, the Emoji_Modifier_Base comment is wrong. <% emojiSequence %> includes \p{Emoji_Modifier_Sequence}, but an \p{Emoji_Modifier_Base} symbol that is NOT followed by a \p{Emoji_Modifier} symbol doesn’t form a sequence, but it’s still an emoji. Here are some examples:

☝⛹✊✋✌✍🎅🏃🏄🏊🏋👂👃👆👇👈👉👊👋👌👍👎👏👐👦👧👨👩👮👰👱👲👳👴👵👶👷👸👼💁💂💃💅💆💇💪🕵🕺🖐🖕🖖🙅🙆🙇🙋🙌🙍🙎🙏🚣🚴🚵🚶🛀🤘🤙🤚🤛🤜🤝🤞🤦🤰🤳🤴🤵🤶🤷🤸🤹🤼🤽🤾

So we need \p{Emoji_Modifier_Sequence} (as included in emojiSequence) but in addition, we need \p{Emoji_Modifier_Base}.
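A quick way to see why the \p{Emoji_Modifier_Base} alternative is still needed is to test a bare modifier base, with no modifier and no U+FE0F after it. This is a sketch under the assumption that <% emojiSequence %> contributes nothing for a lone U+261D, so it is left out; the variable names are illustrative.

```javascript
// The pattern proposed earlier in the thread, minus <% emojiSequence %> (assumption).
const withoutBase = /\p{Emoji_Presentation}|\p{Emoji}\uFE0F/u;
// The alternative that the proposal would drop.
const bareBase = /\p{Emoji_Modifier_Base}/u;

const bareIndexPointingUp = '\u261D'; // U+261D on its own

const matchedWithoutBase = withoutBase.test(bareIndexPointingUp);
const matchedByBase = bareBase.test(bareIndexPointingUp);

console.log(matchedWithoutBase); // false
console.log(matchedByBase);      // true
```

U+261D does not have Emoji_Presentation and is not followed by U+FE0F here, so only \p{Emoji_Modifier_Base} picks it up.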

mathiasbynens commented 5 years ago

Note that U+261D has both \p{Emoji_Modifier_Base} and \p{Emoji}.
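The property overlap can be checked directly. The sketch below only inspects U+261D and shows that, at the start of U+261D U+FE0F, two different alternatives can each succeed with a different length, so the one listed first in an alternation decides the result. Variable names are illustrative.

```javascript
const pointingUp = '\u261D'; // U+261D

// Property membership of the single code point U+261D.
const isModifierBase = /^\p{Emoji_Modifier_Base}$/u.test(pointingUp);
const isEmoji = /^\p{Emoji}$/u.test(pointingUp);
const hasEmojiPresentation = /^\p{Emoji_Presentation}$/u.test(pointingUp);

console.log(isModifierBase, isEmoji, hasEmojiPresentation); // true true false

// Both alternatives can match at the start of U+261D U+FE0F, with different lengths.
const withVariationSelector = '\u261D\uFE0F';
const viaModifierBase = withVariationSelector.match(/^\p{Emoji_Modifier_Base}/u)[0];
const viaEmojiSelector = withVariationSelector.match(/^\p{Emoji}\uFE0F/u)[0];

console.log(viaModifierBase.length);  // 1
console.log(viaEmojiSelector.length); // 2
```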

mathiasbynens / emoji-regex

Emoji with U+fe0f match only first character #43