mathiasbynens / emoji-regex

A regular expression to match all Emoji-only symbols as per the Unicode Standard.
https://mths.be/emoji-regex
MIT License

Emoji with U+FE0F matches only the first character #43

Closed jcubic closed 5 years ago

jcubic commented 5 years ago

Another emoji that doesn't match properly:

It matches only the first character ☝

tonton-pixel commented 5 years ago

I actually wrote a similar module myself and, after extensive testing, I think I found why the reported match is incorrect.

IMHO, there is a slight flaw in the code used to generate the emoji regular expression:

module.exports = () => {
    // https://mathiasbynens.be/notes/es-unicode-property-escapes#emoji
    return /<% emojiSequence %>|\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;
};

From left to right:

So, I think \p{Emoji_Modifier} should not be optional in \p{Emoji_Modifier_Base}\p{Emoji_Modifier}?.
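A minimal sketch of the behaviour described above, with the build-time `<% emojiSequence %>` part left out (an assumption, made only so the pattern runs as written). Alternatives are tried left to right, so the optional modifier lets the `\p{Emoji_Modifier_Base}` alternative succeed on U+261D alone, and `\p{Emoji}\uFE0F` is never attempted at that position. The names `simplified`, `input` and `firstMatch` are illustrative.

```javascript
// Simplified stand-in for the generated pattern: the injected
// <% emojiSequence %> alternative is omitted (assumption for illustration).
const simplified = /\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;

// U+261D WHITE UP POINTING INDEX followed by U+FE0F VARIATION SELECTOR-16.
const input = '\u261D\uFE0F';

// First match found by the simplified pattern.
const firstMatch = input.match(simplified)[0];

// Only U+261D is consumed; the variation selector is left behind.
console.log(firstMatch.length, input.length); // prints: 1 2
```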

Actually, the whole alternative could be dropped, since it is already covered by the injected <% emojiSequence %>, which contains all the sequences of type Emoji_Modifier_Sequence; these are strictly equivalent.

So it should instead be:

module.exports = () => {
    return /<% emojiSequence %>|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;
};

mathiasbynens commented 5 years ago

@tonton-pixel You’re absolutely right! The clearest way to express this is by using a not-yet(?)-standard RegExp feature, as described here: https://github.com/tc39/proposal-regexp-unicode-sequence-properties#matching-all-emoji-including-emoji-sequences

In other words, emojiSequence expands to what’s described here: https://github.com/tc39/proposal-regexp-unicode-sequence-properties#matching-emoji-sequences

const reEmojiSequence = /\p{Emoji_Flag_Sequence}|\p{Emoji_Tag_Sequence}|\p{Emoji_ZWJ_Sequence}|\p{Emoji_Keycap_Sequence}|\p{Emoji_Modifier_Sequence}/u;
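
As a hedged illustration of why the sequences have to be injected at build time: with the `u` flag, an engine only accepts property names it knows, and the sequence names used in the pattern above come from the proposal rather than from that set. The helper `compiles` below is hypothetical and only reports whether a pattern is accepted.

```javascript
// Hypothetical helper: reports whether the engine accepts a pattern.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (error) {
    // An unknown property name is reported as a SyntaxError.
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// Binary emoji property: accepted in `u` mode.
const binaryPropertyOk = compiles('\\p{Emoji_Modifier_Base}', 'u');

// Sequence property name from the proposal: rejected in `u` mode.
const sequencePropertyOk = compiles('\\p{Emoji_Modifier_Sequence}', 'u');

console.log(binaryPropertyOk, sequencePropertyOk); // prints: true false
```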

mathiasbynens commented 5 years ago

Actually, the Emoji_Modifier_Base comment is wrong. <% emojiSequence %> includes \p{Emoji_Modifier_Sequence}, but an \p{Emoji_Modifier_Base} symbol that is NOT followed by a \p{Emoji_Modifier} symbol doesn’t form a sequence, but it’s still an emoji. Here are some examples:

☝⛹✊✋✌✍🎅🏃🏄🏊🏋👂👃👆👇👈👉👊👋👌👍👎👏👐👦👧👨👩👮👰👱👲👳👴👵👶👷👸👼💁💂💃💅💆💇💪🕵🕺🖐🖕🖖🙅🙆🙇🙋🙌🙍🙎🙏🚣🚴🚵🚶🛀🤘🤙🤚🤛🤜🤝🤞🤦🤰🤳🤴🤵🤶🤷🤸🤹🤼🤽🤾

So we need \p{Emoji_Modifier_Sequence} (as included in emojiSequence) but in addition, we need \p{Emoji_Modifier_Base}.
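
To illustrate, here is a sketch that again leaves out the injected sequences (an assumption). The names `withoutBase` and `reordered` are hypothetical, and the reordering is only one possible arrangement, not necessarily the fix adopted in the module.

```javascript
// Pattern with the Emoji_Modifier_Base alternative dropped
// (injected sequences omitted, as an assumption for illustration).
const withoutBase = /\p{Emoji_Presentation}|\p{Emoji}\uFE0F/u;

// A bare U+261D is not Emoji_Presentation and has no U+FE0F after it,
// so neither remaining alternative matches.
const bareIsMatched = withoutBase.test('\u261D');

// One possible arrangement (hypothetical): try the U+FE0F form first,
// then fall back to a modifier base with an optional modifier.
const reordered = /\p{Emoji}\uFE0F|\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}/gu;

const bare = '\u261D'.match(reordered);                   // bare base
const withSelector = '\u261D\uFE0F'.match(reordered);     // base + U+FE0F
const withModifier = '\u261D\u{1F3FD}'.match(reordered);  // base + skin tone

console.log(bareIsMatched, bare, withSelector, withModifier);
```

With this ordering each of the three inputs is matched as a single unit, whereas the pattern without `\p{Emoji_Modifier_Base}` misses the bare symbol.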

mathiasbynens commented 5 years ago

Note that U+261D has both \p{Emoji_Modifier_Base} and \p{Emoji}.
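
A quick way to check this in an engine that supports Unicode property escapes (a sketch; `hasProperty` is a hypothetical helper, and the result reflects the Unicode data shipped with the engine):

```javascript
// Hypothetical helper: tests one code point against a binary Unicode property.
const hasProperty = (name, ch) => new RegExp(`^\\p{${name}}$`, 'u').test(ch);

const indexPointingUp = '\u261D';

const properties = {
  Emoji: hasProperty('Emoji', indexPointingUp),
  Emoji_Modifier_Base: hasProperty('Emoji_Modifier_Base', indexPointingUp),
  Emoji_Presentation: hasProperty('Emoji_Presentation', indexPointingUp),
};

console.log(properties);
```

Since U+261D does not have `Emoji_Presentation`, only the `\p{Emoji_Modifier_Base}` and `\p{Emoji}\uFE0F` alternatives can apply to it, which is why their order matters.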