Closed jcubic closed 5 years ago
I actually wrote a similar module myself and, after extensive testing, I think I found why the reported match is incorrect.
IMHO, there is a slight flaw in the code used to generate the emoji regular expression:
module.exports = () => {
  // https://mathiasbynens.be/notes/es-unicode-property-escapes#emoji
  return /<% emojiSequence %>|\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;
};
From left to right:
- U+261D U+FE0F is not matched by any emoji sequence parsed from emoji-sequences.txt or emoji-zwj-sequences.txt (the only sequences involving U+261D are the five skin tone variations U+261D U+1F3FB to U+261D U+1F3FF).
- U+261D U+FE0F is not matched by \p{Emoji_Modifier_Base}\p{Emoji_Modifier} for the same reason: \p{Emoji_Modifier}, as defined in emoji-data.txt, can only be one of U+1F3FB to U+1F3FF.
- Because \p{Emoji_Modifier} is optional, U+261D U+FE0F is then tested against \p{Emoji_Modifier_Base} only, and a match is indeed found, but just for the first code point U+261D. Since a regular expression engine is eager, it stops searching as soon as it finds a valid match, which prevents the rest of the expression from being tested, namely the last part \p{Emoji}\uFE0F, which is the one that would have produced the right match (\p{Emoji_Presentation} wouldn't have been a proper candidate either, since it represents characters which have an emoji presentation by default and doesn't include U+261D).
So, I think \p{Emoji_Modifier} should not be optional in \p{Emoji_Modifier_Base}\p{Emoji_Modifier}?.
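The behaviour described above can be reproduced with a cut-down version of the pattern. This is only a sketch: it leaves out the `<% emojiSequence %>` alternative (on the assumption, stated above, that no listed sequence matches U+261D U+FE0F), and the names `originalPattern` and `toCodePoints` are made up for the illustration.

```javascript
// Cut-down version of the pattern: the <% emojiSequence %> alternative is
// omitted (assumption: it contributes no match for this input).
const originalPattern = /\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;

// Format a string as a list of code points, e.g. "U+261D U+FE0F".
const toCodePoints = (text) =>
  [...text]
    .map((ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'))
    .join(' ');

const input = '\u261D\uFE0F';
const found = input.match(originalPattern) || [];

// The first alternative wins with the optional modifier matching nothing,
// so only the base code point is reported and U+FE0F is left behind.
console.log(found.map(toCodePoints)); // [ 'U+261D' ]
```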
Actually, the whole expression could be dropped entirely, since it is already taken care of by the injected <% emojiSequence %>, which contains all the sequences of type Emoji_Modifier_Sequence, which are strictly equivalent.
So, it should be instead:
module.exports = () => {
  return /<% emojiSequence %>|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;
};
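As a quick check of this proposal, here is a minimal sketch in which the injected `<% emojiSequence %>` is replaced by a hand-written stand-in covering only the five skin tone sequences of U+261D. The stand-in and the names `sequenceStandIn`, `proposedPattern`, and `fullMatch` are assumptions for the example, not the generated list.

```javascript
// Hypothetical stand-in for <% emojiSequence %>: only the five skin tone
// sequences of U+261D, not the full generated list.
const sequenceStandIn = '\\u261D[\\u{1F3FB}-\\u{1F3FF}]';

const proposedPattern = new RegExp(
  `${sequenceStandIn}|\\p{Emoji_Presentation}|\\p{Emoji}\\uFE0F`,
  'gu'
);

// True when the first match covers the whole input string.
const fullMatch = (text) => (text.match(proposedPattern) || [])[0] === text;

const withVariationSelector = '\u261D\uFE0F';
const withSkinTone = '\u261D\u{1F3FB}';

console.log(fullMatch(withVariationSelector)); // true
console.log(fullMatch(withSkinTone)); // true
```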
@tonton-pixel You’re absolutely right! The clearest way to express this is by using a not-yet(?)-standard RegExp feature, as described here: https://github.com/tc39/proposal-regexp-unicode-sequence-properties#matching-all-emoji-including-emoji-sequences
In other words, emojiSequence expands to what’s described here: https://github.com/tc39/proposal-regexp-unicode-sequence-properties#matching-emoji-sequences
const reEmojiSequence = /\p{Emoji_Flag_Sequence}|\p{Emoji_Tag_Sequence}|\p{Emoji_ZWJ_Sequence}|\p{Emoji_Keycap_Sequence}|\p{Emoji_Modifier_Sequence}/u;
Actually, the Emoji_Modifier_Base comment is wrong. <% emojiSequence %> includes \p{Emoji_Modifier_Sequence}, but an \p{Emoji_Modifier_Base} symbol that is NOT followed by a \p{Emoji_Modifier} symbol doesn’t form a sequence, but it’s still an emoji. Here are some examples:
☝⛹✊✋✌✍🎅🏃🏄🏊🏋👂👃👆👇👈👉👊👋👌👍👎👏👐👦👧👨👩👮👰👱👲👳👴👵👶👷👸👼💁💂💃💅💆💇💪🕵🕺🖐🖕🖖🙅🙆🙇🙋🙌🙍🙎🙏🚣🚴🚵🚶🛀🤘🤙🤚🤛🤜🤝🤞🤦🤰🤳🤴🤵🤶🤷🤸🤹🤼🤽🤾
So we need \p{Emoji_Modifier_Sequence} (as included in emojiSequence) but in addition, we need \p{Emoji_Modifier_Base}.
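One way to satisfy both requirements is to keep a bare \p{Emoji_Modifier_Base} alternative but try it last, after \p{Emoji}\uFE0F. The sketch below is only an illustration of that idea, not necessarily the fix that was adopted: `candidatePattern` is a hypothetical name, and \p{Emoji_Modifier_Base}\p{Emoji_Modifier} stands in for the modifier sequences that `<% emojiSequence %>` would inject.

```javascript
// Hypothetical ordering: modifier sequences first, the bare modifier base last,
// so that a trailing U+FE0F or skin tone modifier is consumed when present.
const candidatePattern = /\p{Emoji_Modifier_Base}\p{Emoji_Modifier}|\p{Emoji_Presentation}|\p{Emoji}\uFE0F|\p{Emoji_Modifier_Base}/gu;

// True when the first match covers the whole input string.
const matchesInFull = (text) => (text.match(candidatePattern) || [])[0] === text;

const samples = {
  bareBase: '\u261D',                   // modifier base on its own
  withVariationSelector: '\u261D\uFE0F', // base followed by U+FE0F
  withSkinTone: '\u261D\u{1F3FB}',       // base followed by a modifier
};

for (const [label, text] of Object.entries(samples)) {
  console.log(label, matchesInFull(text)); // each line ends with "true"
}
```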
Note that U+261D has both \p{Emoji_Modifier_Base} and \p{Emoji}.
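These property memberships can be checked directly with Unicode property escapes. A small sketch, where `hasProperty` is a helper name invented for the example:

```javascript
// Test whether a single code point has the given binary Unicode property.
const hasProperty = (property, ch) =>
  new RegExp('^\\p{' + property + '}$', 'u').test(ch);

const pointingUp = '\u261D';

console.log(hasProperty('Emoji_Modifier_Base', pointingUp)); // true
console.log(hasProperty('Emoji', pointingUp)); // true
console.log(hasProperty('Emoji_Presentation', pointingUp)); // false
```

Since U+261D satisfies both properties, whichever alternative appears first in the alternation decides how much of the input gets consumed.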
Another emoji that doesn't match properly: it matches only the first character ☝