mathiasbynens / emoji-regex

A regular expression to match all Emoji-only symbols as per the Unicode Standard.
https://mths.be/emoji-regex
MIT License
1.73k stars 174 forks source link

Many sequences are only being partially matched #16

Closed gilmoreorless closed 7 years ago

gilmoreorless commented 7 years ago

This seems to be a similar case to #13, so possibly it's a more-specific instance of devongovett/regexgen#10.

Sequences such as ๐Ÿ‘ฉโ€๐Ÿ‘งโ€๐Ÿ‘ฆ (U+1F469 U+200D U+1F467 U+200D U+1F466 aka "family: woman, girl, boy") match all but the last symbol:

const wgb = '\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
wgb.match(emojiRegex());
/*
Result:
[
  '๐Ÿ‘ฉโ€๐Ÿ‘ง',  U+1F469 U+200D U+1F467
  '๐Ÿ‘ฆ'   U+1F466
]
*/

I decided to check all sequences using the same looping tests as the other symbols:

        // Test a ZWJ emoji sequence (`emoji-zwj-sequences.txt`).
        test('\u{1F3CA}\u{1F3FD}\u200D\u2640\uFE0F');

+       // Test all emoji sequences
+       const sequences = require('unicode-tr51/sequences.js');
+       for (const sequence of sequences) {
+               test(sequence);
+       }

This produces 86 failures, all to do with partial matches.

mathiasbynens commented 7 years ago

cc @devongovett

max-degterev commented 7 years ago

Same issue with '๐Ÿ‘ฉ๐Ÿฝโ€๐Ÿš€' and '๐Ÿ‘จ๐Ÿฟโ€โœˆ๏ธ', essentially anything with U+200D

gilmoreorless commented 7 years ago

I've been looking into this over the last few days, and I believe I have a fix, but it's a 2-parter (it will require a fix in regexgen combined with a tweak to the emoji-regex build process). I'll put some PRs together when I next find some spare coding time (I need to test it properly).

max-degterev commented 7 years ago

@gilmoreorless any way I could assist you on this noble quest? I resorted to generating a regex containing all emojis for now, but as you can imagine it's quite a long regex. I would like to see it shortened properly.

gilmoreorless commented 7 years ago

@suprMax Thanks for the offer of help. ๐Ÿ™‚

My main problem was making sure I actually understood how regexgen worked, especially the optimisation/simplification process. I've now got a PR open to fix the regexgen bug, and I've got another PR ready to go for emoji-regex once that fix is merged and released. I won't submit the emoji-regex PR before then, as it (unintuitively) makes the problem slightly worse without the regexgen fix. Isn't software fun? ๐Ÿ˜

mathiasbynens commented 7 years ago

@gilmoreorless regexgen v1.2.4 has been released, including your fix. Iโ€™ll wait for your PR :) Thanks so much for tackling this!

max-degterev commented 7 years ago

Works perfectly! Thanks for the fix guys!