Closed gilmoreorless closed 7 years ago
cc @devongovett
Same issue with ZWJ sequences that start with a skin-toned person (U+1F469 U+1F3FD or U+1F468 U+1F3FF); essentially anything with U+200D.
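For anyone unfamiliar with U+200D (the zero width joiner), here is a minimal sketch that lists the code points of a joined sequence, using the family sequence quoted in this issue. The helper names `codePoints` and `hasZwj` are made up for illustration and are not part of emoji-regex or regexgen.

```javascript
// Build the sequence from escapes so the example does not depend on file encoding.
// woman, ZWJ, girl, ZWJ, boy
const family = "\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

// Hypothetical helper: format every code point of a string as U+XXXX.
// Array.from iterates a string by code point, not by UTF-16 code unit.
function codePoints(str) {
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Hypothetical helper: true when the string contains a zero width joiner.
const hasZwj = (str) => str.includes("\u200D");

console.log(codePoints(family).join(" "));
// → U+1F469 U+200D U+1F467 U+200D U+1F466
console.log(hasZwj(family)); // → true
```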
I've been looking into this over the last few days, and I believe I have a fix, but it's a 2-parter (it will require a fix in regexgen combined with a tweak to the emoji-regex build process). I'll put some PRs together when I next find some spare coding time (I need to test it properly).
@gilmoreorless any way I could assist you on this noble quest? I resorted to generating a regex containing all emojis for now, but as you can imagine it's quite a long regex. I would like to see it shortened properly.
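The workaround mentioned above is not shown in the thread, so the following is only a rough sketch of what a brute-force "all emojis" regex might look like, assuming a plain alternation with the longest sequences listed first. `escapeForRegex` and `buildAlternation` are hypothetical names, and the three-item list stands in for the full emoji data.

```javascript
// Escape regex syntax characters (needed for keycap emoji such as "*" + U+FE0F + U+20E3).
function escapeForRegex(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: one alternative per sequence, longest first,
// so a long ZWJ sequence is tried before any of its prefixes.
function buildAlternation(sequences) {
  const sorted = [...sequences].sort((a, b) => b.length - a.length);
  return new RegExp(sorted.map(escapeForRegex).join("|"), "gu");
}

const woman = "\u{1F469}";
const family = "\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
const keycapStar = "*\uFE0F\u20E3";

const pattern = buildAlternation([woman, family, keycapStar]);
const found = ("x" + family + "y" + keycapStar).match(pattern);
console.log(found.length); // → 2 (the whole family, then the keycap)
```

This is correct but grows with every sequence in the data, which is the length problem a generated, optimised pattern is meant to avoid.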
@suprMax Thanks for the offer of help.
My main problem was making sure I actually understood how regexgen worked, especially the optimisation/simplification process. I've now got a PR open to fix the regexgen bug, and I've got another PR ready to go for emoji-regex once that fix is merged and released. I won't submit the emoji-regex PR before then, as it (unintuitively) makes the problem slightly worse without the regexgen fix. Isn't software fun?
@gilmoreorless regexgen v1.2.4 has been released, including your fix. I'll wait for your PR :) Thanks so much for tackling this!
Works perfectly! Thanks for the fix guys!
This seems to be a similar case to #13, so possibly it's a more-specific instance of devongovett/regexgen#10.
Sequences such as 👩‍👧‍👦 (U+1F469 U+200D U+1F467 U+200D U+1F466, aka "family: woman, girl, boy") match all but the last symbol.
I decided to check all sequences using the same looping tests as the other symbols.
This produces 86 failures, all to do with partial matches.
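The original test code did not survive in this copy of the issue, so here is a minimal sketch of what such a looping check could look like. It assumes a failure means "the first match is not the whole sequence". `findPartialMatches` is a hypothetical helper, and the two patterns are hand-written stand-ins, not emoji-regex output; the real cause inside regexgen may well differ from the simple ordering problem used here to reproduce a partial match.

```javascript
// Hypothetical helper: return every sequence the pattern fails to match in full.
function findPartialMatches(pattern, sequences) {
  const failures = [];
  for (const sequence of sequences) {
    const match = sequence.match(pattern);
    const matched = match === null ? "" : match[0];
    if (matched !== sequence) {
      failures.push({ sequence, matched });
    }
  }
  return failures;
}

const woman = "\u{1F469}";
const girl = "\u{1F467}";
const boy = "\u{1F466}";
const zwj = "\u200D";

const sequences = [
  woman + zwj + girl,             // family: woman, girl
  woman + zwj + girl + zwj + boy, // family: woman, girl, boy
];

// Stand-in patterns (assumption): same alternatives, different order.
const shortFirst = new RegExp(
  `${woman}${zwj}${girl}|${woman}${zwj}${girl}${zwj}${boy}`, "u");
const longFirst = new RegExp(
  `${woman}${zwj}${girl}${zwj}${boy}|${woman}${zwj}${girl}`, "u");

console.log(findPartialMatches(shortFirst, sequences).length); // → 1
console.log(findPartialMatches(longFirst, sequences).length);  // → 0
```

With the shorter alternative first, the three-person family stops matching after "woman, girl", which is the same "all but the last symbol" symptom described above.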