mathiasbynens / emoji-regex

A regular expression to match all Emoji-only symbols as per the Unicode Standard.
https://mths.be/emoji-regex
MIT License
1.73k stars 174 forks source link

Update regexgen to v1.2.4 #21

Closed gilmoreorless closed 7 years ago

gilmoreorless commented 7 years ago

This is the fix for #16, which turned out to have 2 root causes:

  1. regexgen had a bug that resulted in nested alternations not being correctly sorted by length (fixed in devongovett/regexgen#16).
  2. Due to the way regexgen builds and simplifies its internal representation, I've found that it optimises better when the inputs are provided in order from longest to shortest. As such, sorting the sequences list by length before passing it into the Trie class produces a regex that correctly matches longer sequences before shorter ones.

Interestingly, fixing #16 required both of the points above to be tackled at the same time. Adding a test for every sequence produced the following results:

Scenario Test failures
No change (i.e. the bug reported in #16) 86
Pre-sorting sequences, regexgen unchanged 92
regexgen bug fix, no pre-sorting 80
regexgen bug fix plus pre-sorting (this PR) 0
devongovett commented 7 years ago

Do you think we should do this sorting in regexgen? Might have benefits for other projects as well.

mathiasbynens commented 7 years ago

Yeah, sounds like the sorting should be done in regexgen.

mathiasbynens commented 7 years ago

Merging this now. We can remove the sorting here once it’s implemented in regexgen. Thanks, @gilmoreorless!

gilmoreorless commented 7 years ago

No worries! I wasn't sure about doing the sorting within regexgen itself, for 2 reasons:

  1. I thought this was a problem mostly caused by the very repetitive and combinatorial nature of the emoji ZWJ sequences, and wasn't sure how much it would affect other users of regexgen (can't hurt to add it though).
  2. I wasn't sure where it would go within regexgen. Auto-sorting by length within Trie#addAll would be easy, but if a user is repeatedly calling Trie#add with an unsorted list I don't know where the sorting would go.

Cheers for the quick merge.