rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
https://docs.rs/regex
Apache License 2.0

UnicodeSetsMode support (`v` flag mode, `\q`) #1142

Open CAD97 opened 10 months ago

CAD97 commented 10 months ago

ECMAScript RegExp supports an interesting feature: v-mode enables \p to match string properties (e.g. \p{RGI_Emoji_Flag_Sequence}, roughly \p{Regional_Indicator}{2}), allows [] character classes to match finite-length strings, and adds the \q{regexp} escape for writing "string literals" in character classes (a nested regex expression where the only valid syntax is literals, escaped literals, and | disjunction).

This support is quite interesting because it allows doing set operations on (finite) sets of (finite) strings (e.g. grapheme clusters / emoji), which can enable matching patterns that might otherwise be achieved with lookahead, e.g. ^[\p{RGI_Emoji_Flag_Sequence}--\q{πŸ‡ΊπŸ‡Έ|πŸ‡¨πŸ‡³|πŸ‡·πŸ‡Ί|πŸ‡¬πŸ‡§|πŸ‡«πŸ‡·}]$ versus ^(?!πŸ‡ΊπŸ‡Έ|πŸ‡¨πŸ‡³|πŸ‡·πŸ‡Ί|πŸ‡¬πŸ‡§|πŸ‡«πŸ‡·)\p{Regional_Indicator}{2}$ (example from MDN). This functionality is required for implementing things such as the UAX#31 Emoji Profile for identifiers, which "requires the use of character sequences, rather than individual code points, in the sets Start and Continue defined by UAX31-D1."

I don't think the regex crate particularly needs to be carrying around additional Unicode tables[^icu4x] for extended \p, but \q seems reasonably straightforward to support without impacting users who don't use it (modulo some code size / compile time), and it makes good chunks of the problem space at least possible to address, even if still annoying[^gen]. Because \q is so restricted, it would even be sufficient to not permit alternation within the \q and require writing multiple \q instead, e.g. making the prior regex ^[\p{RGI_Emoji_Flag_Sequence}--[\q{πŸ‡ΊπŸ‡Έ}\q{πŸ‡¨πŸ‡³}\q{πŸ‡·πŸ‡Ί}\q{πŸ‡¬πŸ‡§}\q{πŸ‡«πŸ‡·}]]$, at no loss of expressiveness, only convenience.

[^icu4x]: At least not until such a day as icu4x allows it to be plausible for other crates to share the same underlying tables and/or for the binary package to have some control over what tables are built into the executable. Along that line, \p would be quite interesting β€” probably something like a fn(&self, &str) -> Option<impl ExactSizeIterator<Item=&str>> callback, to handle v-mode string sets, with a fast path for CodePointInversionLists β€” but that's drifting far off topic.

[^gen]: Listing out each string in the relevant sets in |-delimited lists, oh my.

Is it possible that the regex crate could someday support \q in character classes? I do understand it could be a significant chunk of extra work to allow classes to match longer strings. On the other hand, since sets already need to handle UTF-8's variable 1..=4 byte codepoint encoding, generalizing that to 1.. bytes might not be completely different. And while that kind of processing could be done, it doesn't need to be for what is ultimately a quite niche feature, one which, after resolving set operations and longest-first ordering, is no different from regular top-level alternation.
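
For what it's worth, the lowered alternation form is already expressible today; a tiny illustrative check with the current regex crate (the pattern here is chosen purely for demonstration):

```rust
use regex::Regex;

fn main() {
    // The top-level-alternation equivalent of [\q{πŸ‡ΊπŸ‡Έ}\q{πŸ‡¨πŸ‡³}] already works;
    // what's missing is the class syntax and its set operations.
    let re = Regex::new(r"^(?:πŸ‡ΊπŸ‡Έ|πŸ‡¨πŸ‡³)$").unwrap();
    assert!(re.is_match("πŸ‡ΊπŸ‡Έ"));
    assert!(!re.is_match("πŸ‡«πŸ‡·"));
}
```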

Concrete changes

If the general idea seems sound, I'm willing to try my hand at implementing \q (to be transparent to HIR, thus invisible to regex-automata).

References

BurntSushi commented 10 months ago

I confess that I am somewhat interested in supporting this. In particular, I've found the existing support for class set operations (union, complement, intersection, difference) to be enormously useful when it comes to dealing with Unicode character classes. I'm really glad the regex crate has that support (not many regex engines do; notably, PCRE and RE2 both lack it), because it makes some things expressible that would otherwise be quite difficult to do. I feel like this might fall into a similar bucket, and folks have filed issues about trying to match emoji with regex and it not working like they'd expect.
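
For context, a minimal illustration of the class set operations that already exist today (the pattern and test strings below are just examples):

```rust
use regex::Regex;

fn main() {
    // Class difference: Greek letters that are not lowercase letters.
    // `--` (difference), `&&` (intersection), and `~~` (symmetric
    // difference) are all supported inside character classes today.
    let re = Regex::new(r"[\p{Greek}--\p{Ll}]+").unwrap();
    assert!(re.is_match("Ξ”Ξ£"));
    assert!(!re.is_match("δσ"));
}
```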

I have two general threads of concern here.

Firstly, based on the reading I've done (see the end of this comment for roughly what I read), it seems clear to me that the driving reason behind supporting "properties of strings" is Emoji. That is, I'm sure it's no coincidence that every single string property required by UTS#18 S2.7 is related to emoji. Therefore, I want to make sure that we can hit the Emoji use case. I haven't actually read UTS#51 in full detail yet, and there are aspects I don't yet understand. One is whether matching emoji correctly requires something other than a "simple" regex. For example, UTS#51 specifically provides a regex for matching possible emoji. That regex is already supported today by the regex crate (quite intentionally):

$ cat ~/tmp/emoji.regex
(?x)\p{RI} \p{RI}
| \p{Emoji}
  ( \p{EMod}
  | \x{FE0F} \x{20E3}?
  | [\x{E0020}-\x{E007E}]+ \x{E007F}
  )?
  (\x{200D}
    ( \p{RI} \p{RI}
    | \p{Emoji}
      ( \p{EMod}
      | \x{FE0F} \x{20E3}?
      | [\x{E0020}-\x{E007E}]+ \x{E007F}
      )?
    )
  )*

$ regex-cli find match meta -p "$(cat ~/tmp/emoji.regex)" -y '⚽'
     parse time:  138.955Β΅s
 translate time:  126.662Β΅s
build meta time:  1.446184ms
    search time:  145.104Β΅s
  total matches:  1
0:0:3:⚽

$ regex-cli find match meta -p "$(cat ~/tmp/emoji.regex)" -y 'πŸ‘¨πŸΎβ€βš•οΈ'
     parse time:  143.247Β΅s
 translate time:  104.561Β΅s
build meta time:  1.455044ms
    search time:  304.674Β΅s
  total matches:  1
0:0:17:πŸ‘¨πŸΎβ€βš•οΈ

But...

$ regex-cli find match meta -p "$(cat ~/tmp/emoji.regex)" -y '5'
     parse time:  125.205Β΅s
 translate time:  105.963Β΅s
build meta time:  1.433962ms
    search time:  123.723Β΅s
  total matches:  1
0:0:1:5

So clearly the "possible" part here is rather relevant, since the regex matches ASCII digits. Because I haven't read UTS#51 in full yet, I don't know what extra validity checking steps are needed here. And in general, I would want to know this before we start on adding support for this because I want to make sure that matching emoji is actually feasible to do. That is, if we had properties of strings, how would \p{RGI_Emoji} be implemented? Is it literally just a table of strings? If so, isn't that impractically large?
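
To make "literally just a table of strings" concrete, a hypothetical representation could be as simple as a sorted slice searched by binary search; the handful of entries below are real RGI emoji, but the real table would have thousands of entries:

```rust
// Hypothetical sketch of a property of strings as a plain table. The real
// RGI_Emoji set has thousands of entries; only a few are shown here.
static RGI_EMOJI_SAMPLE: &[&str] = &["⚽", "β€οΈβ€πŸ”₯", "πŸ‡«πŸ‡·", "πŸ‘¨πŸΎβ€βš•οΈ"];

fn is_rgi_emoji_sample(s: &str) -> bool {
    // The table is sorted by byte order, so membership is a binary search.
    RGI_EMOJI_SAMPLE.binary_search(&s).is_ok()
}

fn main() {
    assert!(is_rgi_emoji_sample("⚽"));
    assert!(!is_rgi_emoji_sample("5"));
}
```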

I'd also like to point out how long it takes to build the regex above. It's already at 1.5ms. I'm sure aspects of that could be further optimized, but I don't see it getting much better. In particular, large Unicode classes add a lot of overhead to regex compilation. Will properties of strings be even worse? If so, how much worse? I feel like big Unicode classes are already pushing the limits of what is acceptable and I worry that big Unicode sets of strings might be even worse.

The second general concern I have is implementation. While you astutely point out that the regex crate kind of already supports string classes by virtue of the fact that UTF-8 is a variable width encoding, this doesn't actually help too much. It's true that building the NFA shouldn't be an issue here, but there really are some significant issues with how the implementation would be adapted. I believe the fundamental issue is that while UTF-8 is a variable width encoding, that isn't made manifest until the NFA is constructed. Before that, particularly in the HIR, UTF-8 isn't really a thing. Instead, the HIR is a logical description of a regex matching characters, where a character can be either a single byte or a single codepoint. And the definition of character (among other things) can be changed by toggling Unicode mode via the u flag in the pattern. In particular, notice that character classes are limited to only the characters implied by the current mode:

$ regex-cli debug hir '(?-u)β˜ƒ'
    parse time:  43.355Β΅s
translate time:  21.032Β΅s

Literal(
    "β˜ƒ",
)

$ regex-cli debug hir '(?-u)[β˜ƒ]'
failed to translate pattern 0 to HIR: '(?-u)[β˜ƒ]': failed to translate pattern: regex parse error:
    (?-u)[β˜ƒ]
          ^
error: Unicode not allowed here
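
The same restriction is visible programmatically through regex-syntax; a minimal check using the public regex_syntax::Parser entry point:

```rust
fn main() {
    // Outside a class, a non-ASCII literal is fine even with Unicode mode
    // disabled: it is simply matched as its UTF-8 bytes.
    assert!(regex_syntax::Parser::new().parse(r"(?-u)β˜ƒ").is_ok());
    // Inside a class, the same codepoint is rejected ("Unicode not allowed
    // here"): a (?-u) class is a class of bytes, not of codepoints.
    assert!(regex_syntax::Parser::new().parse(r"(?-u)[β˜ƒ]").is_err());
}
```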

There's a reason this restriction exists: mixing up what it means to be a "character" within a single character class is tricky, and lifting it would require significant surgery to how character classes in the HIR are represented. Evolving character classes into arbitrary finite sets of strings will thus require significant surgery as well.

The place where this really comes up is in the implementation of Unicode character class set operations. You don't need to read that code in depth, but the key fact to understand is that it is tightly coupled to the concept of Unicode codepoint ranges. (That is, a Vec<RangeInclusive<u32>>, where the ranges are sorted, non-overlapping and non-adjacent.)
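
To make that concrete, here is a minimal sketch (not the actual regex-syntax internals) of that representation and one of its set operations; the trouble is that these range-merging sweeps have no obvious counterpart once elements are arbitrary finite strings:

```rust
use std::ops::RangeInclusive;

/// Sketch of a Unicode class as a sorted, non-overlapping, non-adjacent
/// list of codepoint ranges, which is what the set operations work over.
#[derive(Debug)]
struct ClassSketch {
    ranges: Vec<RangeInclusive<u32>>,
}

impl ClassSketch {
    /// Union as a merge of sorted range lists; intersection and difference
    /// are similar range-by-range sweeps.
    fn union(&self, other: &ClassSketch) -> ClassSketch {
        let mut all: Vec<RangeInclusive<u32>> = self
            .ranges
            .iter()
            .chain(other.ranges.iter())
            .cloned()
            .collect();
        all.sort_by_key(|r| *r.start());
        let mut merged: Vec<RangeInclusive<u32>> = Vec::new();
        for r in all {
            match merged.last_mut() {
                // Merge ranges that overlap or are adjacent.
                Some(last) if *r.start() <= last.end().saturating_add(1) => {
                    let end = (*last.end()).max(*r.end());
                    *last = *last.start()..=end;
                }
                _ => merged.push(r),
            }
        }
        ClassSketch { ranges: merged }
    }
}

fn main() {
    // [a-f] union [d-m] = [a-m]
    let a = ClassSketch { ranges: vec!['a' as u32..='f' as u32] };
    let b = ClassSketch { ranges: vec!['d' as u32..='m' as u32] };
    assert_eq!(a.union(&b).ranges, vec!['a' as u32..='m' as u32]);
}
```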

My suspicion is that all of this would need to be completely gutted and rethought to account for arbitrary sets of finite strings.

The pain doesn't stop there, unfortunately. The lowering from an HIR to an NFA (where UTF-8 is made manifest) also exploits the fact that classes are just ranges of characters. This is also where a lot of the time spent compiling large Unicode classes goes. I suspect that the existing technique used, Daciuk's algorithm, cannot be easily adapted to finite sets of strings. Daciuk's algorithm requires that the elements in the set are sorted lexicographically, but UTS#18 requires that \q{lit1 | lit2 | ... | litN} has its literals sorted such that the longest matches first, which might conflict with the lexicographic ordering requirement. It's unclear to me whether some modification could make it work, but if Daciuk's algorithm doesn't work, then compiling finite sets of strings to an NFA will require a different path. That isn't a huge problem (as those paths already exist), but it will be more expensive in terms of time, memory and match speed.

> It's essentially a fancy wrapper around HashSet<String>.

Logically, yes. But that won't work as an implementation because it would require explicitly representing every codepoint in a class as a distinct element in a HashSet. Performance-wise, that's a non-starter, because it is very common for a class to contain on the order of 100,000 codepoints.


Additional reading:

Note: I found it interesting that /sam|samwise/gv and /[\q{sam|samwise}]/gv do not have the same match semantics! The former will match sam in samwise, but the latter will match samwise. This is explicitly called out in UTS#18 S2.2.1:

> In implementing Character Classes with strings, the expression /[a-m \q{ch|chh|rr|} Ξ²-ΞΎ]/ should behave as the alternation /(chh | ch | rr | [a-mΞ²-ΞΎ] | )/. Note that such an alternation must have the multi-code point strings ordered as longest-first to work correctly in arbitrary regex engines, because some regex engines try the leftmost matching alternative first. Therefore it does not work to have shorter strings first. The exception is where those shorter strings are not initial substrings of longer strings.
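
As an illustration of that lowering rule, here is a minimal sketch of turning a \q literal set into a longest-first alternation; the helper below is purely illustrative and not part of any crate:

```rust
use regex::escape;

// Hypothetical lowering of [\q{lit1|lit2|...}] into a plain alternation
// ordered longest-first, as UTS#18 S2.2.1 asks for.
fn lower_q_class(mut literals: Vec<&str>) -> String {
    // Longest first, so a leftmost-first engine prefers "chh" over "ch".
    literals.sort_by(|a, b| b.len().cmp(&a.len()).then(a.cmp(b)));
    let alts: Vec<String> = literals.iter().map(|lit| escape(lit)).collect();
    format!("(?:{})", alts.join("|"))
}

fn main() {
    // [\q{ch|chh|rr}] lowers to (?:chh|ch|rr).
    assert_eq!(lower_q_class(vec!["ch", "chh", "rr"]), "(?:chh|ch|rr)");
}
```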

CAD97 commented 10 months ago

> Daciuk's algorithm requires that the elements in the set are sorted lexicographically, but UTS#18 requires that \q{lit1 | lit2 | ... | litN} has its literals sorted such that the longest matches first, which might conflict with the lexicographic ordering requirement.

I think this should be possible to adapt fairly straightforwardly (though I very easily could have overlooked something) by building from reverse lexicographic order instead, which naturally processes longer strings before shorter ones. The simple set approach should probably be sufficient for sets manually authored with \q, since those are naturally bounded by practical size constraints on the regexp itself, but I don't have any insights to suggest w.r.t. generalized \p. Browsers do support the set operations, though, so maybe looking into how V8 and/or SpiderMonkey handle it might be useful? Although not really, if it turns out to be done via lookahead assertions.
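
A tiny check of the ordering observation (illustrative only; whether Daciuk's construction itself tolerates this order is exactly the open question): descending lexicographic order places any string before its own proper prefixes, i.e. longest-first among prefix-related literals.

```rust
fn main() {
    let mut lits = vec!["ch", "chh", "rr", "sam", "samwise"];
    // Descending (reverse) lexicographic order: a string always sorts before
    // its own proper prefixes, so prefix-related literals come out
    // longest-first.
    lits.sort_by(|a, b| b.cmp(a));
    assert_eq!(lits, vec!["samwise", "sam", "rr", "chh", "ch"]);
}
```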

BurntSushi commented 10 months ago

Yes, I wouldn't expect the browser implementations to be of much help. They can use different techniques due to the use of backtracking and interpreter structure. Here, we really have to pound everything down into an FSM. Still, it might be worth looking at them to get a sense of what they do. I don't know for sure that it won't be helpful. But it is usually the case that backtracking and FSM engines use very different approaches.

I'm skeptical of Daciuk's algorithm being able to work in reverse, but I can't point out a flaw with it off the top of my head.

> The simple set approach should probably be sufficient for sets manually authored with \q,

If it's just \q, then sure, that shouldn't be much of a problem. It's when they get thrown in with huge sets of codepoints that it becomes an issue I think.

CAD97 commented 8 months ago

I looked into UTS#51 a bit out of curiosity. As far as I can tell, there's no way to exactly match the UTS#51Β§1.4.6 emoji sets other than direct validity tests against the set inclusion list, potentially accelerated by the possible match EBNF as a scanner. (I haven't looked at UTS#18, but I did scan ECMA262 specifically to find the list of supported binary properties of strings.)

Even if we don't support string sets, I do think recognizing (?v, \q, and string-set \p in order to give more specific errors would be useful. I'm hoping to put together a PR for that soonish (but anyone should feel free to beat me to it).

Even in the absence of "string class" set operations, string properties would still be interesting (and less effort) to support β€” it'd be sufficient for implementing UAX#31's emoji profile, for example. I might throw together a script to build these as regexes to see how regex (and logos and tree-sitter) handle them. A little bit of cleverness can build a more compact match than pure alternation, e.g. it currently (these properties are not stable between Unicode versions) holds that

Regex golf is fun. Doing such lowering ahead of time is of course reliant on not doing set manipulation. I don't intend to present any solution for handling dynamically built sets yet, and I admit that \p{RGI_Emoji} being 3,782 elements makes a naive[^set] HashSet<String> unlikely to work well. v-mode might end up being functionality that needs to be layered on top of the regex crate Γ  la fancy-regex, where it can take a scan-and-verify (lookaround) approach.

[^set]: You could do a bit better with a SmartString keeping short strings inline (about β…” of them), and/or by symbolizing RGI sets and/or strings, but ultimately you're still manipulating a set with upwards of 2ΒΉΒΉ elements and somehow need to lower that to the automata.
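
A minimal sketch of that scan-and-verify layering; the candidate pattern and the valid set below are tiny stand-ins, not real UTS#51 data:

```rust
use std::collections::HashSet;
use regex::Regex;

// Scan with a loose "possible" pattern, then verify each candidate against
// a set of known-valid strings (stand-in data; not the real RGI tables).
fn find_valid<'h>(
    possible: &Regex,
    valid: &HashSet<&str>,
    haystack: &'h str,
) -> Vec<&'h str> {
    possible
        .find_iter(haystack)
        .map(|m| m.as_str())
        .filter(|s| valid.contains(*s))
        .collect()
}

fn main() {
    // Stand-in candidate scanner: any run of non-ASCII characters.
    let possible = Regex::new(r"[^\x00-\x7F]+").unwrap();
    let valid: HashSet<&str> = ["⚽", "πŸ‡«πŸ‡·"].into_iter().collect();
    assert_eq!(find_valid(&possible, &valid, "5 ⚽ πŸ‡«πŸ‡·"), vec!["⚽", "πŸ‡«πŸ‡·"]);
}
```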


A final aside: the main problem with the standard possible-emoji regex is that it also matches strings that are definitely not emoji. It may be possible to construct a slightly stronger regex which doesn't fully constrain itself to RGI strings without greatly increasing match complexity. Disclaimer: I have not yet attempted to build and compare such a "likely emoji" test against the "possible emoji" test, and UTS#51 does acknowledge the concept as "many times" more complicated, while (by design) still requiring validity tests[^likely]. However, I posit that eliminating "obvious" false positives may be useful enough to justify that cost in sufficiently fuzzy applications, even without further verification. (Although it may enable wrongly bypassing said verification, since it works well enough to make the need nonobvious.)

[^likely]: E.g. refining \p{Emoji}(?&emoji_modification)? to \p{Emoji}(?&emoji_modification)|[\p{Emoji}&&\p{Emoji_Presentation=Yes}], which should already exclude the worst single-codepoint false positives like 5. But this may also cause false negatives on valid ZWJ sequences, and I didn't check whether any such cases exist within the RGI set. Any such refinement must still accept all of \p{RGI_Emoji} to be considered valid, but is allowed to reject otherwise valid emoji sequences (even if they are treated like emoji by some fonts/vendors).

CAD97 commented 8 months ago

> I might throw together a script to build these as regexes

Nerd sniped myself. Regexes and builder script, getting semi-reasonable regex sizes (except RGI_Emoji_ZWJ_Sequence, which I haven't attempted to golf yet): https://gist.github.com/CAD97/9f14e8f691c59cb773d9458793c3cbb2

# this is nushell 0.90.1, working with data from Unicode Emoji, Version 15.1
~> let rgi_emoji = (emoji regex) | values | reverse | each { $"\(?:($in))" } | str join '|'
~> $rgi_emoji | str length
66131
~> regex-cli find match meta -p $rgi_emoji -y ⚽
     parse time:  2.497ms
 translate time:  3.613ms
build meta time:  8.388ms
    search time:  1.3928ms
  total matches:  1
0:0:3:⚽
~> regex-cli find match meta -p $rgi_emoji -y β€βš•οΈ
     parse time:  3.1441ms
 translate time:  3.8318ms
build meta time:  10.1181ms
    search time:  2.4802ms
  total matches:  1
0:3:9:βš•οΈ

~> let rgi_emoji_lite = (emoji regex) | reject RGI_Emoji_ZWJ_Sequence | values | reverse | each { $"\(?:($in))" } | str join '|'
~> $rgi_emoji_lite | str length
36950
~> regex-cli find match meta -p $rgi_emoji_lite -y ⚽
     parse time:  1.4788ms
 translate time:  2.8648ms
build meta time:  4.0901ms
    search time:  782.4Β΅s
  total matches:  1
0:0:3:⚽
~> regex-cli find match meta -p $rgi_emoji_lite -y β€βš•οΈ
     parse time:  1.4439ms
 translate time:  2.9059ms
build meta time:  3.9654ms
    search time:  970.4Β΅s
  total matches:  1
0:3:9:βš•οΈ

~> (emoji regex).Basic_Emoji | str length
2529
~> regex-cli find match meta -p (emoji regex).Basic_Emoji -y ⚽
     parse time:  126Β΅s
 translate time:  198.2Β΅s
build meta time:  577.9Β΅s
    search time:  55.6Β΅s
  total matches:  1
0:0:3:⚽

# comparison
~> regex-cli find match meta -p '(?x)\p{RI}\p{RI}|\p{Emoji}(\p{EMod}|\x{FE0F}\x{20E3}?|[\x{E0020}-\x{E007E}]+\x{E007F})?(\x{200D}(\p{RI}\p{RI}|\p{Emoji}(\p{EMod}|\x{FE0F}\x{20E3}?|[\x{E0020}-\x{E007E}]+\x{E007F})?))*' -y ⚽
     parse time:  81.6Β΅s
 translate time:  75.4Β΅s
build meta time:  674.1Β΅s
    search time:  69.4Β΅s
  total matches:  1
0:0:3:⚽
# apparently my CPU enjoys this task more than yours does πŸ˜…

This should at least give an estimate of the processing time for the "handle it in the automata" approach. I'm not super hopeful that it's possible to do an order of magnitude better without handling these string sets at a different layer than this.

BurntSushi commented 8 months ago

Can you try with regex-cli debug thompson -q? That will give us a little more targeted info. Particularly about how big these things are.

10ms doesn't excite me, but it is perhaps on the boundary of reasonableness for stuff like this.

CAD97 commented 8 months ago
~> regex-cli debug thompson -q $rgi_emoji
        parse time:  2.7808ms
    translate time:  3.811ms
  compile nfa time:  3.9529ms
            memory:  1367332
            states:  56156
       pattern len:  1
       capture len:  1
        has empty?:  false
          is utf8?:  true
       is reverse?:  false
   line terminator:  "\n"
       lookset any:  βˆ…
lookset prefix any:  βˆ…
~> regex-cli debug thompson -q $rgi_emoji_lite
        parse time:  1.4414ms
    translate time:  2.0979ms
  compile nfa time:  2.1962ms
            memory:  696444
            states:  28447
       pattern len:  1
       capture len:  1
        has empty?:  false
          is utf8?:  true
       is reverse?:  false
   line terminator:  "\n"
       lookset any:  βˆ…
lookset prefix any:  βˆ…

--shrink doesn't do anything for these, so either --shrink doesn't do anything for what's being checked or my patterns (included in the gist) are fairly close to irreducible. Parse time should be bypassable if regex-syntax (optionally) bundles prebuilt HIR for these sets.

BurntSushi commented 8 months ago

It's a bit of an implementation detail, but I hint in the docs that NFA shrinking is only applicable to reverse NFAs. (Pretty much all of the CLI options are designed to be in one-to-one correspondence with the config knobs in the library.)

For the forward direction, Daciuk's algorithm is used to build nearly minimal DFAs.

It's probably worth looking at the reverse NFAs too, since they will need to be built in most scenarios. They are typically bigger. Otherwise, the sizes for the forward NFAs look like they're about what I would expect.

CAD97 commented 8 months ago

Reverse NFA stats:

~> regex-cli debug thompson --captures none -rq $rgi_emoji
        parse time:  2.4019ms
    translate time:  3.5685ms
  compile nfa time:  3.2018ms
            memory:  1381396
            states:  56914
       pattern len:  1
       capture len:  0
        has empty?:  false
          is utf8?:  true
       is reverse?:  true
   line terminator:  "\n"
       lookset any:  βˆ…
lookset prefix any:  βˆ…
~> regex-cli debug thompson --captures none -rq $rgi_emoji --shrink
        parse time:  2.5419ms
    translate time:  3.9212ms
  compile nfa time:  4.3885ms
            memory:  1386980
            states:  56292
       pattern len:  1
       capture len:  0
        has empty?:  false
          is utf8?:  true
       is reverse?:  true
   line terminator:  "\n"
       lookset any:  βˆ…
lookset prefix any:  βˆ…

~> regex-cli debug thompson --captures none -rq $rgi_emoji_lite
        parse time:  1.4259ms
    translate time:  2.0568ms
  compile nfa time:  1.659ms
            memory:  710580
            states:  29208
       pattern len:  1
       capture len:  0
        has empty?:  false
          is utf8?:  true
       is reverse?:  true
   line terminator:  "\n"
       lookset any:  βˆ…
lookset prefix any:  βˆ…
~> regex-cli debug thompson --captures none -rq $rgi_emoji_lite --shrink
        parse time:  1.4384ms
    translate time:  2.0845ms
  compile nfa time:  2.98ms
            memory:  716092
            states:  28583
       pattern len:  1
       capture len:  0
        has empty?:  false
          is utf8?:  true
       is reverse?:  true
   line terminator:  "\n"
       lookset any:  βˆ…
lookset prefix any:  βˆ…

So, roughly the same size as the forward NFAs, with a bit more overhead for the cases I've encoded like a[a-z] than for those encoded like aa|ab|ac|....

Perhaps worth reiterating: it should be possible to pre-build the Daciuk trie for these, assuming it can handle longest-first semantics. I don't know how difficult it'd be to actually remember the prebuilt trie, to splice a prebuilt NFA subgraph into the meta engine, or to symbolically thread that set through, but it would cut regex compilation time down for the "uses predefined sets" case.