swiftlang / swift-experimental-string-processing

An early experimental general-purpose pattern matching engine for Swift.
Apache License 2.0
278 stars 47 forks source link

Optimize search for start-anchored regexes #682

Closed natecook1000 closed 1 year ago

natecook1000 commented 1 year ago

When a regex is anchored to the start of a subject, there's no need to search throughout a string for the pattern when searching for the first match: a prefix match is sufficient.

This adds a regex compilation-time check about whether a match can only be found at the start of a subject, and then uses that to choose whether to defer to prefixMatch from within firstMatch.

rdar://111392228

natecook1000 commented 1 year ago

@swift-ci Please test

natecook1000 commented 1 year ago

These are the benchmark results I see — I had to modify the AnchoredNotFound benchmark to include a firstMatch search, and it shows a big improvement. AFAICT the rest are noise?

=== Regressions ======================================================================
- CompilerMessages_All                    99.1ms    97.4ms  1.66ms      1.7%
- EmojiRegex_All                          64.1ms    63.1ms  932µs       1.5%
- EmojiRegex_All_Scalar                   41.4ms    40.5ms  923µs       2.3%
- EmailRFC_All_Scalar                     46.9ms    46.1ms  746µs       1.6%
- EmailLookaheadNoMatches_All             39ms  38.4ms  539µs       1.4%
- InvertedCCC_All_Scalar                  22.2ms    21.7ms  426µs       2.0%
- ReluctantQuant_Whole                    9.57ms    9.16ms  406µs       4.4%
- EmailLookahead_All                      40.5ms    40.1ms  370µs       0.9%
- SubtractionCCC_All                      20.2ms    19.8ms  347µs       1.7%
- IntersectionCCC_All                     20.5ms    20.2ms  318µs       1.6%
- LiteralSearch_All_Scalar                4.51ms    4.34ms  172µs       4.0%
- Css_All_Scalar                          3.01ms    2.87ms  138µs       4.8%
- AnchoredNotFound_All_Scalar             8.63ms    8.5ms   132µs       1.6%
- Numbers_All                             6.12ms    6ms 125µs       2.1%
- ReluctantQuantWithTerminal_Whole_Scalar 5.45ms    5.37ms  82.2µs      1.5%
- ReluctantQuantWithTerminal_Whole        5.44ms    5.38ms  55.3µs      1.0%
- IPv4Address_Scalar                      2.32ms    2.27ms  50.3µs      2.2%
=== Improvements =====================================================================
- AnchoredNotFound_First                  9.13ms    13.3ms  -4.16ms     -31.3%
- AnchoredNotFound_First_Scalar           5.62ms    8.63ms  -3.01ms     -34.9%
- EmailRFCNoMatches_All_Scalar            126ms 129ms   -2.11ms     -1.6%
- DiceRollsInText_All_Scalar              41.8ms    42.4ms  -645µs      -1.5%
- GraphemeBreakNoCap_All_Scalar           3.54ms    3.89ms  -346µs      -8.9%
- BasicRangeCCC_All_Scalar                9.5ms 9.84ms  -339µs      -3.4%
- BasicCCC_All_Scalar                     9.03ms    9.37ms  -335µs      -3.6%
- InvertedCCC_All                         21.5ms    21.7ms  -263µs      -1.2%
- EmailLookaheadList_Scalar               5.02ms    5.24ms  -226µs      -4.3%
- LiteralSearchNotFound_All_Scalar        4.16ms    4.38ms  -221µs      -5.0%
- GraphemeBreakNoCap_All                  4.25ms    4.45ms  -208µs      -4.7%
- EmailBuiltinCharacterClass_All_Scalar   10.5ms    10.7ms  -199µs      -1.9%
- BasicRangeCCC_All                       9.48ms    9.67ms  -195µs      -2.0%
- CaseInsensitiveCCC_All                  10.2ms    10.4ms  -185µs      -1.8%
- IPv6Address_Scalar                      2.76ms    2.93ms  -170µs      -5.8%
- BasicCCC_All                            8.99ms    9.15ms  -160µs      -1.8%
- LiteralSearchNotFound_All               5.1ms 5.26ms  -156µs      -3.0%
- IPv4Address                             2.53ms    2.66ms  -133µs      -5.0%
- Numbers_All_Scalar                      5.28ms    5.41ms  -122µs      -2.3%
- CaseInsensitiveCCC_All_Scalar           10.2ms    10.3ms  -119µs      -1.1%
- HangulSyllable_All_Scalar               5.09ms    5.21ms  -119µs      -2.3%
- LiteralSearch_All                       5.27ms    5.39ms  -115µs      -2.1%
- Lines_All_Scalar                        1.88ms    1.99ms  -107µs      -5.4%
- HangulSyllable_All                      6.01ms    6.09ms  -79.4µs     -1.3%
- Lines_All                               1.9ms 1.92ms  -18.9µs     -1.0%