oracle / truffleruby

A high performance implementation of the Ruby programming language, built on GraalVM.
https://www.graalvm.org/ruby/
Other
3.01k stars 184 forks source link

Regex line anchors much slower than global anchors #2429

Open nirvdrum opened 3 years ago

nirvdrum commented 3 years ago

While investigating regex performance, I've noticed that line anchors are much slower than global anchors, even when a string consists of a single line. Most regexps that I've encountered that use anchors use the line variants. I've tried with a variety of patterns and the effect is visible in each case:

# frozen_string_literal: true

require 'benchmark/ips'

SUBJECT = "abc"

Benchmark.ips do |x|
  x.report('blank? check (global)') do
    SUBJECT.match?(/\A[[:space:]]*\z/)
  end

  x.report('blank? check (line)') do
    SUBJECT.match?(/^[[:space:]]*$/)
  end

  x.report('exact match (global)') do
    SUBJECT.match?(/\Aabc\z/)
  end

  x.report('exact match (line)') do
    SUBJECT.match?(/^abc$/)
  end

  x.report('wildcard check (global)') do
    SUBJECT.match?(/\A.*\z/)
  end

  x.report('wildcard check (line)') do
    SUBJECT.match?(/^.*$/)
  end

  x.report('start only (all)') do
    SUBJECT.match?(/\Aabc/)
  end

  x.report('start only (line)') do
    SUBJECT.match?(/^abc/)
  end

  x.report('end only (all)') do
    SUBJECT.match?(/abc\z/)
  end

  x.report('end only (line)') do
    SUBJECT.match?(/abc$/)
  end
end
> jt ruby -v regexp-anchor-benchmarks.rb
Using TruffleRuby with Graal: mxbuild/truffleruby-jvm-ce
$ /home/nirvdrum/dev/workspaces/truffleruby-ws/truffleruby/mxbuild/truffleruby-jvm-ce/languages/ruby/bin/ruby \
  --experimental-options \
  --core-load-path=/home/nirvdrum/dev/workspaces/truffleruby-ws/truffleruby/src/main/ruby/truffleruby \
  -v \
  regexp-anchor-benchmarks.rb
truffleruby 21.3.0-dev-ac4aa22b, like ruby 2.7.3, GraalVM CE JVM [x86_64-linux]
Warming up --------------------------------------
blank? check (global)
                         5.800M i/100ms
 blank? check (line)    13.234M i/100ms
exact match (global)    12.170M i/100ms
  exact match (line)     9.117M i/100ms
wildcard check (global)
                        13.231M i/100ms
wildcard check (line)
                        12.099M i/100ms
    start only (all)    14.506M i/100ms
   start only (line)    10.191M i/100ms
      end only (all)    13.645M i/100ms
     end only (line)    10.353M i/100ms
Calculating -------------------------------------
blank? check (global)
                        178.517M (± 4.5%) i/s -    893.147M in   5.018551s
 blank? check (line)    131.881M (± 4.3%) i/s -    661.704M in   5.027756s
exact match (global)    136.003M (± 8.2%) i/s -    681.540M in   5.052745s
  exact match (line)     99.360M (± 7.9%) i/s -    501.429M in   5.084205s
wildcard check (global)
                        126.612M (± 5.8%) i/s -    635.083M in   5.034580s
wildcard check (line)
                        122.111M (± 7.2%) i/s -    617.074M in   5.086444s
    start only (all)    140.055M (± 5.9%) i/s -    710.804M in   5.096991s
   start only (line)    100.616M (± 8.1%) i/s -    499.345M in   5.005298s
      end only (all)    134.511M (± 8.4%) i/s -    668.615M in   5.014038s
     end only (line)     97.928M (±10.4%) i/s -    486.580M in   5.031653s

NB: I have not investigated the \Z anchor.

bjfish commented 3 years ago

cc: @jirkamarsik @djoooooe

djoooooe commented 3 years ago

I would say that this is to be expected, since line anchors are more complex than global anchors - for example: /\Aabc\z/ is equivalent to "abc".equals(SUBJECT), which of course is not allowed when handling /^abc$/. The latter translates to something like this (in pseudo-Java):

if (SUBJECT.startsWith("abc")) {
    return ...;
}
int pos =  SUBJECT.indexOf('\n');
while (pos >= 0) {
    if (SUBJECT.regionEquals(pos, "abc") && (SUBJECT.charAt(pos+3) == '\n' || pos+3 == SUBJECT.length())) {
        return ...;
    }
    pos =  SUBJECT.indexOf('\n', pos);
}
nirvdrum commented 3 years ago

I see what you're saying for the general case, but by far the most common case that I'm seeing is a SUBJECT without any newline characters at all. I strongly suspect the next most common case is a SUBJECT that contains a newline, but still matches from position 0 (i.e., it doesn't match from the middle of the subject).

Adapting your pseudocode, I'd expect something more along the lines of:

if (SUBJECT.startsWith(pattern) && (SUBJECT.length() == pattern.length() || SUBJECT.charAt(pattern.length()) == '\n') {
    // Profile for most common success case.
    return ...;
}

int pos =  SUBJECT.indexOf('\n');

if (pos == -1) {
    // Profile for most common failure case.
    return failure;
} else {
    // Profile for least-most common case... probably never even encountered.

    while (pos >= 0) {
        if (SUBJECT.length() - pos > pattern.length()) {
            return failure;
        }

        if (SUBJECT.regionEquals(pos, pattern) && (SUBJECT.charAt(pos + pattern.length()) == '\n' || pos + pattern.length() == SUBJECT.length())) {
            return ...;
        }

        pos =  SUBJECT.indexOf('\n', pos);
    }
}

I'm glossing over some things here. Foremost, I'm completely ignoring multi-byte characters. But that's okay be in the case I'm benchmarking, as both the pattern and the subject consist only of ASCII characters. I also don't know which String operations are intrinsified and that may account for a performance hit. I'm assuming there are no reference equality checks being performed and that in either the global or line anchor case, you have to do a byte walk on the subject string. That likely implies using ArrayUtils with a single walk through the subject string.

nirvdrum commented 3 years ago

FYI, I've updated the original benchmarks and results to include cases where only one of the anchors is being used. That way, we can consider cases other than exact string equality matches.

nirvdrum commented 3 years ago

For what it's worth, in my extremely non-scientific experience, I've rarely ever encountered anyone using \A and \z. I'd hazard to say most Rubyists don't even know those exist, especially those coming from other languages that may only support ^ and $ and possibly some sort of multi-line modifier. In MRI, there's largely no performance difference between the two in the cases from the benchmarks:

> ruby -v regexp-anchor-benchmarks.rb 
ruby 2.7.3p183 (2021-04-05 revision 6847ee089d) [x86_64-linux]
Warming up --------------------------------------
blank? check (global)
                       833.383k i/100ms
 blank? check (line)   643.638k i/100ms
exact match (global)   781.592k i/100ms
  exact match (line)   810.349k i/100ms
wildcard check (global)
                       781.245k i/100ms
wildcard check (line)
                       766.786k i/100ms
    start only (all)   797.408k i/100ms
   start only (line)   832.383k i/100ms
      end only (all)   821.866k i/100ms
     end only (line)   864.258k i/100ms
Calculating -------------------------------------
blank? check (global)
                          8.282M (± 1.9%) i/s -     41.669M in   5.033028s
 blank? check (line)      6.421M (± 2.2%) i/s -     32.182M in   5.014957s
exact match (global)      7.954M (± 0.8%) i/s -     39.861M in   5.011967s
  exact match (line)      8.139M (± 2.0%) i/s -     41.328M in   5.079836s
wildcard check (global)
                          7.720M (± 1.0%) i/s -     39.062M in   5.060545s
wildcard check (line)
                          7.785M (± 1.9%) i/s -     39.106M in   5.024855s
    start only (all)      8.101M (± 2.4%) i/s -     40.668M in   5.023193s
   start only (line)      8.067M (± 3.3%) i/s -     40.787M in   5.061986s
      end only (all)      7.959M (± 3.2%) i/s -     40.271M in   5.065631s
     end only (line)      8.243M (± 1.8%) i/s -     41.484M in   5.034518s

Now, in these cases, TRegex is already way faster than MRI's regexps. But, a fundamental goal in TruffleRuby is to optimize Ruby the way people really write Ruby, not the way we wish them to write Ruby. I'd like to avoid discussions like "Oh, you should really use \A instead of ^ if you're dealing with single-line strings because it's 33% faster." Maybe it's unavoidable, but it looks to me like we could further optimize the most common cases.

djoooooe commented 3 years ago

I see what you mean, and i'll put it on the list of possible future optimizations, but keep in mind that optimizations like this are speculating on the input string's content and introduce additional deopts. MRI is equally slow in both cases (more than 10x slower than TRegex in all cases) because it doesn't JIT compile regex matchers, so i wouldn't really say that that's an argument in favor of MRI's behavior.

nirvdrum commented 3 years ago

The note about MRI is mostly to avoid entries like you see in the fast-ruby project, where developers end up changing the way they write code in order to better exploit a temporary execution difference between two equivalent methods.

As for the speculation, I think it's warranted here. TruffleRuby now has a regex profiler, which I've run against a series of benchmarks from a production web application here. It involves 24 benchmarks in total, each stressing a Rack middleware with representative inputs. Some high level statistics:

  Regexp Total #:                                        670
  Used:                                                  314  (46.9%)
  Unused:                                                356  (53.1%)

  Start of line/string anchor (^ OR \A):                 152  (22.7%)
  Start of line/anchor (^):                               84  (12.5%)
  Start of string anchor (\A):                            68  (10.1%)

  End of line/string anchor ($ OR \z OR \Z):             219  (32.7%)
  End of line anchor ($):                                102  (15.2%)
  End of string anchor (\z):                              53  (07.9%)
  End of string with \n exception anchor (\Z):             0  (00.0%)

  Start & end of line/string anchors (any type):          97  (14.5%)

Out of 670 total regexps, 22.7% of them use some sort of starting anchor and 32.7% use some sort of ending anchor. It's a fairly common feature to use. Of the anchor types, the start and end of line anchors are more common. None of them use the \Z anchor, so we can make that slow path (or kick that over to Joni).

These patterns come from the application, its dependency graph, and the Ruby core and standard libraries. Consequently, about half of the profiled regexps are never matched against. E.g, Psych (Ruby's YAML parser) uses anchored expressions in its parser. If that YAML functionality is never called, then we don't need any of the regexps in there. The numbers provided above are not normalized. Restricting consideration just to those regexps that are used, we have:

  Start of line/string anchor (^ OR \A):                  57  (18.2%)
  Start of line/anchor (^):                               33  (10.5%)
  Start of string anchor (\A):                            24  (07.6%)

  End of line/string anchor ($ OR \z OR \Z):              43  (13.7%)
  End of line anchor ($):                                 20  (06.4%)
  End of string anchor (\z):                              21  (06.7%)
  End of string with \n exception anchor (\Z):             0  (00.0%)

  Start & end of line/string anchors (^...$):             35  (11.1%)

That trims the set of regexps using regexps using a start anchor a little bit. It has a much larger impact on those using and ending anchor. The used regexps are important because those are what we're seeing used in a production application. The unused regexps are still interesting from a holistic view since they can help inform which regexp features appear in real code, but probably aren't pertinent to the proposed optimization here.

The last thing I looked into is whether the strings being matched against have embedded newlines. Based on the numbers presented above, it's more common to see ^ and $ used than \A and \z. The YAML snippet I linked to uses line anchors exclusively, but as far as I can tell, the tag being matched against can't have newlines (or at least doesn't in practice). I updated the regexp profiler to record info about match strings and whether they contain newlines:

  Start of line anchor count:                             33
  Start of line anchor matches:                   14,561,345
  Start of line anchor match string has \n:                0

  Start of string anchor count:                           24
  Start of string anchor matches:                 14,884,834
  Start of string ring anchor match string has \n:         0

  End of line anchor count:                                5
  End of line anchor matches:                        457,137
  End of line anchor match string has \n:                  0

  End of string anchor count:                              3
  End of string anchor matches:                    3,798,471
  End of string ring anchor match strings has \n:          0

For these numbers, I aggregate the data based on its anchor type. The match count is somewhat arbitrary, since it's a function of how long I ran each of the middleware benchmarks. But, anything in the millions is running on the hot path. I was a bit surprised to see that none of these cases had strings with embedded newlines. I've verified that data collected on other regexp patterns do have match strings with embedded newlines, so I'm reasonably certain it's not a flaw in the measurement.

I think the takeaways from this are:

Just to reiterate, this data was collected from a web application and that likely impacts the type of match strings we see. For example, there's a lot of matching against HTTP request headers. While such headers are delineated by newline characters, the values are extracted by the HTTP parser in the Puma web server. The resulting header strings may have embedded newlines (permitted by RFC 2616, deprecated by RFC 7230), but in practice web browsers don't embed newlines in request headers. Additionally, the same HTTP header string is often matched against multiple regular expressions. E.g., a User-Agent parser may check the header against a list of patterns to determine which browser and platform a user is using. Which is all to say that many matches against strings without embedded newlines is likely the norm for web applications. Fortunately, that's a popular domain for Ruby. Other applications of Ruby may have a different execution profile.

One tricky question I don't have an answer for is guarding costly deoptimization from poisoned user input. But, I think that's a concern in general.

djoooooe commented 3 years ago

Thanks for the analysis! Could you provide a list of the regexes using ^ and $ encountered in this experiment?

eregon commented 3 years ago

For what it's worth, in my extremely non-scientific experience, I've rarely ever encountered anyone using \A and \z. I'd hazard to say most Rubyists don't even know those exist, especially those coming from other languages that may only support ^ and $ and possibly some sort of multi-line modifier.

Thanks for the statistics in the latest comment, we see they are used almost as often as ^/$, which match my expectations. If the String cannot contain newlines by design then it doesn't matter which is used, but if it can it's a potential security or correctness issue and I would expect Rubyists know about \A and \z (especially in web apps).

The note about MRI is mostly to avoid entries like you see in the fast-ruby project, where developers end up changing the way they write code in order to better exploit a temporary execution difference between two equivalent methods.

I think it's fair to say that \A \z will always be as fast or faster than ^ $, it has simpler semantics intrinsically. So that's what I would say a valid tip to use \A \z over ^ $ when appropriate, and regardless of performance always consider \A \z.

Of course it's valuable to optimize ^ & better if we can in TruffleRuby & TRegex. @djoooooe Could you provide a patch or measure with the added profile (for no newline)? Does it reach the same performance as \A \z then for strings with no newlines?

djoooooe commented 3 years ago

Unfortunately, adding this speculation is not as straightforward as shown in the example above, i'll come back to this as soon as my schedule allows