uhop / node-re2

node.js bindings for RE2: fast, safe alternative to backtracking regular expression engines.
Other
479 stars 53 forks source link

matchAll and replaceAll scale nonlinearly #194

Closed matthewvalentine closed 1 month ago

matthewvalentine commented 9 months ago

Observed behavior

My goal is to get a list of all matches in a string. In production, switching from native regexes to RE2 caused an outage because RE2 has worse scaling behavior on that task, at least in some cases. Something that had previously taken milliseconds ended up taking minutes.

In benchmarking a toy input (a string of N xs, matching the regex /x/g), I was surprised to find two things:

  1. Both matchAll and replaceAll scale nonlinearly in the number of matches (or maybe the input size). Specifically, they seem to be almost O(n^2). 10x longer input produces about 100x longer runtime. Native regexes do not have this behavior. In my crude benchmarking, replaceAll is rock-steady linear. And matchAll might be just barely super-linear.
  2. Using replaceAll with a replacer function that accumulates the matches into a list manually is faster (about 3x for me) than using matchAll. This seems true for native regexes as well, though not to the same degree.

The benchmarking is not robust at all, I wouldn't trust it for absolute numbers, especially since the input is very unrealistic, but it's plenty to see the scaling behavior.

Desired behavior

I have no expectation that node-re2 would out-perform native regexes here. But I do expect it to have good scaling. And I especially expect it to have scaling at least as good as native regexes. So, linear scaling would be my primary desire here. (Or - whatever the scaling that native regexes have, maybe that's not linear for some inputs. Other than the ones that trigger non-linear individual match behavior, ofc.)

Second and not very important, it's weird to me that replaceAll is the fastest way to get a list of matches, as I would have thought it must have quite a bit of overhead in constructing the new string. I suspect this must be from matchAll being iterator-based. If so, maybe it would be nice if node-re2 would offer some kind of special "get list of all matches" function that performs faster than either of them. Particularly if, for some strange reason, that could be done linearly but matchAll/replaceAll as spec'd cannot.

As far as I can tell, these two functions are really part of the node-re2 package more than they are the underlying RE2 library, so I don't think fixing the scaling is likely to require a change to RE2 itself. But I could be wrong.

## Benchmarking code and results This was on Node v18.16.1 with RE2 1.20.5. Code: ```js const re2 = require('re2'); function matchAll(input, regex) { return [...input.matchAll(regex)]; } function replaceAll(input, regex, replacement) { const matches = []; input.replaceAll(regex, m => { matches.push(m); return replacement; }); return matches; } function run() { console.log('Size matchAll replaceAllEmpty replaceAllSame replaceAllDifferent'); for (let i = 0; ; i++) { const n = Math.floor(Math.pow(10, 3 + i/3)); console.log(n); for (const native of [true, false]) { const { matchAll, replaceAllEmpty, replaceAllSame, replaceAllDifferent } = benchmark(n, 1, native); function logRes(title, div) { console.log( ` ${title}`.padEnd(15), (matchAll / div).toFixed(3).padEnd(13), (replaceAllEmpty / div).toFixed(2).padEnd(18), (replaceAllSame / div).toFixed(2).padEnd(17), (replaceAllDifferent / div).toFixed(2) ); } console.log(native ? ' Native' : ' RE2'); logRes('Per 1k char', n / 1000); logRes('Per call', 1); console.log(''); } } } function benchmark(inputSize, iters, native) { const input = (new Array(inputSize)).join('x'); let regex = /x/g; if (!native) regex = new re2(regex); return { matchAll: timeit(iters, () => matchAll(input, regex)) / iters, replaceAllEmpty: timeit(iters, () => replaceAll(input, regex, '')) / iters, replaceAllSame: timeit(iters, () => replaceAll(input, regex, 'x')) / iters, replaceAllDifferent: timeit(iters, () => replaceAll(input, regex, 'yyyyy')) / iters, }; } function timeit(n, fn) { const start = Date.now(); for (let i = 0; i < n; i ++) { fn(); } return Date.now() - start; } ``` And here is some output, I stopped when it started taking too long (the numbers are ms): ``` Size matchAll replaceAllEmpty replaceAllSame replaceAllDifferent 1000 Native Per 1k char 0.000 0.00 0.00 0.00 Per call 0.000 0.00 0.00 0.00 RE2 Per 1k char 5.000 1.00 1.00 1.00 Per call 5.000 1.00 1.00 1.00 2154 Native Per 1k char 0.000 0.00 0.46 0.00 Per call 0.000 0.00 1.00 0.00 RE2 Per 1k char 5.107 1.39 1.39 1.39 Per call 11.000 3.00 3.00 3.00 4641 Native Per 1k char 0.215 0.00 0.22 0.00 Per call 1.000 0.00 1.00 0.00 RE2 Per 1k char 6.464 1.51 1.51 1.51 Per call 30.000 7.00 7.00 7.00 10000 Native Per 1k char 0.100 0.10 0.00 0.10 Per call 1.000 1.00 0.00 1.00 RE2 Per 1k char 7.800 2.40 2.50 2.50 Per call 78.000 24.00 25.00 25.00 21544 Native Per 1k char 0.046 0.05 0.05 0.05 Per call 1.000 1.00 1.00 1.00 RE2 Per 1k char 15.642 5.06 5.01 5.06 Per call 337.000 109.00 108.00 109.00 46415 Native Per 1k char 0.065 0.04 0.04 0.04 Per call 3.000 2.00 2.00 2.00 RE2 Per 1k char 33.330 10.45 10.45 10.56 Per call 1547.000 485.00 485.00 490.00 100000 Native Per 1k char 0.080 0.04 0.04 0.06 Per call 8.000 4.00 4.00 6.00 RE2 Per 1k char 66.640 22.24 22.11 22.12 Per call 6664.000 2224.00 2211.00 2212.00 215443 Native Per 1k char 0.097 0.04 0.07 0.04 Per call 21.000 8.00 15.00 9.00 RE2 Per 1k char 142.116 47.44 47.96 47.87 Per call 30618.000 10221.00 10333.00 10314.00 464158 Native Per 1k char 0.153 0.05 0.04 0.05 Per call 71.000 25.00 20.00 24.00 RE2 Per 1k char 305.876 102.56 102.59 102.92 Per call 141975.000 47605.00 47616.00 47771.00 1000000 Native Per 1k char 0.191 0.04 0.04 0.05 Per call 191.000 43.00 45.00 48.00 RE2 Per 1k char 654.652 219.59 219.46 219.85 Per call 654652.000 219593.00 219463.00 219851.00 ```
matthewvalentine commented 9 months ago

I should probably mention that it looked like a manual .exec() loop wasn't any faster than matchAll, but I didn't investigate that closely.

matthewvalentine commented 9 months ago

I have discovered that match with a global regex appears to scale linearly for this test. So

  1. If I can confirm that, then that's a good mitigation for me right now.
  2. It does suggest this should possibly be a solvable problem for matchAll, whatever the issue may be.
  3. Maybe, if it isn't solvable, this is ultimately down to capture groups? (Even though my example regex does not have any.) Since that's something match does not provide, but I believe matchAll does.
uhop commented 9 months ago

You do aware that this function is quadratic:

function matchAll(input, regex) {
  return [...input.matchAll(regex)];
}

Specifically, the [...iterator] part, which adds new elements at the end reallocating its array from time to time.

It looks like replaceAll() does the same but differently.

You can even the field by using a loop:

function matchAll(input, regex) {
  const matches = [];
  for (const m of input.matchAll(regex)) {
    matches.push(m);
  }
  return matches;
}

But I doubt it'll be much different. It is likely that you measured the difference between a callback I call directly from C++ and a Node's implementation of iterators. See matchAll().

While RegExp uses a primitive algorithm (last time I checked) and used to be slower than RE2 many versions ago, nowadays it is JITed, which makes it fast in most cases. The only reasonable use case for RE2 is when we deal with "bad input", or cannot control/sanitize it, e.g., it comes from users. You can find examples of such inputs custom-tailored for a regular expression used in DOS attacks.

One user even built an evaluator for that: https://github.com/vt-iwamoto/node-re2-benchmark (listed in this project's wiki).

matthewvalentine commented 9 months ago

I fully expect RE2 to be much slower on average. Like you said, the goal here is predictable performance on bad input. This is a kind of input (a string with lots of matches) where it is not bad input for RegExp but IS bad input for node-re2.

You do aware that this function is quadratic: Specifically, the [...iterator] part, which adds new elements at the end reallocating its array from time to time.

As far as I know appending to an array is amortized constant because it only has to reallocate a logarithmic number of times. No different from std::vector. So, apart from what the iterator is doing internally, I think that the loop itself is linear.

It is likely that you measured the difference between a callback I call directly from C++ and a Node's implementation of iterators.

Makes sense for overall performance, but surely it can't explain a difference in scaling? The callback is only executed a linear number of times.

In comparing RegExp and node-re2, the observed scaling is drastically different, even though the loop logic is identical. Different not just in constant multiple, but in exponent. It seems to me the difference can only come from what the iterator is doing internally (so, RegExp vs node-re2).

Also, I did confirm that node-re2's .match() function (with global flag) appears to be linear for this task. So it doesn't seem like a fundamental limitation of the task, nor of RE2 itself. We could throw away all the talk of RegExp and rephrase this issue ".matchAll() scales worse than .match()". Again, I'd expect it to perform worse, but not to scale worse.

uhop commented 8 months ago

As far as I know appending to an array is amortized constant because it only has to reallocate a logarithmic number of times. No different from std::vector.

I think you are right about that.

It seems to me the difference can only come from what the iterator is doing internally (so, RegExp vs node-re2).

Again, here is a link to the node-re2's implementation of matchAll() — about a dozen lines. Feel free to improve it and submit a PR.

matthewvalentine commented 8 months ago

I've confirmed this is entirely from UTF-16 / UTF-8 translation. When using Buffer input and .useBuffers = true on the replacer, the nonlinearity disappears from both .matchAll() and .replace[All]().

This makes sense because, currently, the conversion runs on the entire input string for each call to .exec(), which means each iteration of .matchAll(). And UTF-16 index calculations are done from the beginning of the string, both for .exec() and .replace(). Meanwhile .match() only does the string conversion once, and does not have to return an index for each match.

uhop commented 8 months ago

It is good actionable data. We can consider moving it to C++, or switching to a buffer version in JS. TBH, the C++ implementation is the bigger endeavor that can require some code restructuring to reuse the internal implementation of exec() keeping JS bridging (and type coercion) separate.

In any case, PRs are welcome.

uhop commented 1 month ago

@matthewvalentine please verify that it scales better now.

I did my own benchmarking with this code:

'use strict';

const RE2 = require('./re2.js');

const sample = 'a'.repeat(100_000);
const re = new RE2('a', 'g');

const result = Array.from(sample.matchAll(re));

console.log(result.length);

I ran it against the new code and 1.20.12. Changing the repeat number shows a linear change in the performance.

The good thing about this code is that it is universal and covers a wide range of possible use cases, including matchAll().

matthewvalentine commented 1 month ago

@uhop Yes, the nonlinearity is gone from my tests as well! Thank you so much

uhop commented 1 month ago

Published as 1.21.0.