uhop / node-re2

node.js bindings for RE2: fast, safe alternative to backtracking regular expression engines.
Other
493 stars 53 forks source link

Memory usage regression in 1.21 #210

Closed matthewvalentine closed 4 months ago

matthewvalentine commented 4 months ago

Thank you again for fixing #194! Though, one result of the fix is increased memory requirments if you use multiple regexes in a pipeline.

For example

s.replace(whitespaceRegex, ' ').replace(weirdCharacterRegex, '')

now keeps two extra translated strings in memory persistently, whereas previously it would

  1. only have one in memory at a time, and
  2. not keep it persistently after the replacing is done.

I believe that (1) is a more important issue than (2). (2) just means holding onto the same memory for longer, but (1) means the actual max memory requirements go up, potentially a lot if you have a process that uses many regexes.

It would be hard to fix this in general. But I think it should be fixable at least for everything that goes through the whole string, by dropping the cache when it completes. Such as:

I tried to make a PR that calls dropLastString() in those places, but I wasn't able to get the memory usage to go down, I am not sure why. I might try again later.

Another possible fix might be if there was a single global cache for all the RE2s. Although it seems a bit unclean, it would be guaranteed to have at most 1 string's worth of overhead, and should keep the linear performance on the assumption that people generally don't iterate through two different regexes simultaneously.

Here is a script that unambiguously displays the problem. On my machine, in 1.20.12 this uses 50 MB, while on 1.21 it uses 4 GB.

node --expose-gc script.js
const RE2 = require('./re2');
let s = '';
for (let i = 0; i < 20 * 1024 * 1024; i++) {
    s += 'a';
}
const regexes = [];
for (let i = 0; i < 200; i++) {
    const r = new RE2('x', 'g');
    regexes.push(r);
    s.replace(r, '');
    global.gc();
}
console.log('Done');
while (true) {}
uhop commented 4 months ago

now keeps two extra translated strings in memory persistently

No strings are kept persistently.

uhop commented 4 months ago

Hmm, I stand corrected. It appears to be a relatively easy fix…

uhop commented 4 months ago

It turned out that:

I think my last commits (the current master) fixed the problem with matchAll().

@matthewvalentine — please retest and let me know if it works for you.

PS: And thank you for the repro script!

PPS: Another idea is to keep lastStringValue as a weak pointer, so it can be collected by GC if needed. That's the proper way to do caches. I see what I can do about that.

matthewvalentine commented 4 months ago

@uhop With the code on master, I see no memory issue anymore using the repro script with replace, replaceAll, match, matchAll, test or exec.

uhop commented 4 months ago

I'll play around with the weak pointer idea and will release a new version. Thanks for finding the problem and helping to repro it!

uhop commented 4 months ago

Published as 1.21.1.