Closed matthewvalentine closed 4 months ago
> now keeps two extra translated strings in memory persistently
No strings are kept persistently.
Hmm, I stand corrected. It appears to be a relatively easy fix…
It turned out that:
I think my last commits (the current master) fixed the problem with `matchAll()`.
@matthewvalentine — please retest and let me know if it works for you.
PS: And thank you for the repro script!
PPS: Another idea is to keep `lastStringValue` as a weak pointer, so it can be collected by GC if needed. That's the proper way to do caches. I'll see what I can do about that.
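For illustration only, here is a rough sketch of what a weakly held one-slot cache could look like in plain JavaScript. It is not node-re2 code: `WeakLastValue`, `translate`, and the `{ text }` wrapper are made-up names, and the addon's native side would need its own mechanism, which is not shown. `WeakRef` only accepts objects, so the sketch wraps the input string in an object.

```javascript
// Hypothetical sketch, not node-re2 code: a one-slot cache that never keeps
// its contents alive on its own.
class WeakLastValue {
  #source = null;  // WeakRef to the most recent input object
  #derived = null; // WeakRef to the data computed from that input

  // Returns the cached data for `source`, or calls `compute(source)` on a miss.
  get(source, compute) {
    if (this.#source?.deref() === source) {
      const hit = this.#derived.deref();
      if (hit !== undefined) return hit; // both still alive: reuse
    }
    const derived = compute(source); // must return an object
    this.#source = new WeakRef(source);
    this.#derived = new WeakRef(derived);
    return derived;
  }
}

// Usage: the input is wrapped in an object because WeakRef rejects primitives.
const cache = new WeakLastValue();
let computeCalls = 0;
const translate = (src) => {
  computeCalls += 1;
  return { units: Array.from(src.text) }; // stand-in for the translated string
};

const input = { text: 'hello' };
const first = cache.get(input, translate);
const second = cache.get(input, translate); // hit while `first` is still referenced
```

Once nothing else references the input or the derived data, the garbage collector is free to reclaim both, and the next call simply recomputes. When that happens is up to the engine, so the cache can only ever be treated as a hint.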
@uhop With the code on master, I see no memory issue anymore using the repro script with `replace`, `replaceAll`, `match`, `matchAll`, `test`, or `exec`.
I'll play around with the weak pointer idea and will release a new version. Thanks for finding the problem and helping to repro it!
Published as 1.21.1.
Thank you again for fixing #194! Though, one result of the fix is increased memory requirements if you use multiple regexes in a pipeline.
For example
now keeps two extra translated strings in memory persistently, whereas previously it would
I believe that (1) is a more important issue than (2). (2) just means holding onto the same memory for longer, but (1) means the actual max memory requirements go up, potentially a lot if you have a process that uses many regexes.
It would be hard to fix this in general. But I think it should be fixable at least for everything that goes through the whole string, by dropping the cache when it completes. Such as:
- `lastIndex` gets reset to 0
- `replaceAll` that go all the way through the string

I tried to make a PR that calls `dropLastString()` in those places, but I wasn't able to get the memory usage to go down, I am not sure why. I might try again later.

Another possible fix might be if there was a single global cache for all the RE2s. Although it seems a bit unclean, it would be guaranteed to have at most 1 string's worth of overhead, and should keep the linear performance on the assumption that people generally don't iterate through two different regexes simultaneously.
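To make the two ideas concrete, here is a minimal sketch in plain JavaScript of a single shared slot that is dropped whenever an operation has gone through the whole string. It is only a model of the proposal, not how node-re2 is implemented: `sharedSlot`, `translatedFor`, `dropSharedSlot`, and `CachedRegex` are hypothetical names, the built-in `RegExp` stands in for RE2, and `Array.from` stands in for the real string translation.

```javascript
// Hypothetical sketch, not node-re2 internals: one process-wide slot shared by
// every regex wrapper, so at most one translated copy exists at a time.
const sharedSlot = { source: null, translated: null };
let translations = 0; // counts how often the expensive step runs

function translatedFor(source) {
  if (sharedSlot.source !== source) {
    translations += 1;
    sharedSlot.source = source;
    sharedSlot.translated = Array.from(source); // stand-in for the translated string
  }
  return sharedSlot.translated;
}

function dropSharedSlot() {
  sharedSlot.source = null;
  sharedSlot.translated = null;
}

class CachedRegex {
  constructor(pattern) {
    this.re = new RegExp(pattern, 'g');
  }

  // Incremental matching: repeated calls on the same string reuse the slot.
  exec(source) {
    translatedFor(source);
    const m = this.re.exec(source);
    if (m === null) dropSharedSlot(); // scan finished, lastIndex is back to 0
    return m;
  }

  // Whole-string operation: nothing needs to survive the call.
  replaceAll(source, replacement) {
    translatedFor(source);
    const out = source.replace(this.re, replacement);
    dropSharedSlot();
    return out;
  }
}

// Usage: an exec loop translates once, and a pipeline leaves nothing behind.
const digits = new CachedRegex('\\d+');
const text = 'a1b22c333';
const found = [];
for (let m = digits.exec(text); m !== null; m = digits.exec(text)) found.push(m[0]);
const translationsAfterScan = translations;

const piped = new CachedRegex('a').replaceAll(
  new CachedRegex('b').replaceAll('aabb', 'x'),
  'y'
);
```

The trade-off is the one described above: interleaving two regexes over different strings would evict the slot on every call and lose the linear behavior, but memory overhead stays bounded by one string.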
Here is a script that unambiguously displays the problem. On my machine, in 1.20.12 this uses 50 MB, while on 1.21 it uses 4 GB.