Closed matthewvalentine closed 1 month ago
In fact, I know it is functionally broken because it fails the existing tests. Still, I'm looking for feedback on the approach.
I split what used to be this comment into https://github.com/uhop/node-re2/issues/201 since it is about master and not this PR.
I was thinking about a fix like this:
`Buffer` supports it directly. The only problems are converting offsets from UTF-8 to UTF-16 and converting buffers to strings. We can do both as a post-processing step, so the whole `matchAll()` can be a JS function. We will need a modified `GetUtf16Length()` function, but it looks like you already started on that.
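To make the post-processing idea concrete, here is a minimal sketch of the offset and buffer conversion in plain JS. All function names are hypothetical and not part of node-re2, and the match shape (an array of `Buffer` groups with a byte-based `index`) is an assumption. The converter remembers its last position, so a run of ascending offsets costs one pass over the buffer in total.

```javascript
// Hypothetical helper (not part of node-re2): counts the UTF-16 code units
// encoded by the UTF-8 bytes in buf[start, end). Assumes valid UTF-8 and
// offsets that fall on character boundaries.
function utf16Length(buf, start, end) {
  let units = 0;
  for (let i = start; i < end; i++) {
    const byte = buf[i];
    if ((byte & 0xc0) === 0x80) continue; // continuation byte adds nothing
    units += byte >= 0xf0 ? 2 : 1; // 4-byte sequences become surrogate pairs
  }
  return units;
}

// Returns a function that maps byte offsets to UTF-16 offsets. It resumes
// from the previous offset, so ascending calls scan each byte only once.
function makeOffsetConverter(buf) {
  let lastByte = 0;
  let lastUnit = 0;
  return function toUtf16Offset(byteOffset) {
    if (byteOffset < lastByte) {
      // Going backwards: rescan from the start of the buffer.
      lastByte = 0;
      lastUnit = 0;
    }
    lastUnit += utf16Length(buf, lastByte, byteOffset);
    lastByte = byteOffset;
    return lastUnit;
  };
}

// Converts one byte-based match into a string-based one. The match shape
// (array of Buffers, undefined for unmatched groups) is an assumption.
function toStringMatch(match, toUtf16Offset) {
  const result = match.map((group) =>
    group === undefined ? undefined : group.toString('utf8'));
  result.index = toUtf16Offset(match.index);
  return result;
}

// Usage with a hand-made match for "b" at byte offset 7.
const source = Buffer.from('aé😀b', 'utf8');
const fakeMatch = Object.assign([source.subarray(7, 8)], { index: 7 });
const converted = toStringMatch(fakeMatch, makeOffsetConverter(source));
// converted[0] === 'b', converted.index === 4
```

Whether this is faster than calling into C++ for the length computation is exactly the open benchmarking question.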
Relevant `Buffer` APIs:
`replaceAll()` is a little more involved because of callbacks from C++. That one can be implemented like this:
The post-processing step should happen in the C++ code before a JS callback is called, if there is one. We can indicate it using a custom flag, e.g., using `'\b'` as an indicator. Checking this boolean flag can be added to the existing code, which does this conversion already (internally `re2` works with UTF-8 buffers anyway).
This way the `replaceAll()` code would include:
const re = new RE2(this, this.flags + '\b');
It will hide the fact that the source is a buffer.
A possible twist: we can do the post-processing for `matchAll()` in C++ too if it is more efficient. I doubt it'll save much, but the proof is in the benchmarking. For that, we can use the same `'\b'` flag.
I hope it makes sense. The idea was to minimize C++ changes yet fix inefficiencies. If you think my ramblings were incoherent, do not hesitate to ask questions and poke holes in the ideas.
This should fix #194. It's not complete in that it does not include sufficient comments, testing, or benchmarking (beyond confirming that it fixes the scaling issue). It may even be functionally broken. But I want to check that this is the general approach you'd prefer. There is very little change to C++ code, but the JS changes are fairly involved, and it's a very bolt-on-wrapper kind of solution.
There is presumably a certain amount of overhead for small inputs that didn't have scaling problems, due to the extra JS overhead. Doing the changes in C++ instead might avoid that. But I have not benchmarked it. I also don't know whether using the C++ `getUtf16Length` as I am is faster or slower than doing the same logic in JS.
In theory, after these changes the C++ `replace` function could be simplified to only deal with buffer/UTF-8 inputs and outputs, since JS is handling the conversion.
Note that even with these changes, the scaling issue in #194 still exists for hand-written `exec` loops. That would be a much harder problem to solve. I think it would involve keeping state. And even with that, you'd still have to somehow be able to tell whether the input string to `exec` is the same string as before, preferably without holding a strong reference to the string.
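As a rough illustration of what "keeping state" could look like, here is a sketch of a cache that maps a string `lastIndex` to a UTF-8 byte offset incrementally. The `OffsetCache` class is hypothetical and not part of node-re2. It deliberately takes the easy way out: it holds a strong reference to the last input and compares with `!==`, which is the part the paragraph above flags as undesirable.

```javascript
// Hypothetical sketch, not node-re2 code. Remembers the last input string and
// the last (UTF-16 index, UTF-8 byte offset) pair computed for it, so a loop
// with an ascending lastIndex only scans the newly covered part.
class OffsetCache {
  constructor() {
    this.input = null; // strong reference, the drawback noted above
    this.charIndex = 0;
    this.byteOffset = 0;
  }

  // Returns the UTF-8 byte offset for UTF-16 index `index` of `input`.
  // Assumes a well-formed string and an index that does not split a
  // surrogate pair.
  toByteOffset(input, index) {
    if (input !== this.input || index < this.charIndex) {
      // Different string or backwards move: start over from the beginning.
      this.input = input;
      this.charIndex = 0;
      this.byteOffset = 0;
    }
    const end = Math.min(index, input.length);
    for (let i = this.charIndex; i < end; i++) {
      const unit = input.charCodeAt(i);
      if (unit < 0x80) this.byteOffset += 1;
      else if (unit < 0x800) this.byteOffset += 2;
      else if (unit >= 0xd800 && unit <= 0xdfff) this.byteOffset += 2; // half of a 4-byte pair
      else this.byteOffset += 3;
    }
    this.charIndex = end;
    return this.byteOffset;
  }
}
```

Strings are primitives, so they cannot be targets of `WeakRef` or keys of `WeakMap`, which is why avoiding the strong reference is not straightforward.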