uhop / node-re2

node.js bindings for RE2: fast, safe alternative to backtracking regular expression engines.
Other
479 stars 53 forks source link

Do UTF-16 / UTF-8 conversion in JS for matchAll and replace, fixes #194 #200

Closed matthewvalentine closed 1 month ago

matthewvalentine commented 7 months ago

This should fix #194. It's not complete in that it does not include sufficient comments, testing, or benchmarking (beyond confirming that it fixes the scaling issue). It may even be functionally broken. But I want to check that this is the general approach you'd prefer. There is very little change to C++ code, but the JS changes are fairly involved, and it's a very bolt-on-wrapper kind of solution.

There is presumably a certain amount of overhead for small inputs that didn't have scaling problems, due to the extra JS overhead. Doing the changes in C++ instead might avoid that. But I have not benchmarked it. I also don't know whether using the C++ getUtf16Length as I am is faster or slower than doing the same logic in JS.

In theory after these changes the C++ replace function could be simplified to only deal with buffer/UTF-8 inputs and outputs, since JS is handling the conversion.

Note that even with these changes, the scaling issue in #194 still exists for hand-written exec loops. That would be a much harder problem to solve. I think it would involve keeping state. And even with that, you'd still have to somehow be able to tell whether the input string to exec is the same string as before, preferably without holding a strong reference to the string.

matthewvalentine commented 7 months ago

In fact, I know it is functionally broken because it fails the existing tests. Still, I'm look for feedback on the approach.

matthewvalentine commented 7 months ago

I split what used to be this comment into https://github.com/uhop/node-re2/issues/201 since it is about master and not this PR.

Kept for posterity. Hm, in trying to figure out why it breaks the tests, I've either come across a different bug or I am misunderstanding something. On master, without these changes, the indexes returned by `exec`ing a Buffer input seem incorrect. This code: ```js const RE2 = require('./re2'); const r = new RE2('.', 'g'); const b = Buffer.from('test1test2'); while ((m = r.exec(b))) { console.log(m[0].toString(), m.index, r.lastIndex, String.fromCharCode(b[m.index])); } ``` produces this output: ``` t 0 1 t e 2 2 s s 4 3 1 t 6 4 e 1 8 5 t t 10 6 e 12 7 s 14 8 t 16 9 2 18 10 ``` Compared to the same sort of thing with a string input: ``` t 0 1 t e 1 2 e s 2 3 s t 3 4 t 1 4 5 1 t 5 6 t e 6 7 e s 7 8 s t 8 9 t 2 9 10 2 ``` The `.index` value in the Buffer input case doesn't seem to correspond to the value returned as the match, and also doesn't line up with the resulting `.lastIndex`. In fact, `.index` even starts exceeding the buffer length. Anyway, a least one group of test failures for the changes in this PR are because, by switching to all-buffers, `matchAll` is now inheriting that same behvior.
uhop commented 7 months ago

I was thinking about a fix like that:

The only problem is to convert offsets from UTF-8 to UTF-16. Plus we need to convert buffers to strings. We can do matchAll() as a post-processing step. So the whole matchAll() can be a JS function. We will need a modified GetUtf16Length() function, but it looks like you already started on that.

Relevant Buffer APIs:

replaceAll() is a little more involved because of callbacks from C++. That one can be implemented like this:

The post-processing step should be in the C++ code before/if calling a JS callback. We can indicate it using a custom flag, e.g., using '\b' as an indicator. Checking this boolean flag can be added to the existing code, which does this conversion already (internally re2 works with UTF-8 buffers anyway).

This way the replaceAll() code would include:

const re = new RE2(this, this.flags + '\b');

It will hide the fact that the source is a buffer.

A possible twist: we can do the post-processing for matchAll() in C++ too if it is more efficient. I doubt it'll save much but the proof in the benchmarking. For that, we can use the same '\b' flag.

I hope it makes sense. The idea was to minimize C++ changes yet fix inefficiencies. If readers think that my rambles were incoherent do not hesitate to ask questions and poke holes in the ideas.