Open RReverser opened 6 years ago
This is a known limitation of regenerate: https://github.com/mathiasbynens/regenerate#regenerateprototypetostringoptions
Note that lone low surrogates cannot be matched accurately using regular expressions in JavaScript. Regenerate’s output makes a best-effort approach but there can be false negatives in this regard.
Nowadays we have lookbehind support in JS, but a transpiler such as regexpu can’t rely on that (at least not until lookbehinds are widely supported).
Here’s the old relevant comment: https://github.com/mathiasbynens/regenerate/issues/28#issuecomment-72224808 I’d love to hear your thoughts on this!
Hmm, yeah, I guess it's not an easy thing to solve in a general case and needs more research.
However, for #24 instead of having post-rewrite optimisation layer, I was rather thinking of having specialisations for common cases so that they wouldn't have to be expanded in the first place.
And, specifically for .
, which, arguably, is likely to be a common case, having a specialisation would both produce more optimal output and remove it from such false negatives (for that particular false negative) if we convert /./u
to something like /(?:.|[\uD800-\uDBFF][\uDC00-\uDFFF])/
.
Arguably this is just a partial solution, but then, current implementation already covers only part of the full range anyway.
What do you think?
Actually, maybe we could do the same for any case where we know the codepoint range (that is, when original regex doesn't explicitly try to match lone surrogates either)...
I was rather thinking of having specialisations for common cases so that they wouldn't have to be expanded in the first place.
I’m open to this as well. 👍🏻
Digging into this deeper, I'm starting to think the issue can be solved deeper in regenerate. In splitBMP
it tries to extract lone surrogates even when a range could be used as a whole (like with dot above) and splitting could be avoided. Hmm...
Ah I found original issue https://github.com/mathiasbynens/regexpu/issues/16 now, and indeed the "simpler" transpilation as shown above would break it again.
The example in README with
su
flag works for a single character, but output seemed odd, so I tried to fuzz it for various strings and looks like output is incorrect / incompatible with native implementation (unless it's a bug in V8 instead).Let's try an example with two any characters and an ill-formed string:
As you can see, native regexp implementation (Node.js 11.0.0) matches it as expected.
Let's try with regexpu +
useUnicodeFlag
:So far, so good.
Now what if we ask it to expand Unicode ranges?
Oops, looks like a bug.
I didn't dig into reasons yet, but I suspect it has something to do with
[^\uD800-\uDBFF]|^
part.