mathiasbynens / regexpu-core

regexpu’s core functionality, i.e. `rewritePattern(pattern, flag, options)`, which enables rewriting regular expressions that make use of the ES6 `u` flag into equivalent ES5-compatible regular expression patterns.
https://mths.be/regexpu
MIT License
71 stars 12 forks source link

Invalid transpilation for `.` with `su` flags #23

Open RReverser opened 6 years ago

RReverser commented 6 years ago

The example in README with su flag works for a single character, but output seemed odd, so I tried to fuzz it for various strings and looks like output is incorrect / incompatible with native implementation (unless it's a bug in V8 instead).

Let's try an example with two any characters and an ill-formed string:

> s = '\0\uDC00'
'\u0000�'
> r1 = /^.{2}$/su
/.{2}/su
> r1.test(s)
true

As you can see, native regexp implementation (Node.js 11.0.0) matches it as expected.

Let's try with regexpu + useUnicodeFlag:

> r2 = new RegExp(rewritePattern('^.{2}$', 'su', { dotAllFlag: true, useUnicodeFlag: true }), 'u')
/^[\0-\u{10FFFF}]{2}$/u
> r2.test(s)
true

So far, so good.

Now what if we ask it to expand Unicode ranges?

> r3 = new RegExp(rewritePattern('^.{2}$', 'su', { dotAllFlag: true }))
/^(?:[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]){2}$/
> r3.test(s)
false

Oops, looks like a bug.

I didn't dig into reasons yet, but I suspect it has something to do with [^\uD800-\uDBFF]|^ part.

mathiasbynens commented 6 years ago

This is a known limitation of regenerate: https://github.com/mathiasbynens/regenerate#regenerateprototypetostringoptions

Note that lone low surrogates cannot be matched accurately using regular expressions in JavaScript. Regenerate’s output makes a best-effort approach but there can be false negatives in this regard.

Nowadays we have lookbehind support in JS, but a transpiler such as regexpu can’t rely on that (at least not until lookbehinds are widely supported).

Here’s the old relevant comment: https://github.com/mathiasbynens/regenerate/issues/28#issuecomment-72224808 I’d love to hear your thoughts on this!

RReverser commented 6 years ago

Hmm, yeah, I guess it's not an easy thing to solve in a general case and needs more research.

However, for #24 instead of having post-rewrite optimisation layer, I was rather thinking of having specialisations for common cases so that they wouldn't have to be expanded in the first place.

And, specifically for ., which, arguably, is likely to be a common case, having a specialisation would both produce more optimal output and remove it from such false negatives (for that particular false negative) if we convert /./u to something like /(?:.|[\uD800-\uDBFF][\uDC00-\uDFFF])/.

Arguably this is just a partial solution, but then, current implementation already covers only part of the full range anyway.

What do you think?

RReverser commented 6 years ago

Actually, maybe we could do the same for any case where we know the codepoint range (that is, when original regex doesn't explicitly try to match lone surrogates either)...

mathiasbynens commented 6 years ago

I was rather thinking of having specialisations for common cases so that they wouldn't have to be expanded in the first place.

I’m open to this as well. 👍🏻

RReverser commented 6 years ago

Digging into this deeper, I'm starting to think the issue can be solved deeper in regenerate. In splitBMP it tries to extract lone surrogates even when a range could be used as a whole (like with dot above) and splitting could be avoided. Hmm...

RReverser commented 6 years ago

Ah I found original issue https://github.com/mathiasbynens/regexpu/issues/16 now, and indeed the "simpler" transpilation as shown above would break it again.