uhop / node-re2

node.js bindings for RE2: fast, safe alternative to backtracking regular expression engines.
Other
495 stars 53 forks source link

Matching unicode surrogates broken? #122

Closed m0s closed 2 years ago

m0s commented 2 years ago

With RegExp [\u00a1-\uffff] matches 𝓱𝓮𝓵𝓵𝓸, however with RE2 this pattern internally gets translated to [\x{00a1}-\x{ffff}] which RE2 won't match the same string. It seems RegExp is happy to match the components 𝓱 -> D835 + DCF1, while RE2 wants to match the original 𝓱 -> 1D4F1 so [\x{00a1}-\x{fffff}] (with 5 fs) matches it fine. Is this a defect or expected behavior? Also wondering if [\x{00a1}-\x{fffff}] is the correct workaround?

uhop commented 2 years ago

It looks like RE2 matches literally, it does not extend a range for surrogates. In order to fix it, we need to understand what RegExp is doing and do the same likely in the translation function.

Thoughts on what to do? Thoughts on how to test RegExp and determine its behavior?

uhop commented 2 years ago

Looking at MDN, I can see that there are three cases dealing with hexadecimal codes:

Characters Meaning
\xhh Matches the character with the code hh (two hexadecimal digits).
\uhhhh Matches a UTF-16 code-unit with the value hhhh (four hexadecimal digits).
\u{hhhh} or \u{hhhhh} (Only when the u flag is set.) Matches the character with the Unicode value U+hhhh or U+hhhhh (hexadecimal digits).

Playing with RegExp I've noticed that it looks like it operates with hexadecimal digits in UTF-16 notation and doesn't deal with code points. To wit:

 /\ud835/.test('𝓱𝓮𝓵𝓵𝓸')       // true
 /\ud835\udcf1/.test('𝓱𝓮𝓵𝓵𝓸') // true

re2 the C++ library operates in UTF-8 and cannot deal with UTF-16 artifacts.

Having said that I want to point out that RE2 always works in u mode: RegExp Unicode, which is documented in the readme.

So, technically, we should do it like that:

 /\ud835/u.test('𝓱𝓮𝓵𝓵𝓸')            // false

// your examples:
/[\u00a1-\uffff]/u.test('𝓱𝓮𝓵𝓵𝓸')    // false
/[\u00a1-\u{fffff}]/u.test('𝓱𝓮𝓵𝓵𝓸') // true

The same with RE2:

RE2(/[\u00a1-\uffff]/u).test('𝓱𝓮𝓵𝓵𝓸')    // false
RE2(/[\u00a1-\u{fffff}]/u).test('𝓱𝓮𝓵𝓵𝓸') // true

We may do a better job with translating from non-Unicode patterns to Unicode ones. Or warn people about that (we already have a mechanism for that). I am completely open to suggestions. Thoughts?

PS: One thing I would want to avoid is to break existing code for our users. So any suggestion should take that into account.

m0s commented 2 years ago

Thanks for detailed response and references. Looking back, as API user my trouble was with ignorance with regards to unicode mode... while I saw the warning I wasn't able to connect it with my issue. Users who don't need to match surrogate characters probably don't need to worry about being in unicode mode. One suggestion is perhaps to mention this use case in the readme section about being in u mode. Maybe a warning can be generated during translation if code point in surrogate range is given?

uhop commented 2 years ago

I'll update the readme. I inspected what warnings we have and it looks like they cover this case. In any case, I'll try to think about some way to help users more in this regard.

uhop commented 2 years ago

Done in f267942eb265730c9125f1b55080beb8ead503d1

uhop commented 2 years ago

...and released as 1.17.2.