Closed m0s closed 2 years ago
It looks like RE2 matches literally, it does not extend a range for surrogates. In order to fix it, we need to understand what RegExp is doing and do the same likely in the translation function.
Thoughts on what to do? Thoughts on how to test RegExp and determine its behavior?
Looking at MDN, I can see that there are three cases dealing with hexadecimal codes:
Characters | Meaning |
---|---|
\xhh |
Matches the character with the code hh (two hexadecimal digits). |
\uhhhh |
Matches a UTF-16 code-unit with the value hhhh (four hexadecimal digits). |
\u{hhhh} or \u{hhhhh} |
(Only when the u flag is set.) Matches the character with the Unicode value U+hhhh or U+hhhhh (hexadecimal digits). |
Playing with RegExp I've noticed that it looks like it operates with hexadecimal digits in UTF-16 notation and doesn't deal with code points. To wit:
/\ud835/.test('𝓱𝓮𝓵𝓵𝓸') // true
/\ud835\udcf1/.test('𝓱𝓮𝓵𝓵𝓸') // true
re2
the C++ library operates in UTF-8 and cannot deal with UTF-16 artifacts.
Having said that I want to point out that RE2 always works in u
mode: RegExp Unicode, which is documented in the readme.
So, technically, we should do it like that:
/\ud835/u.test('𝓱𝓮𝓵𝓵𝓸') // false
// your examples:
/[\u00a1-\uffff]/u.test('𝓱𝓮𝓵𝓵𝓸') // false
/[\u00a1-\u{fffff}]/u.test('𝓱𝓮𝓵𝓵𝓸') // true
The same with RE2:
RE2(/[\u00a1-\uffff]/u).test('𝓱𝓮𝓵𝓵𝓸') // false
RE2(/[\u00a1-\u{fffff}]/u).test('𝓱𝓮𝓵𝓵𝓸') // true
We may do a better job with translating from non-Unicode patterns to Unicode ones. Or warn people about that (we already have a mechanism for that). I am completely open to suggestions. Thoughts?
PS: One thing I would want to avoid is to break existing code for our users. So any suggestion should take that into account.
Thanks for detailed response and references. Looking back, as API user my trouble was with ignorance with regards to unicode mode... while I saw the warning I wasn't able to connect it with my issue. Users who don't need to match surrogate characters probably don't need to worry about being in unicode mode. One suggestion is perhaps to mention this use case in the readme section about being in u
mode. Maybe a warning can be generated during translation if code point in surrogate range is given?
I'll update the readme. I inspected what warnings we have and it looks like they cover this case. In any case, I'll try to think about some way to help users more in this regard.
Done in f267942eb265730c9125f1b55080beb8ead503d1
...and released as 1.17.2.
With RegExp
[\u00a1-\uffff]
matches𝓱𝓮𝓵𝓵𝓸
, however with RE2 this pattern internally gets translated to[\x{00a1}-\x{ffff}]
which RE2 won't match the same string. It seems RegExp is happy to match the components𝓱
-> D835 + DCF1, while RE2 wants to match the original𝓱
->1D4F1
so[\x{00a1}-\x{fffff}]
(with 5 fs) matches it fine. Is this a defect or expected behavior? Also wondering if[\x{00a1}-\x{fffff}]
is the correct workaround?