Closed gibson042 closed 7 months ago
Whitespace is escaped to leave room for /x mode regexps in the future.
So to make sure I understand the issue properly, this would be solved if done by code units, and not code points?
Yes, but I think there is a possibility that a Space_Separator is added in the future that exists in the supplementary U+10000-U+10FFFF range, in which case we would need to add this same support then.
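To make the code unit versus code point distinction concrete, here is a minimal sketch. It uses "𝌆" (U+1D306) purely as a stand-in for a supplementary code point; it is not white space, and the variable names are illustrative only.

```javascript
// "𝌆" is U+1D306, a supplementary code point used here only as a stand-in.
const sample = "𝌆";

// Iterating a string walks code points: one entry for the whole character.
const codePoints = Array.from(sample, (ch) => ch.codePointAt(0));

// Indexing by position walks UTF-16 code units: two entries (a surrogate pair).
const codeUnits = Array.from({ length: sample.length }, (_, i) =>
  sample.charCodeAt(i)
);

console.log(codePoints.map((n) => n.toString(16))); // one value: 1d306
console.log(codeUnits.map((n) => n.toString(16))); // two values: d834, df06
```

An escaping routine that walks code units would only ever see values in the U+0000-U+FFFF range, which is why the question of escaping a supplementary code point as a single unit would not arise.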
EncodeForRegExpEscape step 4.e (which would be reached if input c were a Space_Separator supplementary code point in [U+10000, U+10FFFF]) results in a return value like \u{…}. The interpretation of such pattern text is dependent upon regular expression flags. Specifically, it is interpreted as a |RegExpUnicodeEscapeSequence| that will match a code point with the contained hexadecimal value in the presence of a "u" or "v" flag, but otherwise is interpreted as either a syntax error or (only in a host supporting Annex B and only when the hexadecimal representation of the code point consists only of decimal digits) as a quantified |ExtendedAtom| "u" with the specified decimal count of repetitions (e.g., /^\u{10000}$/.test("u".repeat(10000)) is true).
Rather than returning results subject to conditional interpretation, EncodeForRegExpEscape should return a \u…\u… surrogate pair |RegExpUnicodeEscapeSequence| for such inputs (which work in both Unicode and non-Unicode regular expressions, e.g. /^\uD834\uDF06$/u.test("𝌆") and /^\uD834\uDF06$/v.test("𝌆") and /^\uD834\uDF06$/.test("𝌆") are all true).
Or alternatively (and preferably IMO), EncodeForRegExpEscape should not escape all white space. I'm not certain why it does so right now, but looking back I suspect it is due to a misinterpretation of #30 (which requests escaping of control characters, and even more specifically line terminators, and even that isn't necessary).
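As a rough illustration of the surrogate pair option, here is a sketch. `escapeAsCodeUnits` is a hypothetical helper written for this comment, not the proposal's EncodeForRegExpEscape algorithm. The non-Unicode check on the braced form assumes a host that supports Annex B (typical browsers and Node.js), and the "v" flag is left out only so the sketch also runs on engines that predate it.

```javascript
// Hypothetical helper, not the proposal's algorithm: write each UTF-16 code
// unit of a code point as \uXXXX, so a supplementary code point becomes a
// surrogate pair of escapes.
function escapeAsCodeUnits(codePoint) {
  const s = String.fromCodePoint(codePoint);
  let out = "";
  for (let i = 0; i < s.length; i++) {
    const hex = s.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0");
    out += "\\u" + hex;
  }
  return out;
}

// Surrogate pair form: the same pattern text matches with or without "u".
const pairSource = escapeAsCodeUnits(0x1D306); // U+1D306 is "𝌆"
const pairMatches = ["", "u"].map((flags) =>
  new RegExp("^" + pairSource + "$", flags).test("𝌆")
);

// Braced form: the same pattern text means different things per flag.
const bracedSource = "^\\u{10000}$";
// With "u": matches the single code point U+10000.
const bracedUnicode = new RegExp(bracedSource, "u").test("\u{10000}");
// Without "u" (Annex B assumed): matches "u" repeated 10000 times.
const bracedLegacy = new RegExp(bracedSource).test("u".repeat(10000));

console.log(pairSource, pairMatches, bracedUnicode, bracedLegacy);
```

In a host with Annex B support, every check here should come out true, which is the point: the surrogate pair text keeps one meaning across flags, while the braced text matches entirely different strings depending on the flag.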