Open gibson042 opened 3 years ago
step e "Let listIndex be the index into Input of the character that was obtained from element index of str."
I originally wrote that text. The intended meaning when Unicode is true was that a string index value corresponding to either surrogate of a valid surrogate pair is a reference to the corresponding code point. IOW, when index identifies the trailing surrogate of a pair, listIndex is the position of the character derived from the pair.
That was an intentional design discussion and generally how string indices that reference a valid trailing surrogate of a pair are handled throughout the RegExp spec. For example see step 12.a of RegExpBuiltinExec
RegExpBuiltinExec step 12.a unambiguously maps an index in a list of code points to the index of its first code unit in the string from which those code points were derived, so I don't see the connection. But regardless, the original author indicating that option 1 was intended is certainly strong support for it.
This looks like a duplicate of #128.
I agree, and find this consequence of option 1 mentioned in https://github.com/tc39/ecma262/issues/128#issuecomment-393748953 to be particularly interesting:
It would be helpful to add a note that a regexp can produce a match at an index prior to lastIndex. I would think many developers have traditionally assumed that a match could only be found at, or after, lastIndex.
Code that works with a caller-provided RegExp object will suddenly break if they are given certain unicode regexps.
Description: RegExpBuiltinExec invokes its matcher closure with the input string and the value derived from
lastIndex
, and Pattern semantics for a unicode regular expression interpret the string str as a sequence of code points Input and then in step e "Let listIndex be the index into Input of the character that was obtained from element index of str." But it is unclear what should happen when index identifies the trailing surrogate of a pair—should listIndex be the position of the character derived from the pair, or should it be the position after that?Unsurprisingly, the results of this underspecification are implementation divergence.
eshost Output:
JSC, XS, et al. interpret the trailing surrogate as if it were unpaired, but that is not consistent with closure algorithm step c ("let Input be a List whose elements are the code units that are the elements of str"). Graal, SpiderMonkey, and V8 get that part right, but seem to vary based on the regular expression itself whether or not to check the start of the character split by
lastIndex
. And engine262 is just right out.Closure algorithm step e should clearly be clarified, and I see two acceptable options:
lastIndex
, stepping backward by one code unit.Option 2 makes more sense to me, but this is such an edge case that I wouldn't argue strongly against option 1.