Underspecified behavior of global unicode regular expressions with lastIndex inside a surrogate pair

gibson042 commented 3 years ago

Description: RegExpBuiltinExec invokes its matcher closure with the input string and the value derived from lastIndex, and Pattern semantics for a unicode regular expression interpret the string str as a sequence of code points Input and then in step e "Let listIndex be the index into Input of the character that was obtained from element index of str." But it is unclear what should happen when index identifies the trailing surrogate of a pair—should listIndex be the position of the character derived from the pair, or should it be the position after that?

Unsurprisingly, the results of this underspecification are implementation divergence.

eshost Output:

$ eshost -sx 'var r = /\udf06/ug; r.lastIndex=1; var result = r.exec("\ud834\udf06\udf06");
    print(result && result.index, r.lastIndex)'
#### ChakraCore, Hermes, JavaScriptCore, Moddable XS
1 2

#### engine262
1 3

#### GraalJS, SpiderMonkey, V8
2 3

$ eshost -sx 'var r = /./ug; r.lastIndex=1; var result = r.exec("\ud834\udf06\udf06");
    print(result && result.index, r.lastIndex)'
#### ChakraCore, Hermes, JavaScriptCore, Moddable XS
1 2

#### engine262
1 3

#### GraalJS, SpiderMonkey, V8
0 2

JSC, XS, et al. interpret the trailing surrogate as if it were unpaired, but that is not consistent with closure algorithm step c ("let Input be a List whose elements are the code units that are the elements of str"). Graal, SpiderMonkey, and V8 get that part right, but seem to vary based on the regular expression itself whether or not to check the start of the character split by lastIndex. And engine262 is just right out.

Closure algorithm step e should clearly be clarified, and I see two acceptable options:

Have listIndex correspond with the position of the character split by lastIndex, stepping backward by one code unit.
Have listIndex correspond with the position after that, stepping forward by one code unit.

Option 2 makes more sense to me, but this is such an edge case that I wouldn't argue strongly against option 1.

allenwb commented 3 years ago

step e "Let listIndex be the index into Input of the character that was obtained from element index of str."

I originally wrote that text. The intended meaning when Unicode is true was that a string index value corresponding to either surrogate of a valid surrogate pair is a reference to the corresponding code point. IOW, when index identifies the trailing surrogate of a pair, listIndex is the position of the character derived from the pair.

That was an intentional design discussion and generally how string indices that reference a valid trailing surrogate of a pair are handled throughout the RegExp spec. For example see step 12.a of RegExpBuiltinExec

gibson042 commented 3 years ago

RegExpBuiltinExec step 12.a unambiguously maps an index in a list of code points to the index of its first code unit in the string from which those code points were derived, so I don't see the connection. But regardless, the original author indicating that option 1 was intended is certainly strong support for it.

anba commented 3 years ago

This looks like a duplicate of #128.

gibson042 commented 3 years ago

I agree, and find this consequence of option 1 mentioned in https://github.com/tc39/ecma262/issues/128#issuecomment-393748953 to be particularly interesting:

It would be helpful to add a note that a regexp can produce a match at an index prior to lastIndex. I would think many developers have traditionally assumed that a match could only be found at, or after, lastIndex.

Code that works with a caller-provided RegExp object will suddenly break if they are given certain unicode regexps.

tc39 / ecma262

Underspecified behavior of global unicode regular expressions with lastIndex inside a surrogate pair #2511