tc39 / ecma262

Status, process, and documents for ECMA-262
https://tc39.es/ecma262/

RegExp: A note on matching lone surrogates in unicode mode #2943

Open pthier opened 2 years ago

pthier commented 2 years ago

I recently discovered a bug in V8 where lone surrogates separated by a backreference (which captures undefined, because the group hasn't captured anything yet) incorrectly match a combined surrogate pair: (new RegExp('(\ud801\\1\udc0f)','u')).exec('\ud801\udc0f'). Investigating a little further showed that other engines also have bugs with respect to lone surrogates, which makes me think that adding a Note to the spec might be appropriate.
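For context, here is a minimal sketch of why such a match is expected to fail. With the `u` flag the subject string is matched code point by code point, so '\ud801\udc0f' is the single character U+1040F, and a lone surrogate in the pattern has nothing to match against. The helper `matchUnits` is a hypothetical name introduced only for this illustration; it is not part of the spec or of any engine.

```javascript
// The subject string from the issue: a high surrogate followed by a low
// surrogate, which together encode the single code point U+1040F.
const subject = '\ud801\udc0f';

// Hypothetical helper (not from the issue): lists the units a regular
// expression steps over, with or without the `u` flag.
function matchUnits(str, unicode) {
  // Spreading a string iterates by code point; split('') yields code units.
  const units = unicode ? [...str] : str.split('');
  return units.map(
    (u) => 'U+' + u.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

const withoutU = matchUnits(subject, false); // two code units
const withU = matchUnits(subject, true);     // one code point

console.log(withoutU); // [ 'U+D801', 'U+DC0F' ]
console.log(withU);    // [ 'U+1040F' ]
```

Seen this way, a pattern such as (\ud801\\1\udc0f) asks for two separate characters, U+D801 and then U+DC0F, while the subject in unicode mode offers only one, U+1040F. An empty backreference between the two pattern characters should not change that.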

Some interesting tests:

print(/[\ud800-\udfff]+/u.exec('\ud801\udc0f'));                           // expected: null
print((new RegExp('(\ud801\\1\udc0f)','u')).exec('\ud801\udc0f'));         // expected: null
print((new RegExp('(\ud801\\1?\udc0f)','u')).exec('\ud801\udc0f'));        // expected: null
print((new RegExp('(\ud801\\1{0}\udc0f)','u')).exec('\ud801\udc0f'));      // expected: null

And the output of Chakra, JSC and V8:

#### chakra
𐐏
𐐏,𐐏
𐐏,𐐏
𐐏,𐐏

#### jsc
�       // \udc0f
null
null
𐐏,𐐏

#### v8
null
𐐏,𐐏
null
null

V8 Bug: https://crbug.com/v8/13410

mathiasbynens commented 2 years ago

cc @michaelficarra @bakkot @gibson042 @markusicu