mozilla-spidermonkey / jsparagus

Experimental JS parser-generator project.

Allow UnicodeEscapeSequence for surrogate code unit #21

Open jorendorff opened 5 years ago

jorendorff commented 5 years ago

This sort of thing is valid and actually pretty common:

"(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[\0-\uFFFF])"

jorendorff commented 4 years ago

Affected jit-tests:

jorendorff commented 4 years ago

JS allows unpaired surrogates, and the test suite naturally loves to hit this corner case. We need the ability to pass JS strings that are not valid UTF-16 from Visage to C++.
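To illustrate why this is a problem on the Rust side, here is a minimal sketch using only the standard library. The helper names `code_units_to_string` and `has_unpaired_surrogate` are hypothetical and are not part of jsparagus; they only show that a lone surrogate such as `\uD800` cannot be turned into a Rust `String`.

```rust
/// Hypothetical helper (not from jsparagus): try to convert JS code units
/// into a Rust `String`. Returns `None` if the input is not valid UTF-16,
/// for example when it contains an unpaired surrogate.
pub fn code_units_to_string(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Hypothetical helper (not from jsparagus): report whether the code units
/// contain a surrogate that is not part of a valid high/low pair.
pub fn has_unpaired_surrogate(units: &[u16]) -> bool {
    // `decode_utf16` yields `Err` for each unpaired surrogate it encounters.
    std::char::decode_utf16(units.iter().copied()).any(|r| r.is_err())
}
```

A well-formed pair such as `[0xD83D, 0xDE00]` converts fine, while `[0xD800]` on its own is rejected, which is exactly the case the JS literal `"\uD800"` produces.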

codehag commented 4 years ago

I am paraphrasing what we wrote in the chat.

We currently have strings implemented via Rust's &str, which represents strings as UTF-8. However, JavaScript strings are not UTF-8. They are instead a sequence of 16-bit code units, essentially a Vec<u16>, that behaves a lot like UTF-16 -> https://tc39.es/ecma262/#sec-ecmascript-language-types-string-type

Unfortunately, we can't rely on str in this case, because str must always be valid Unicode and we need to accept invalid UTF-16 (such as unpaired surrogates). So we need to implement the JavaScript String type as specified.

Some pseudocode to get the idea across: it might look something like enum JsString<'a> { Borrowed(&'a str), Owned(String), Owned16(Vec<u16>) }, but not quite. This will take some work in lexer.rs.
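Expanding that pseudocode a little, a rough sketch might look like the following. Only the enum shape comes from the comment above; the methods `from_code_units`, `to_code_units`, and `js_length` are assumptions added for illustration, and the real design in lexer.rs may well differ.

```rust
/// Sketch of the enum from the pseudocode above. The variants keep UTF-8
/// storage where possible and fall back to raw code units otherwise.
#[derive(Debug, Clone)]
pub enum JsString<'a> {
    /// Borrowed directly from the source text (valid UTF-8).
    Borrowed(&'a str),
    /// Owned, valid Unicode (for example after processing escapes).
    Owned(String),
    /// Owned raw 16-bit code units; may contain unpaired surrogates.
    Owned16(Vec<u16>),
}

impl<'a> JsString<'a> {
    /// Hypothetical constructor: store as `String` when the code units are
    /// well-formed UTF-16, otherwise keep the raw code units.
    pub fn from_code_units(units: Vec<u16>) -> JsString<'static> {
        match String::from_utf16(&units) {
            Ok(s) => JsString::Owned(s),
            Err(_) => JsString::Owned16(units),
        }
    }

    /// Hypothetical accessor: the contents as 16-bit code units,
    /// regardless of which variant holds them.
    pub fn to_code_units(&self) -> Vec<u16> {
        match self {
            JsString::Borrowed(s) => s.encode_utf16().collect(),
            JsString::Owned(s) => s.encode_utf16().collect(),
            JsString::Owned16(units) => units.clone(),
        }
    }

    /// Hypothetical accessor: the length in code units, matching what
    /// JavaScript reports as `length`.
    pub fn js_length(&self) -> usize {
        match self {
            JsString::Borrowed(s) => s.encode_utf16().count(),
            JsString::Owned(s) => s.encode_utf16().count(),
            JsString::Owned16(units) => units.len(),
        }
    }
}
```

With this shape, a literal like "a\uD800" would end up in the Owned16 variant, while ordinary literals could stay borrowed from the source text without any copying.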

I think I have a clear idea of this now and can get started. If something isn't quite right here please correct me.

codehag commented 4 years ago

Not actively working on this right now