Dunqing opened this issue 3 months ago
We need to spend some time finishing the printing of strings and numbers.

`let a = '\udf06'` is broken because `\udf06` is a lone surrogate code unit: https://github.com/evanw/esbuild/blob/ac54f06dbfb146aee3c3c7bea174c7cd248d3194/internal/js_printer/js_printer.go#L197
I fully ported `printUnquotedUTF16` before, but it didn't solve the problem. The oxc parser actually parses `"\udf06"` the same as `"\\udf06"`, so there's no way to distinguish these two strings when printing.
I ran into this issue because I'm working on a JS engine called Nova which uses oxc as its parser, and we can't use the test262 tests dealing with lone surrogates because of this.
I finally understood where the problem is after looking at other implementations written in Rust: https://github.com/boa-dev/boa/blob/532c00862c72d1e7a284a37ba5bb31ec20d2d8a9/core/parser/src/lexer/string.rs#L147

Our implementation was copied from jsparagus, which hasn't implemented this part yet.
While parsing `'\udf06'`, the lone surrogate can't be pushed as a `char`, so the raw characters `\ u d f 0 6` are pushed instead, and the string stays valid UTF-8, identical to the parse of `'\\udf06'`.
What we need is what boa does here: https://github.com/boa-dev/boa/blob/532c00862c72d1e7a284a37ba5bb31ec20d2d8a9/core/parser/src/lexer/string.rs#L56
```rust
impl UTF16CodeUnitsBuffer for Vec<u16> {
    fn push_code_point(&mut self, mut code_point: u32) {
        if let Ok(cp) = code_point.try_into() {
            self.push(cp);
            return;
        }
        code_point -= 0x10000;
        let cu1 = (code_point / 1024 + 0xD800)
            .try_into()
            .expect("decoded an u32 into two u16.");
        let cu2 = (code_point % 1024 + 0xDC00)
            .try_into()
            .expect("decoded an u32 into two u16.");
        self.push(cu1);
        self.push(cu2);
    }
}
```
I'm super confused right now: how should we store these two UTF-16 strings, `"\udf06"` and `"\\udf06"`, in `StringLiteral::value` as `Vec<u8>`? Should we change `StringLiteral::value` to `Vec<u16>` instead?

@andreubotella Can you help me out a little bit here? I don't fully grasp UTF-8, UTF-16 and surrogates.
Similarly to how not all `&[u8]` are valid UTF-8 strings, not all `&[u16]` are valid UTF-16 strings. However, this is the way UTF-16 is defined today. In the early days of Unicode, only 2^16 characters were ever planned, and languages like JS essentially used `&[u16]` as a canonical representation of Unicode, where each item was supposed to be a character. Later, Unicode grew, and UTF-16 was changed so it could represent higher characters, but that meant that some `&[u16]` sequences were no longer valid (that is, those containing lone surrogates). And it was too late for languages like JS to forbid them.

What this means is that some JS strings are invalid UTF-16, and for those strings there isn't a valid UTF-8 representation.
In Nova we're planning to deal with this by using the `wtf8` crate, which gives you a superset of UTF-8 that can also represent lone surrogates. But this crate doesn't give you all of the features that `String` and `str` do.
Another possibility would be to keep a valid UTF-8 representation of the string, where you would encode lone surrogates as the Unicode replacement character (`\uFFFD`, �), which is what `String::from_utf16_lossy` does. And you would additionally have an optional `Vec<u16>` representation, only for invalid UTF-16 strings. If the `Vec<u16>` representation is `Some`, you would know the `Vec<u8>` representation is lossy.
Just to check: this problem only applies to `StringLiteral`s, right? Identifiers can also contain `\u` escapes. Can they be invalid too? Making `StringLiteral::value` a `&[u16]` (or something) I think we could live with, but it'd be a bit of a nightmare if we had to do the same for `BindingIdentifier::name` etc., to accommodate this tiny edge case.
A quick look at native-language implementations: they store strings as `[u16]`, which keeps `"\udf06"` and `"\\udf06"` distinct.
If we were implementing a runtime, I'd go as far as changing it to `[u16]` in our AST. But given that we need to interact with the Rust world for easy string comparisons, I'm in favor of lossy string + `Option<Vec<u16>>`.
> Just to check: this problem only applies to `StringLiteral`s, right? Identifiers can also contain `\u` escapes. Can they be invalid too? Making `StringLiteral::value` a `&[u16]` (or something) I think we could live with, but it'd be a bit of a nightmare if we had to do the same for `BindingIdentifier::name` etc., to accommodate this tiny edge case.
My understanding is that escapes in identifiers can only encode Unicode characters that would otherwise be valid in that position in the identifier. And I think lone surrogates are not valid `ID_Start` or `ID_Continue`. So identifiers could still use a regular `str`.
Yes, from Mathias: https://mathiasbynens.be/notes/javascript-identifiers-es6
Oh thank god! This problem was already annoying, but if it affected identifiers too, it would have been really annoying.
> I'm in favor of lossy string + `Option<Vec<u16>>`.
I don't have a set view on which approach is better, but if going for the extra field, could we make it `Option<Box<&[u16]>>` so it's 8 bytes rather than 16? (Given that this field will be `None` 99.9% of the time.)
Related: https://github.com/oxc-project/oxc/issues/3518#issuecomment-2146978752
https://oxc-project.github.io/oxc/playground/?code=3YCAAICQgICAgICAgIC2mcpqpcsDOJx5VHQkvaEBWf7tgIx%2FfaNQgA%3D%3D