oxc-project / oxc

⚓ A collection of JavaScript tools written in Rust.
https://oxc.rs
MIT License

codegen: string surrogates printed incorrectly #3526

Open Dunqing opened 3 months ago

Dunqing commented 3 months ago

Related: https://github.com/oxc-project/oxc/issues/3518#issuecomment-2146978752

https://oxc-project.github.io/oxc/playground/?code=3YCAAICQgICAgICAgIC2mcpqpcsDOJx5VHQkvaEBWf7tgIx%2FfaNQgA%3D%3D

Boshen commented 2 months ago

We need to spend some time finishing string and number printing.

https://github.com/evanw/esbuild/blob/ac54f06dbfb146aee3c3c7bea174c7cd248d3194/internal/js_printer/js_printer.go#L64

let a = '\udf06' is broken because it's a lone surrogate: https://github.com/evanw/esbuild/blob/ac54f06dbfb146aee3c3c7bea174c7cd248d3194/internal/js_printer/js_printer.go#L197
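
For context on why this case is special: 0xDF06 sits in the low-surrogate range, so it is not a Unicode scalar value and has no char (and no UTF-8 encoding) in Rust. A quick standalone check:

```rust
fn main() {
    // U+DF06 lies in the surrogate range D800..=DFFF; it is not a
    // Unicode scalar value, so Rust refuses to make a char out of it.
    assert!(char::from_u32(0xDF06).is_none());
    // An ordinary BMP code point converts fine.
    assert_eq!(char::from_u32(0x3042), Some('あ'));
}
```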

Dunqing commented 2 months ago

I fully ported printUnquotedUTF16 before, but it didn't solve the problem

andreubotella commented 2 months ago

The oxc parser actually parses "\udf06" the same as "\\udf06", so there's no way to distinguish these two strings when printing.
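
To illustrate the collision (a sketch of the failure mode, not oxc's actual lexer code): a UTF-8-only unescape step that hits a lone-surrogate escape has no char to push, so a natural fallback is the raw escape text, which then produces the same output as the literally-escaped backslash form:

```rust
// Hypothetical sketch of a UTF-8-only unescape step for \uXXXX escapes.
fn decode_unicode_escape(hex: &str) -> String {
    let cp = u32::from_str_radix(hex, 16).expect("valid hex digits");
    match char::from_u32(cp) {
        Some(c) => c.to_string(),
        // Lone surrogate: there is no char to push, so fall back to the
        // raw escape text -- making "\udf06" indistinguishable from
        // the six source characters of "\\udf06".
        None => format!("\\u{hex}"),
    }
}

fn main() {
    assert_eq!(decode_unicode_escape("3042"), "あ");
    assert_eq!(decode_unicode_escape("df06"), "\\udf06");
}
```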

I ran into this issue because I'm working on a JS engine called Nova which uses oxc as its parser, and we can't use the test262 tests dealing with lone surrogates because of this.

Boshen commented 1 month ago

I finally understood where the problem is after looking at other implementations written in Rust: https://github.com/boa-dev/boa/blob/532c00862c72d1e7a284a37ba5bb31ec20d2d8a9/core/parser/src/lexer/string.rs#L147

Our implementation was copied from jsparagus, which hadn't implemented this part yet:

https://github.com/oxc-project/oxc/blob/92ee77487f0fb929f7f3d8098906c519330198a7/crates/oxc_parser/src/lexer/unicode.rs#L239

While parsing '\\udf06', it pushes the characters \ u d f 0 6, so the string remains valid UTF-8.

what we need is https://github.com/boa-dev/boa/blob/532c00862c72d1e7a284a37ba5bb31ec20d2d8a9/core/parser/src/lexer/string.rs#L56

impl UTF16CodeUnitsBuffer for Vec<u16> {
    fn push_code_point(&mut self, mut code_point: u32) {
        // Code points in the BMP (<= 0xFFFF) fit in a single code unit.
        if let Ok(cp) = code_point.try_into() {
            self.push(cp);
            return;
        }
        // Astral code point: encode it as a surrogate pair.
        code_point -= 0x10000;

        let cu1 = (code_point / 1024 + 0xD800) // high (leading) surrogate
            .try_into()
            .expect("decoded an u32 into two u16.");
        let cu2 = (code_point % 1024 + 0xDC00) // low (trailing) surrogate
            .try_into()
            .expect("decoded an u32 into two u16.");
        self.push(cu1);
        self.push(cu2);
    }
}

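
As a sanity check, the same surrogate-pair arithmetic as a standalone function (a sketch independent of boa's trait), splitting an astral code point into its two code units:

```rust
// Standalone version of the surrogate-pair encoding above, so the
// arithmetic can be verified in isolation.
fn encode_code_point(mut cp: u32, out: &mut Vec<u16>) {
    if let Ok(unit) = u16::try_from(cp) {
        out.push(unit); // BMP code point: a single code unit
        return;
    }
    cp -= 0x10000;
    out.push((cp / 0x400 + 0xD800) as u16); // high (leading) surrogate
    out.push((cp % 0x400 + 0xDC00) as u16); // low (trailing) surrogate
}

fn main() {
    let mut buf = Vec::new();
    encode_code_point(0x1F600, &mut buf); // U+1F600 😀
    assert_eq!(buf, vec![0xD83D, 0xDE00]);
}
```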
Boshen commented 1 month ago

I'm super confused right now: how should we store these two UTF-16 strings "\udf06" and "\\udf06" in StringLiteral::value as a Vec<u8>? Should we change StringLiteral::value to Vec<u16> instead?

@andreubotella Can you help me out a little bit here? As I don't fully grasp utf8, utf16 and surrogates.

andreubotella commented 1 month ago

Similarly to how not all &[u8] are valid UTF-8 strings, not all &[u16] are valid UTF-16 strings. However, this is the way UTF-16 is defined today. In the early days of Unicode, only 2^16 characters were ever planned, and languages like JS essentially used &[u16] as a canonical representation of Unicode, where each item was supposed to be a character. Later, Unicode grew, and UTF-16 was changed so it could represent characters beyond that range, but that meant that some &[u16] sequences (namely, those containing lone surrogates) were no longer valid. And it was too late for languages like JS to forbid them.

What this means is that some JS strings are invalid UTF-16, and for those strings there isn't a valid UTF-8 representation.
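
Rust's standard library enforces exactly this: String::from_utf16 rejects a lone surrogate, while a well-formed surrogate pair round-trips:

```rust
fn main() {
    // A lone low surrogate is a legal JS string value but invalid UTF-16,
    // so it cannot become a Rust String.
    assert!(String::from_utf16(&[0xDF06]).is_err());

    // A matched surrogate pair decodes to a single astral character.
    assert_eq!(String::from_utf16(&[0xD83D, 0xDE00]).unwrap(), "😀");
}
```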

In Nova we're planning to deal with this by using the wtf8 crate, which gives you a superset of UTF-8 that can also represent lone surrogates. But this crate doesn't give you all of the features String and str do.

Another possibility would be to keep a valid UTF-8 representation of the string, where you would encode lone surrogates as the Unicode replacement character (\uFFFD, �), which is what String::from_utf16_lossy does. And you would additionally have an optional Vec<u16> representation, only for invalid UTF-16 strings. If the Vec<u16> representation is Some, you would know the Vec<u8> representation is lossy.
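
Concretely, String::from_utf16_lossy substitutes U+FFFD for each unpaired surrogate:

```rust
fn main() {
    let units = [0x0061u16, 0xDF06, 0x0062]; // 'a', lone surrogate, 'b'
    // The lossy conversion replaces the unpaired code unit with U+FFFD,
    // so the original string is no longer recoverable from the UTF-8 form.
    assert_eq!(String::from_utf16_lossy(&units), "a\u{FFFD}b");
}
```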

overlookmotel commented 1 month ago

Just to check: This problem only applies to StringLiterals, right? Identifiers can also contain \u escapes. Can they be invalid too? Making StringLiteral::value a &[u16] (or something) I think we could live with, but it'd be a bit of a nightmare if we had to do the same for BindingIdentifier::name etc, to accommodate this tiny edge case.

Boshen commented 1 month ago

Some quick research on native languages:

If we are implementing a runtime, I'd go as far as changing it to [u16] in our AST.

But, given that we need to interact with the Rust world for easy string comparisons, I'm in favor of lossy string + Option<Vec<u16>>.
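
A rough sketch of that layout (names are illustrative, not oxc's actual API): always keep a UTF-8 form, lossy if needed, and retain the exact code units only when the string is invalid UTF-16:

```rust
// Hypothetical sketch of the "lossy string + Option<Vec<u16>>" idea;
// struct and function names are made up for illustration.
struct StringValue {
    /// Always present; lossy if the source contained lone surrogates.
    lossy_utf8: String,
    /// Some(..) only when `lossy_utf8` is lossy.
    utf16: Option<Vec<u16>>,
}

fn string_value_from_utf16(units: Vec<u16>) -> StringValue {
    match String::from_utf16(&units) {
        // Valid UTF-16: the UTF-8 form is exact, drop the code units.
        Ok(s) => StringValue { lossy_utf8: s, utf16: None },
        // Lone surrogates: keep the lossy form plus the exact units.
        Err(_) => StringValue {
            lossy_utf8: String::from_utf16_lossy(&units),
            utf16: Some(units),
        },
    }
}

fn main() {
    let valid = string_value_from_utf16("hi".encode_utf16().collect());
    assert!(valid.utf16.is_none());

    let broken = string_value_from_utf16(vec![0xDF06]);
    assert_eq!(broken.lossy_utf8, "\u{FFFD}");
    assert!(broken.utf16.is_some());
}
```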

andreubotella commented 1 month ago

Just to check: This problem only applies to StringLiterals, right? Identifiers can also contain \u escapes. Can they be invalid too? Making StringLiteral::value a &[u16] (or something) I think we could live with, but it'd be a bit of a nightmare if we had to do the same for BindingIdentifier::name etc, to accommodate this tiny edge case.

My understanding is that escapes in identifiers can only encode Unicode characters that would otherwise be valid in that position in the identifier. And I think lone surrogates are not valid ID_Start or ID_Continue. So that could still use a regular str.

Boshen commented 1 month ago

Yes, from Mathias: https://mathiasbynens.be/notes/javascript-identifiers-es6

overlookmotel commented 1 month ago

Yes, from Mathias: https://mathiasbynens.be/notes/javascript-identifiers-es6

Oh thank god! This problem was already annoying, but if it affected identifiers too, it would have been really annoying.

I'm in favor of lossy string + Option<Vec<u16>>.

I don't have a set view on which approach is better, but if going for the extra field, could we make it Option<Box<&[u16]>> so it's 8 bytes, rather than 16? (given that this field will be None 99.9% of the time)