ruby / prism

Prism Ruby parser
https://ruby.github.io/prism/
MIT License

Correctly deserialize non-UTF-8 strings for JavaScript #2893

Closed · camertron closed this 1 week ago

camertron commented 3 weeks ago

The Problem

In Ruby, it is possible to create a literal string containing an invalid byte sequence:

str = "\xFF\xFF\xFF"
str.encoding  # => #<Encoding:UTF-8>
str.codepoints  # => invalid byte sequence in UTF-8 (ArgumentError)

This happens on my machine because the default encoding is set to UTF-8:

Encoding.default_external  # => #<Encoding:UTF-8>

When Prism parses this code, it uses a TextDecoder to convert the byte sequence to a JavaScript string. For the example above, the decoder produces the following result:

const buf = new Uint8Array([255, 255, 255]);
new TextDecoder().decode(buf).codePointAt(0);  // => 65533

65533 is the code point of the Unicode replacement character, U+FFFD (TextDecoder instances assume UTF-8 by default). Since the byte sequence above isn't valid UTF-8, the decoder replaces each byte with the replacement character and returns the result as if nothing had gone wrong.

Potential Solution

Rather than produce a string containing Unicode replacement characters, I propose Prism fall back to ASCII decoding. This way, the string produced by Prism will be the same as the string produced by Ruby - a sequence of bytes represented in ASCII encoding.

const buf = new Uint8Array([255, 255, 255]);
new TextDecoder("ascii").decode(buf).codePointAt(0);  // => 255

Prism can fall back only when UTF-8 decoding fails, e.g. by catching the error thrown by a decoder constructed with new TextDecoder("utf-8", { fatal: true }).
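
A minimal sketch of that fallback (decodeString is just an illustrative name, not part of prism):

function decodeString(bytes) {
  try {
    // fatal: true makes the decoder throw on invalid UTF-8 instead of
    // silently substituting U+FFFD.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    // "ascii" is a WHATWG label for windows-1252, so every byte maps to
    // exactly one code point (0xFF => U+00FF).
    return new TextDecoder("ascii").decode(bytes);
  }
}

decodeString(new Uint8Array([255, 255, 255])).codePointAt(0);  // => 255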

camertron commented 3 weeks ago

I suppose we could use a combination of both strategies. We can't use ASCII for any/all non-UTF-8 encodings because Ruby string literals can include both valid UTF-8 sequences and arbitrary bytes, e.g. "\xFF鹿\xFF". We will still need the approach in this PR to handle decoding those kinds of strings embedded in UTF-8-encoded files.

The idea behind this PR was to essentially encode a series of bytes in UTF-16, but I think we're in a bit of a tough spot here. There's currently no way for the consumer of the parse tree to know when they're receiving a "normal" string containing a well-formed sequence of UTF-16 characters versus a string that only looks like UTF-16 but actually consists of a sequence of packed bytes. It's even more confusing if we fall back to packing when encountering an invalid byte sequence in a UTF-8 string (i.e. the proposal in this PR).

In light of that, I think we probably want the following:

  1. Modify the parser to set a flag when it encounters arbitrary byte sequences in string literals, e.g. \xFF.
  2. When deserializing, check the value of the flag. If the string contains arbitrary byte sequences, use the ASCII decoder; otherwise use the UTF-8 decoder (see the sketch below).
  3. Set a flag on the AST node indicating the type of decoding that occurred.
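
A rough sketch of what step 2 might look like in the JavaScript deserializer; the flag value and function name below are placeholders, not prism's actual serialization format:

const FORCED_BINARY_ENCODING = 1 << 2;  // placeholder bit position

function decodeWithFlags(bytes, flags) {
  // Arbitrary byte sequences get the one-code-point-per-byte decoder;
  // everything else is treated as UTF-8.
  const label = (flags & FORCED_BINARY_ENCODING) ? "ascii" : "utf-8";
  return new TextDecoder(label).decode(bytes);
}
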
kddnewton commented 2 weeks ago

So these flags should already be there, depending on the combination of the byte sequences and the encoding of the file. For example:

# encoding: ascii
"\xFF"

is going to give you:

@ ProgramNode (location: (2,0)-(2,6))
├── locals: []
└── statements:
    @ StatementsNode (location: (2,0)-(2,6))
    └── body: (length: 1)
        └── @ StringNode (location: (2,0)-(2,6))
            ├── flags: forced_binary_encoding
            ├── opening_loc: (2,0)-(2,1) = "\""
            ├── content_loc: (2,1)-(2,5) = "\\xFF"
            ├── closing_loc: (2,5)-(2,6) = "\""
            └── unescaped: "\xFF"

and:

# encoding: ascii
"\u{80}"

will give you:

@ ProgramNode (location: (2,0)-(2,8))
├── locals: []
└── statements:
    @ StatementsNode (location: (2,0)-(2,8))
    └── body: (length: 1)
        └── @ StringNode (location: (2,0)-(2,8))
            ├── flags: forced_utf8_encoding
            ├── opening_loc: (2,0)-(2,1) = "\""
            ├── content_loc: (2,1)-(2,7) = "\\u{80}"
            ├── closing_loc: (2,7)-(2,8) = "\""
            └── unescaped: "\xC2\x80"

so in theory you could look at the flags on the string to know how to decode.

camertron commented 2 weeks ago

Ah yeah, I saw those flags while looking into this, but they didn't immediately make sense to me. I just wrote a little script that tries various combinations of strings and file encodings to help me visualize what they mean. Maybe the table will be helpful to others who run across this PR:

Encoding | String       | Forced UTF-8 | Forced binary
-------- | ------------ | ------------ | -------------
UTF-8    | \xFF\xFF\xFF | true         | false
UTF-8    | \xFF鹿\xFF   | true         | false
UTF-8    | 鹿           | false        | false
UTF-8    | \u{9e7f}     | true         | false
ascii    | \xFF\xFF\xFF | false        | true
ascii    | \xFF鹿\xFF   | false        | true
ascii    | 鹿           | false        | false
ascii    | \u{9e7f}     | true         | false
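
Roughly, the script does something like this (assuming Prism.parse and the forced_utf8_encoding?/forced_binary_encoding? flag predicates on StringNode):

require "prism"

["UTF-8", "ascii"].each do |encoding|
  ['"\xFF\xFF\xFF"', '"\xFF鹿\xFF"', '"鹿"', '"\u{9e7f}"'].each do |string|
    source = "# encoding: #{encoding}\n#{string}\n"
    node = Prism.parse(source).value.statements.body.first
    puts [encoding, string, node.forced_utf8_encoding?, node.forced_binary_encoding?].join("  ")
  end
end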

Things to note:

  1. Strings in UTF-8-encoded files are never forced binary.
  2. Strings in UTF-8-encoded files are forced UTF-8 if they contain bytes via e.g. \x syntax or \u syntax.
  3. Strings in ASCII-encoded files are only forced binary if the string contains bytes via e.g. \x syntax.
  4. Strings in ASCII-encoded files are only forced UTF-8 if the \u syntax is used.

I'm not sure there's enough information here to choose the right TextDecoder. We could use the ASCII decoder whenever either forced flag is set, but unfortunately "\u{9e7f}" is forced UTF-8 in both UTF-8- and ASCII-encoded files despite producing a valid UTF-8 byte sequence; we should use the UTF-8 decoder in that case.

Does that make sense or am I totally missing something?

kddnewton commented 2 weeks ago

That's almost correct.

> Strings in UTF-8-encoded files are never forced binary.

This is correct. Regardless of the bytes involved, UTF-8-encoded files always contain UTF-8-encoded strings. Files can, however, contain syntax errors if they have bytes that are invalid in UTF-8, as in:

irb(main):001> eval(%Q{"\xFF"})
(irb):1:in `eval': (eval at (irb):1):1: invalid multibyte char (UTF-8) (SyntaxError)

> Strings in UTF-8-encoded files are forced UTF-8 if they contain bytes via e.g. \x syntax or \u syntax.

This is true. Strings in Ruby can be "broken" and contain invalid bytes, in which case String#valid_encoding? returns false. In the context of prism, if a string is forced_utf8_encoding it means an escape sequence locked in the encoding. This can happen with other file encodings as well; for example, if you have:

# encoding: Shift_JIS
"\u{80}"

then the string will be UTF-8 encoded, not Shift_JIS encoded.

> Strings in ASCII-encoded files are only forced binary if the string contains bytes via e.g. \x syntax.

This is true if you're talking about valid files only. Invalid files can have an error attached but still have strings flagged forced_binary_encoding if they contain any byte >= 0x80.

> Strings in ASCII-encoded files are only forced UTF-8 if the \u syntax is used.

This is true.

Given these constraints, maybe it's best to do something like you said, as in:

if flags & forced_binary_encoding
  just use raw bytes
else
  encoding = flags & forced_utf8_encoding ? utf-8 : file encoding
  try { decode with encoding } catch { just use raw bytes, capture what the encoding should be, set flag saying valid_encoding? is false }
end
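
In JavaScript that might look roughly like this (the flag constants, fileEncoding parameter, and return shape are placeholders, not prism's actual deserializer API):

// Placeholder flag bits; the real values come from prism's serialization format.
const FORCED_UTF8_ENCODING   = 1 << 1;
const FORCED_BINARY_ENCODING = 1 << 2;

function decodeUnescaped(bytes, flags, fileEncoding) {
  if (flags & FORCED_BINARY_ENCODING) {
    // Just use raw bytes, one code point per byte.
    return { value: new TextDecoder("ascii").decode(bytes), validEncoding: true };
  }

  const encoding = (flags & FORCED_UTF8_ENCODING) ? "utf-8" : fileEncoding;

  try {
    return { value: new TextDecoder(encoding, { fatal: true }).decode(bytes), validEncoding: true };
  } catch {
    // Fall back to raw bytes, remember what the encoding should have been,
    // and record that valid_encoding? would be false on the Ruby side.
    return { value: new TextDecoder("ascii").decode(bytes), encoding, validEncoding: false };
  }
}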

also it appears that more encodings than I thought are supported by browsers according to https://encoding.spec.whatwg.org/#names-and-labels. You might be able to get support for most of the encodings that prism/Ruby supports through it.
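
For example, the shift_jis label appears to work out of the box:

new TextDecoder("shift_jis").decode(new Uint8Array([0x82, 0xa0]));  // => "あ"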

kddnewton commented 2 weeks ago

We might even be able to support other encodings with x-user-defined if we're clever.
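
For context, x-user-defined maps bytes 0x80 through 0xFF to U+F780 through U+F7FF, so the original byte value stays recoverable with a mask:

const cp = new TextDecoder("x-user-defined")
  .decode(new Uint8Array([0xff]))
  .codePointAt(0);

cp.toString(16);  // => "f7ff"
cp & 0xff;        // => 255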

camertron commented 1 week ago

Ok awesome, that makes sense to me 👍 I'll try to update this PR with that algorithm.

> capture what the encoding should be, set flag saying valid_encoding? is false

Hmm, you mean set a flag on the StringNode? Hopefully I can figure out how to do that 😅

> also it appears that more encodings than I thought are supported by browsers

Oh yeah, I noticed that too. I think all of the label strings in that list can be passed to TextDecoder's constructor.

camertron commented 1 week ago

Alright, I'm going to close this PR and open a new one with an updated description. I'd rather do that than edit this one, so our conversation can be preserved.