I suppose we could use a combination of both strategies. We can't use ASCII for any/all non-UTF-8 encodings because Ruby string literals can include both valid UTF-8 sequences and arbitrary bytes, eg. `"\xFF鹿\xFF"`. We will still need the approach in this PR to handle decoding those kinds of strings embedded in UTF-8-encoded files.
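For concreteness, here's how such a mixed literal behaves in Ruby (a quick irb check; `\u{9e7f}` is the escaped form of 鹿):

```ruby
s = "\xFF\u{9e7f}\xFF"  # same string as "\xFF鹿\xFF"
s.encoding              # => #<Encoding:UTF-8>
s.valid_encoding?       # => false
s.bytes                 # => [255, 233, 185, 191, 255]
```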
The idea behind this PR was to essentially encode a series of bytes in UTF-16, but I think we're in a bit of a tough spot here. There's currently no way for the consumer of the parse tree to know when they're receiving a "normal" string containing a well-formed sequence of UTF-16 characters versus a string that only looks like UTF-16 but actually consists of a sequence of packed bytes. It's even more confusing if we fall back to packing when encountering an invalid byte sequence in a UTF-8 string (i.e. the proposal in this PR).
In light of that, I think we probably want the following:

- some way to decode strings containing invalid bytes like `\xFF`, and
- a flag on the `Node` indicating the type of decoding that occurred.

So these flags should already be there, depending on the combination of the byte sequences and the encoding of the file. For example:
```ruby
# encoding: ascii
"\xFF"
```
is going to give you:
```
@ ProgramNode (location: (2,0)-(2,6))
├── locals: []
└── statements:
    @ StatementsNode (location: (2,0)-(2,6))
    └── body: (length: 1)
        └── @ StringNode (location: (2,0)-(2,6))
            ├── flags: forced_binary_encoding
            ├── opening_loc: (2,0)-(2,1) = "\""
            ├── content_loc: (2,1)-(2,5) = "\\xFF"
            ├── closing_loc: (2,5)-(2,6) = "\""
            └── unescaped: "\xFF"
```
and:
```ruby
# encoding: ascii
"\u{80}"
```
will give you:
```
@ ProgramNode (location: (2,0)-(2,8))
├── locals: []
└── statements:
    @ StatementsNode (location: (2,0)-(2,8))
    └── body: (length: 1)
        └── @ StringNode (location: (2,0)-(2,8))
            ├── flags: forced_utf8_encoding
            ├── opening_loc: (2,0)-(2,1) = "\""
            ├── content_loc: (2,1)-(2,7) = "\\u{80}"
            ├── closing_loc: (2,7)-(2,8) = "\""
            └── unescaped: "\xC2\x80"
```
So in theory you could look at the flags on the string to know how to decode.
Ah yeah, I saw those flags while looking into this, but they didn't immediately make sense to me. I just wrote a little script that tries various combinations of strings and file encodings to help me visualize what they mean (a rough sketch of it follows the table). Maybe the table will be helpful to others who run across this PR:
| Encoding | String | Forced UTF-8 | Forced binary |
|---|---|---|---|
| UTF-8 | `\xFF\xFF\xFF` | true | false |
| UTF-8 | `\xFF鹿\xFF` | true | false |
| UTF-8 | `鹿` | false | false |
| UTF-8 | `\u{9e7f}` | true | false |
| ascii | `\xFF\xFF\xFF` | false | true |
| ascii | `\xFF鹿\xFF` | false | true |
| ascii | `鹿` | false | false |
| ascii | `\u{9e7f}` | true | false |
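The script isn't reproduced here, but a rough reconstruction might look like this (it assumes prism's Ruby API: `Prism.parse` plus the generated `forced_*_encoding?` flag predicates on `StringNode`):

```ruby
require "prism"

# For each (file encoding, string literal) pair, parse a tiny source
# file and report which forced-encoding flags prism sets on the
# resulting StringNode.
encodings = ["utf-8", "ascii"]
literals  = ['"\xFF\xFF\xFF"', '"\xFF鹿\xFF"', '"鹿"', '"\u{9e7f}"']

encodings.each do |encoding|
  literals.each do |literal|
    source = "# encoding: #{encoding}\n#{literal}"
    string = Prism.parse(source).value.statements.body.first
    puts [encoding, literal, string.forced_utf8_encoding?, string.forced_binary_encoding?].join(" | ")
  end
end
```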
Things to note:

- Strings in UTF-8-encoded files are never forced binary.
- Strings in UTF-8-encoded files are forced UTF-8 if they contain bytes via eg. `\x` syntax or `\u` syntax.
- Strings in ASCII-encoded files are only forced binary if the string contains bytes via eg. `\x` syntax.
- Strings in ASCII-encoded files are only forced UTF-8 if the `\u` syntax is used.

I'm not sure there's enough information to correctly choose the right `TextDecoder`. We can use the ASCII decoder if either forced flag is set. Unfortunately `"\u{9e7f}"` is forced UTF-8 in both UTF-8- and ASCII-encoded files despite containing a valid UTF-8 byte sequence - we should use the UTF-8 decoder in that case.
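To make that concrete, here's what happens when you decode the unescaped bytes of `"\u{9e7f}"` with each decoder (note that per the WHATWG Encoding Standard, the "ascii" label actually resolves to windows-1252):

```javascript
// The unescaped bytes of "\u{9e7f}" are the UTF-8 encoding of 鹿.
const bytes = new Uint8Array([0xe9, 0xb9, 0xbf]);

new TextDecoder("utf-8").decode(bytes); // => "鹿"

// "ascii" is a label for windows-1252, so the same bytes come back as
// three unrelated single-byte characters instead:
new TextDecoder("ascii").decode(bytes); // => "é¹¿"
```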
Does that make sense or am I totally missing something?
That's almost correct.

> Strings in UTF-8-encoded files are never forced binary.
This is correct. Regardless of bytes, UTF-8-encoded files always contain UTF-8-encoded strings. However, files can contain syntax errors if they have invalid UTF-8 bytes, as in:

```
irb(main):001> eval(%Q{"\xFF"})
(irb):1:in `eval': (eval at (irb):1):1: invalid multibyte char (UTF-8) (SyntaxError)
```
> Strings in UTF-8-encoded files are forced UTF-8 if they contain bytes via eg. `\x` syntax or `\u` syntax.
This is true. Strings in Ruby can be "broken" and contain invalid bytes. That means they'll respond `false` to `String#valid_encoding?`. In the context of prism, if a string is `forced_utf8_encoding` then it means there was an escape sequence that locked in the encoding. This can happen in other files as well, for example if you have:
```ruby
# encoding: Shift_JIS
"\u{80}"
```
then the string will be UTF-8 encoded, not Shift_JIS encoded.
> Strings in ASCII-encoded files are only forced binary if the string contains bytes via eg. `\x` syntax.
This is true if you're talking about valid files only. Invalid files can have an error attached but still have strings with `forced_binary_encoding` if they have anything >= 0x80 in them.
> Strings in ASCII-encoded files are only forced UTF-8 if the `\u` syntax is used.
This is true.
Given these constraints, maybe it's best to do something like you said, as in:
```
if flags & forced_binary_encoding
  just use raw bytes
else
  encoding = flags & forced_utf8_encoding ? utf-8 : file encoding
  try { decode with encoding } catch { just use raw bytes, capture what the encoding should be, set flag saying valid_encoding? is false }
end
```
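A rough JavaScript rendering of that sketch might look like the following (the flag bit values, the result shape, and the `decodeRaw` helper are all illustrative, not prism's actual API):

```javascript
// Illustrative sketch only: the flag constants and decodeRaw are hypothetical.
const FORCED_UTF8_ENCODING = 1 << 0;
const FORCED_BINARY_ENCODING = 1 << 1;

function decodeString(bytes, flags, fileEncoding) {
  if (flags & FORCED_BINARY_ENCODING) {
    // Just use the raw bytes.
    return { value: decodeRaw(bytes), validEncoding: true };
  }

  const encoding = flags & FORCED_UTF8_ENCODING ? "utf-8" : fileEncoding;
  try {
    // fatal: true makes the decoder throw on invalid input instead of
    // silently substituting U+FFFD.
    const value = new TextDecoder(encoding, { fatal: true }).decode(bytes);
    return { value, validEncoding: true };
  } catch {
    // Fall back to the raw bytes, capture what the encoding should have
    // been, and record that valid_encoding? would be false in Ruby.
    return { value: decodeRaw(bytes), encoding, validEncoding: false };
  }
}

// One character per byte, mirroring Ruby's byte-level view of the string.
function decodeRaw(bytes) {
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}
```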
Also, it appears that more encodings than I thought are supported by browsers, according to https://encoding.spec.whatwg.org/#names-and-labels. You might be able to get support for most of the encodings that prism/Ruby supports through it.
We might even be able to support other encodings with `x-user-defined` if we're clever.
Ok awesome, that makes sense to me 👍 I'll try to update this PR with that algorithm.
> capture what the encoding should be, set flag saying valid_encoding? is false
Hmm, you mean set a flag on the `StringNode`? Hopefully I can figure out how to do that 😅
> Also, it appears that more encodings than I thought are supported by browsers
Oh yeah, I noticed that too. I think all these strings can be passed to `TextDecoder`'s constructor.
Alright, I'm going to close this PR and open a new one with an updated description. I'd rather do that than edit this one, so our conversation can be preserved.
The Problem
In Ruby, it is possible to create a literal string containing an invalid byte sequence.
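For example (a representative literal; the original snippet isn't preserved, but any escape producing invalid UTF-8 bytes behaves the same way):

```ruby
s = "\xFF\xFF\xFF"  # three bytes that are not valid UTF-8
s.valid_encoding?   # => false
```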
This happens on my machine because the default source encoding is set to UTF-8.
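A quick check (assuming stock settings, where Ruby source files default to UTF-8):

```ruby
__ENCODING__  # => #<Encoding:UTF-8>, the source encoding in effect
s.encoding    # => #<Encoding:UTF-8>, inherited from the source encoding
```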
When Prism parses this code, it uses a `TextDecoder` to convert the byte sequence to a JavaScript string. For the example above, the decoder produces a string of Unicode replacement characters: 65533 is the Unicode replacement character U+FFFD (instances of `TextDecoder` assume UTF-8 by default).
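A quick reproduction, using the bytes of the three-byte example above:

```javascript
const bytes = new Uint8Array([0xff, 0xff, 0xff]);
const str = new TextDecoder().decode(bytes); // UTF-8, non-fatal by default

console.log(str);               // => "���"
console.log(str.charCodeAt(0)); // => 65533, i.e. U+FFFD
```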
Since the byte sequence above isn't valid UTF-8, the decoder replaces each of the bytes with the Unicode replacement character and returns as if nothing went wrong.

Potential Solution
Rather than produce a string containing Unicode replacement characters, I propose Prism fall back to ASCII decoding. This way, the string produced by Prism will be the same as the string produced by Ruby - a sequence of bytes represented in ASCII encoding.
Prism can fall back only in the case that UTF-8 decoding fails, eg. by catching the error thrown by `new TextDecoder("utf-8", { fatal: true })`.
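A minimal sketch of that fallback, assuming the Encoding Standard's "ascii" label (which resolves to windows-1252 and decodes every byte to exactly one character):

```javascript
// Sketch only: decode strictly as UTF-8, falling back to a single-byte
// decode when the bytes aren't valid UTF-8.
function decodeUtf8WithFallback(bytes) {
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    // "ascii" is a windows-1252 alias in browsers; it never throws.
    return new TextDecoder("ascii").decode(bytes);
  }
}

decodeUtf8WithFallback(new Uint8Array([0xff, 0xff, 0xff])); // => "ÿÿÿ"
```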