This PR is a continuation of https://github.com/ruby/prism/pull/2893 that uses several flags from the parser to more correctly decode strings for use in JavaScript.
I thought the most straightforward way to get everything working was to introduce a new datatype called EncodedString that contains three pieces of information: 1) the string itself, 2) the encoding of the string, and 3) whether or not the string is valid in its original encoding. Unfortunately, these new fields necessitate a breaking change to Prism's API: a string's value is now accessible via .unescaped.value rather than simply .unescaped. If that's a problem, I can work up a solution that is purely additive.
Actual decoding is done following the algorithm @kddnewton devised in #2893:
if flags & forced_binary_encoding
  just use raw bytes
else
  encoding = flags & forced_utf8_encoding ? utf-8 : file encoding
  try {
    decode with encoding
  } catch {
    just use raw bytes, capture what the encoding should be, set flag saying valid_encoding? is false
  }
end
The idea here is that, if a string with an invalid encoding is provided, Prism should treat it as binary and return a JavaScript string where the characters are the raw bytes of the original string, encoded in UTF-16. This is done to better preserve the contents of the string and provide enough metadata so that Ruby implementations can correctly operate on it.
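For reviewers who want something concrete, here is a minimal JavaScript sketch of the pseudocode above. It is not the code in this PR: the flag constants, the function names, the `encoding` and `validEncoding` field names, and the "ascii-8bit" label used for forced-binary strings are all assumptions made for illustration. The only platform behavior it relies on is that `TextDecoder` constructed with `{ fatal: true }` throws on malformed input.

```javascript
// Hypothetical flag bits; the real values come from Prism's node flags.
const FORCED_UTF8_ENCODING = 1 << 0;
const FORCED_BINARY_ENCODING = 1 << 1;

// Map each byte to the UTF-16 code unit with the same value (0x00-0xFF),
// so no byte of the original string is lost.
function rawBytesToString(bytes) {
  let result = "";
  for (const byte of bytes) {
    result += String.fromCharCode(byte);
  }
  return result;
}

// bytes: Uint8Array of the string's unescaped bytes
// flags: bit field from the parser
// fileEncoding: encoding name of the source file, e.g. "utf-8"
function decodeString(bytes, flags, fileEncoding) {
  if (flags & FORCED_BINARY_ENCODING) {
    // "ascii-8bit" is an assumed label for binary strings.
    return { value: rawBytesToString(bytes), encoding: "ascii-8bit", validEncoding: true };
  }

  const encoding = flags & FORCED_UTF8_ENCODING ? "utf-8" : fileEncoding;

  try {
    // fatal: true makes decode() throw instead of inserting U+FFFD.
    const decoder = new TextDecoder(encoding, { fatal: true });
    return { value: decoder.decode(bytes), encoding, validEncoding: true };
  } catch (error) {
    // Malformed bytes (or an encoding TextDecoder does not know):
    // keep the raw bytes and record the encoding the string should have had.
    return { value: rawBytesToString(bytes), encoding, validEncoding: false };
  }
}
```

With this sketch, the bytes `0xFF 0x61` in a UTF-8 file come back as a two-character string with `validEncoding` set to false, rather than having the invalid byte replaced by U+FFFD.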
This PR also now supports every encoding that JavaScript's TextDecoder class supports. There should be an almost 1:1 mapping between Ruby/Prism encoding names and JavaScript encoding names, but I haven't actually tried to map them, so encoding names are currently passed to TextDecoder as-is. If that's a problem, I'd be happy to work on such a mapping.
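If a mapping does turn out to be needed, one possible starting point is to probe which encoding names TextDecoder already accepts. The helper below is a hypothetical sketch, not part of this PR. It relies on the `TextDecoder` constructor throwing for labels it does not recognize, and on the `encoding` property reporting the canonical lowercase name.

```javascript
// Returns the canonical TextDecoder name for an encoding label,
// or null if this runtime's TextDecoder does not recognize it.
// textDecoderNameFor is a hypothetical helper name.
function textDecoderNameFor(encodingName) {
  try {
    // Labels are matched case-insensitively, so "UTF-8" and "utf-8" both work.
    return new TextDecoder(encodingName).encoding;
  } catch (error) {
    // Unrecognized labels make the constructor throw.
    return null;
  }
}
```

Any Ruby/Prism encoding name for which this returns null would need an explicit entry in a mapping table, or would have to fall back to the raw-bytes path.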