ruby / prism

Prism Ruby parser
https://ruby.github.io/prism/
MIT License

Correctly deserialize non-UTF-8 strings for JavaScript #2916

Open camertron opened 1 week ago

camertron commented 1 week ago

This PR is a continuation of https://github.com/ruby/prism/pull/2893 that uses several flags from the parser to more correctly decode strings for use in JavaScript.

I thought the most straightforward way to get everything working was to introduce a new datatype called EncodedString that contains three pieces of information: 1) the string itself, 2) the encoding of the string, and 3) whether or not the string is valid in its original encoding. Unfortunately, these new fields have necessitated a breaking change to Prism's API: a string's value is now accessible via .unescaped.value rather than simply .unescaped. If that's a problem, I can work up a solution that is purely additive.
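To make the shape of the change concrete, here is a minimal sketch of what such a datatype could look like. Only the name EncodedString and the .unescaped.value access path come from this description; the property names `encoding` and `validEncoding` and the stand-in `node` object are assumptions for illustration.

```javascript
// Hypothetical sketch of the EncodedString datatype described above.
// The property names `encoding` and `validEncoding` are assumptions.
class EncodedString {
  constructor(value, encoding, validEncoding) {
    this.value = value;                 // the decoded (or raw-byte) string
    this.encoding = encoding;           // name of the string's encoding
    this.validEncoding = validEncoding; // false if the bytes did not decode cleanly
  }
}

// Stand-in for a parsed string node; real nodes come from the parser.
const node = { unescaped: new EncodedString("hello", "utf-8", true) };

// Previously callers read `node.unescaped`; now the text is one level deeper.
console.log(node.unescaped.value);
```

Callers that previously treated .unescaped as a plain string would need to switch to .unescaped.value, which is the breaking part of the change.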

Actual decoding is done following the algorithm @kddnewton devised in #2893:

if flags & forced_binary_encoding
  just use raw bytes
else
  encoding = flags & forced_utf8_encoding ? utf-8 : file encoding
  try {
    decode with encoding
  } catch {
    just use raw bytes, capture what the encoding should be, set flag saying valid_encoding? is false
  }
end
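The pseudocode above could be sketched in JavaScript roughly as follows. This is not the code in the PR: the flag values, the function names, the returned object shape, and the "ascii-8bit" label for binary strings are all assumptions made for illustration. It relies on TextDecoder's `fatal` option, which makes decoding throw instead of substituting replacement characters.

```javascript
// Placeholder flag values for illustration; the real ones come from Prism.
const FORCED_UTF8_ENCODING = 1 << 0;
const FORCED_BINARY_ENCODING = 1 << 1;

// Map each byte to the character with the same code unit value.
function rawBytesToString(bytes) {
  let out = "";
  for (const byte of bytes) out += String.fromCharCode(byte);
  return out;
}

// Hypothetical helper following the algorithm described above.
function decodeString(bytes, flags, fileEncoding) {
  if (flags & FORCED_BINARY_ENCODING) {
    // "ascii-8bit" is an assumed label for Ruby's binary encoding.
    return { value: rawBytesToString(bytes), encoding: "ascii-8bit", validEncoding: true };
  }

  const encoding = flags & FORCED_UTF8_ENCODING ? "utf-8" : fileEncoding;
  try {
    // fatal: true throws a TypeError on malformed input; an unknown
    // encoding label throws a RangeError from the constructor.
    const decoder = new TextDecoder(encoding, { fatal: true });
    return { value: decoder.decode(bytes), encoding, validEncoding: true };
  } catch {
    // Fall back to raw bytes, but remember the intended encoding.
    return { value: rawBytesToString(bytes), encoding, validEncoding: false };
  }
}

// 0xff can never appear in well-formed UTF-8, so this takes the fallback path.
console.log(decodeString(new Uint8Array([0xff, 0x61]), 0, "utf-8"));
```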

The idea here is that, if a string with an invalid encoding is provided, Prism should treat it as binary and return a JavaScript string in which each character's code unit holds one raw byte of the original string (JavaScript strings are UTF-16, so each byte becomes one 16-bit code unit). This better preserves the contents of the string and provides enough metadata for Ruby implementations to operate on it correctly.
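One consequence of this representation is that a consumer can recover the original bytes exactly. The sketch below illustrates the round trip; the helper names `toBinaryString` and `toBytes` are hypothetical and not part of Prism.

```javascript
// Hypothetical helpers showing that the raw-byte representation is lossless.

// Bytes -> string: one UTF-16 code unit per byte.
function toBinaryString(bytes) {
  return Array.from(bytes, (byte) => String.fromCharCode(byte)).join("");
}

// String -> bytes: every code unit is below 0x100, so it fits in one byte.
function toBytes(binaryString) {
  return Uint8Array.from(binaryString, (ch) => ch.charCodeAt(0));
}

const originalBytes = new Uint8Array([0xff, 0xfe, 0x61]);
const asString = toBinaryString(originalBytes);
const roundTripped = toBytes(asString);

console.log(asString.length);       // one character per byte
console.log(Array.from(roundTripped));
```

Decoding the same bytes with a lenient TextDecoder would instead replace malformed sequences with U+FFFD, and the original bytes could no longer be reconstructed.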

This PR also now supports every encoding that JavaScript's TextDecoder class supports. There should be an almost 1:1 mapping between Ruby/Prism encoding names and JavaScript encoding names, but I haven't actually tried to map them. If that's a problem, I'd be happy to work on such a mapping.
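As a starting point for such a mapping, one could probe which names TextDecoder accepts as-is. The sketch below is an assumption about how that check might look, not something this PR implements; the helper name and the sample names are illustrative, and results for legacy encodings can depend on how the JavaScript runtime was built.

```javascript
// Hypothetical probe: does TextDecoder accept this encoding name unchanged?
// The constructor throws a RangeError for labels it does not recognize.
function isSupportedByTextDecoder(name) {
  try {
    new TextDecoder(name);
    return true;
  } catch {
    return false;
  }
}

// Sample Ruby-style encoding names, chosen only for illustration.
const candidateNames = ["UTF-8", "Shift_JIS", "EUC-JP", "ASCII-8BIT"];

for (const name of candidateNames) {
  console.log(name, isSupportedByTextDecoder(name));
}
```

Any name that is rejected would need either an explicit alias in a mapping table or the raw-bytes fallback described above.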