Allow processing of string literals with invalid encoding

gmcgibbon commented 4 years ago

Currently, ruby_parser cannot unescape string literals with invalid encoding (in my tests, strings that mix Unicode and binary characters). When scanning a file like this:

class ValidatorTest < Minitest::Test
  def test_reject_malformed_string
    malformed_string = "\xBA\xBBƮ\xC1\xF7\xB1\xB8"
    refute_predicate Validator.new(malformed_string), :valid?
  end
end

The stack trace ends at ruby_parser-3.14.1/lib/ruby_lexer.rex.rb:163:in `gsub': incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError). I've done two things to fix this:

Make the lexer test case file UTF-8. This is needed to add a weirdly encoded string test.
Change the rex template to parse literals with #process_string_literal. It does essentially the same thing, with a fallback that encodes the unquoted literal content as ASCII-8BIT.
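
For readers unfamiliar with this failure mode, here is a minimal sketch of the fallback idea in plain Ruby. It is not the code from this PR: unescape_hex and unescape_with_fallback are hypothetical names, and only \xNN escapes are handled. Substituting a binary byte into a string that also contains a non-ASCII UTF-8 character can raise Encoding::CompatibilityError, so the sketch retries on a binary copy of the input.

```ruby
# Hypothetical sketch; these helper names are not ruby_parser's API.

# Replace each \xNN escape with the raw byte it denotes.
# Integer#chr returns an ASCII-8BIT string for values 0x80..0xFF.
def unescape_hex(str)
  str.gsub(/\\x(\h{2})/) { $1.hex.chr }
end

# Try the normal path first; on an encoding clash, retry on a binary copy.
def unescape_with_fallback(str)
  unescape_hex(str)
rescue Encoding::CompatibilityError
  unescape_hex(str.b)
end

# Literal source text: backslash escapes (single-quoted, so they stay
# unescaped) around one UTF-8 character, U+01AE.
source = '\xBA\xBB' + "\u01AE" + '\xC1'

result = unescape_with_fallback(source)
p result.bytes # [186, 187, 198, 174, 193]
```

The UTF-8 character survives as its two raw bytes (0xC6 0xAE), which is the trade-off of falling back to binary: the bytes are preserved, but the result is no longer tagged as text.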

zenspider commented 4 years ago

I couldn't sleep so I worked on this w/o remembering your PR. I came up with nearly the same solution with some minor exceptions. I use b instead of dup simply because it is shorter. One real difference is that in the assertions I force the encoding to the encoding of the expected... This lets me have tests with binary (or other) data and it needs to match. (I guess I need some tests to JUST check the encodings on output tho).
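
A small illustration of the two points above, written as a standalone sketch rather than the actual test code: String#b returns a binary-tagged copy, and two strings with identical bytes can still compare unequal when their encodings are incompatible. The helper same_bytes_as_expected? is a hypothetical name used only here.

```ruby
# A UTF-8-tagged string holding bytes that are not valid UTF-8.
utf8_tagged = "\xBA\xBB".dup.force_encoding(Encoding::UTF_8)

# String#b returns a copy tagged ASCII-8BIT (binary); the receiver is untouched.
binary = utf8_tagged.b

puts utf8_tagged.valid_encoding?         # false
puts binary.encoding == Encoding::BINARY # true
puts binary == utf8_tagged               # false: same bytes, incompatible encodings

# Hypothetical assertion helper: retag the actual value with the expected
# value's encoding before comparing, so only the bytes are checked.
def same_bytes_as_expected?(expected, actual)
  expected == actual.dup.force_encoding(expected.encoding)
end

puts same_bytes_as_expected?(binary, utf8_tagged) # true
```

Forcing the encoding this way means the comparison no longer checks the output encoding itself, which is why a separate encoding-only test is still worth having.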

I even had to work out the US-ASCII declaration from the test file because I just couldn't figure out how to write the damn test otherwise.

Thank you for your contribution. Sorry I overlooked it. I did give you credit in the commit.

gmcgibbon commented 4 years ago

Thanks @zenspider, happy to see this fixed! 😄

seattlerb / ruby_parser

Allow processing of string literals with invalid encoding #305