Add test for parsing broken strings and use String#encode instead of rb_str_conv_enc() in parser

eregon commented 3 weeks ago

From https://github.com/ruby/json/pull/643/files#r1822740241

eregon commented 3 weeks ago

This is only for the parsing part. I think there it's fair to treat BINARY as UTF-8 unconditionally, because in the end we're just parsing bytes. If the encoding of the source string is unknown (== BINARY) it seems rather fair to assume it's actually UTF-8 for JSON.

For the generating part I think we could do the same: str.encoding == BINARY ? str.dup.force_encoding(UTF-8) : str.encode(UTF-8). But it's less clear if that's good or bad, because we're potentially trying to dump something which cannot be dumped faithfully. And it seems to already be an error in all previous JSON versions (except JSON 1.7.7 in Ruby 2.0!) when the String is broken, for ruby -rjson -e 'p JSON.dump(["\x80"])'. Let's keep that error (ref: https://github.com/ruby/json/issues/634). When it's binary but not broken, ruby -rjson -e 'p JSON::VERSION; p JSON.dump(["é".b])', it seems only since 2.6.1, 2.5.1 and before gave Encoding::UndefinedConversionError. Only the C extension as this behavior, so I think it's good to warn there (ref: https://github.com/ruby/json/pull/609).

TLDR: I think this change is good for the parsing part, for the generating part it seems good as-is and to add the warning in https://github.com/ruby/json/issues/642.

eregon commented 3 weeks ago

It's kind of funky that JSON used rb_str_export_to_enc() in #634 and rb_str_conv_enc() and both of these have error-prone semantics in case the conversion fails. At least now it should be consistent and just use String#encode, like the pure-Ruby version.

ruby / json

Add test for parsing broken strings and use String#encode instead of rb_str_conv_enc() in parser #665