ruby / json

JSON implementation for Ruby
https://ruby.github.io/json

JSON.dump: avoid redundant UTF-8 validation #595

Closed casperisfine closed 5 days ago

casperisfine commented 1 month ago

While profiling JSON.dump I noticed a large amount of time is spent validating UTF-8:

[Screenshot: profiling JSON.dump, 2024-09-02 10:25:23]

Given that we already call rb_enc_str_asciionly_p, we can know very cheaply whether the string is valid UTF-8 by checking its encoding and the coderange Ruby just computed, rather than scanning the bytes ourselves.

Also Ruby might have already computed that earlier.
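
For context, Ruby caches the result of that scan (the coderange) on the string itself, so once rb_enc_str_asciionly_p has run, validity can be read back without touching the bytes again. A tiny Ruby-level sketch of the idea (the actual change is in the C extension):

s = ("héllo" * 1_000).encode(Encoding::UTF_8)
s.ascii_only?     # => false; scans the bytes once and caches the coderange
s.valid_encoding? # => true; answered from the cached coderange, no rescan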

cc @hsbt

Earlopain commented 8 hours ago

:wave: Just wanted to let you know that I encountered some breakage from this PR:

Reduced (actual code involves networking somewhere):

require "stringio"
require "json"

foo = StringIO.new("".b)
foo << '{"foo":"♥"}'
str = foo.string

pp str.encoding # => #<Encoding:BINARY (ASCII-8BIT)>
pp str.valid_encoding? # => true
pp str.bytes # => [123, 34, 102, 111, 111, 34, 58, 34, 226, 153, 165, 34, 125]
JSON.generate(str)
# lib/json/common.rb:306:in 'JSON::Ext::Generator::State#generate': source sequence is illegal/malformed utf-8 (JSON::GeneratorError)

This no longer occurs since https://github.com/ruby/json/commit/c96351f87403809b9022f30750144f48efc8c697 (or maybe since https://github.com/ruby/json/commit/0819553144aefa422c6bc9d88a596886a7ed8e6b, but that commit doesn't build for me).

I'm not very familiar with this code, but here is what I found in this PR: https://github.com/ruby/json/pull/595/files#diff-2bb51be932dec14923f6eb515f24b1b593737f0d3f8e76eeecf58cff3052819fR206-R213

This all makes sense, I think. If I understand it correctly, the string would actually need to declare itself as UTF-8, even if its bytes already happen to be valid UTF-8.

Is this supposed to work (should a test be added?) or am I relying on unspecified behavior?

casperisfine commented 7 hours ago

or am I relying on unspecified behavior?

Well, you can always debate this, but you are passing a BINARY string to JSON.

Previously it would scan the string to figure out whether the binary data happened to be valid UTF-8; it no longer does. That made sense back in Ruby 1.8, but now that Ruby strings carry their own encoding, it no longer does.

Given the cost of checking for that, I think it's an acceptable regression.

When receiving data from the network, as you describe, you should make sure strings are tagged with the proper encoding at the boundary.
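
For your reduced example, the boundary fix could look something like this (assuming you know the payload is UTF-8; force_encoding only retags the bytes, it doesn't transcode them):

require "stringio"
require "json"

foo = StringIO.new("".b)
foo << '{"foo":"♥"}'

# Retag the bytes with their actual encoding before handing them to JSON:
str = foo.string.force_encoding(Encoding::UTF_8)
raise "unexpected payload encoding" unless str.valid_encoding?

JSON.generate(str) # no longer raises JSON::GeneratorError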

Earlopain commented 7 hours ago

I'm fine with this changing; the fix on my side would be simple. It just took some effort to figure out where the string was coming from.

Anyway, I wonder if your optimization is still in place. With https://github.com/ruby/json/commit/c96351f87403809b9022f30750144f48efc8c697 I no longer get the exception, even without changing my code. It seems to check every byte again here: https://github.com/ruby/json/blob/e2dd8341ae35ede9fb0689a9a1e7eb66ef942de6/ext/json/ext/generator/generator.c#L48

Edit:

I also found https://github.com/ruby/json/blob/e2dd8341ae35ede9fb0689a9a1e7eb66ef942de6/ext/json/ext/generator/generator.c#L760-L765

which includes binary. I suspect BINARY is the only encoding that behaves this way; the others seem to just be reencoded (?)
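
Rough illustration of what I mean, as far as I can tell (not verified against every version):

require "json"

latin = "héllo".encode(Encoding::ISO_8859_1)
JSON.generate([latin])  # non-UTF-8, non-binary: seems to be reencoded to UTF-8

binary = "♥".b
JSON.generate([binary]) # BINARY: treated as UTF-8-compatible and byte-checked instead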

casperisfine commented 7 hours ago

I wonder if your optimization is still in place.

Yes, I decided to merge the big PR that rewrites a lot of the code so we can restart from a clean base. I may have to redo this PR.

eregon commented 5 hours ago

FWIW, this is what the pure-Ruby backend does: https://github.com/ruby/json/blob/e2dd8341ae35ede9fb0689a9a1e7eb66ef942de6/lib/json/pure/generator.rb#L458

That will work on BINARY strings only if they are ascii_only?; otherwise it will raise.
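
For example (exact error class aside):

require "json"

JSON.generate(["abc".b]) # ASCII-only BINARY string: works
JSON.generate(["♥".b])   # non-ASCII BINARY string: raises in the pure backend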

@Earlopain rb_usascii_encoding() is not the BINARY encoding, it's the US-ASCII encoding.

casperisfine commented 5 hours ago

That will work on BINARY strings only if they are ascii_only?, otherwise it will raise.

That sounds sensible. I think it's what this PR was doing? Definitely worth adding a test for that.

Earlopain commented 5 hours ago

@eregon yes, my bad. I was confusing it with ASCII_8BIT. This makes more sense that way.