oracle / truffleruby

A high performance implementation of the Ruby programming language, built on GraalVM.
https://www.graalvm.org/ruby/

Data read with Zlib::GzipReader (but not copy of same data) slow to parse with JSON.parse #2193

Closed (billdueber closed this issue 2 months ago)

billdueber commented 3 years ago

(first issue opened here; if I'm off the mark or in the wrong spot, my apologies and please let me know)

JSON-parsing data read from a gzipped file with Zlib::GzipReader is very slow. Parsing the same byte-for-byte data read from a non-gzipped file is not slow. Parsing a copy of the data read from a gzipped file is also not slow.

This gist gives a small program that creates a bunch of arrays-of-arrays and dumps JSON output into two files, one plain and one gzipped. It then reads that data back in via File.read or Zlib::GzipReader, additionally making a copy of the formerly-gzipped data.

Note that I'm not parsing every time -- I'm just getting a copy of the data (as seen in this snippet from the gist):

plain_data = File.read(PLAIN)

previously_gzipped_data = Zlib::GzipReader.open(GZIP) do |f|
  f.read
end

forced_copy = previously_gzipped_data + " "

Oj doesn't show this same issue. I know benchmarking this stuff is treacherous, but the pattern seems pretty consistent.
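A self-contained, in-memory sketch of the same setup (my own reconstruction, nothing beyond stdlib; variable names match the gist snippet above) looks like:

```ruby
require "json"
require "zlib"
require "stringio"

# Build a payload like the gist's: JSON for an array of 1000 20-element arrays.
payload = JSON.generate(Array.new(1000) { Array.new(20) { rand(100) } })

# Round-trip it through gzip entirely in memory.
gzipped = StringIO.new
Zlib::GzipWriter.wrap(gzipped) { |gz| gz.write(payload) }

plain_data              = payload.dup
previously_gzipped_data = Zlib::GzipReader.new(StringIO.new(gzipped.string)).read
forced_copy             = previously_gzipped_data + " "  # copying restores speed

# The strings hold the same JSON, yet on TruffleRuby 20.3 parsing
# previously_gzipped_data was orders of magnitude slower than the other two.
raise unless JSON.parse(previously_gzipped_data) == JSON.parse(plain_data)
raise unless JSON.parse(forced_copy) == JSON.parse(plain_data)
```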


truffleruby 20.3.0, like ruby 2.6.6, GraalVM CE JVM [x86_64-darwin]
Testing with an array of 1000 20-element arrays

BEGIN STDLIB JSON
                           user     system      total        real

Plain                      6.761722   0.218166   6.979888 (  1.597190)
Previously gzipped        15.290809   1.759818  17.050627 (  5.812477)
Gzipped/forced copy        0.112019   0.052004   0.164023 (  0.069019)

Plain                      0.060666   0.040497   0.101163 (  0.061761)
Previously gzipped         4.385712   1.001380   5.387092 (  4.677956)
Gzipped/forced copy        0.135513   0.002502   0.138015 (  0.062301)

Plain                      0.126698   0.004074   0.130772 (  0.067788)
Previously gzipped         4.375435   0.766456   5.141891 (  4.427382)
Gzipped/forced copy        0.055923   0.000771   0.056694 (  0.057266)

BEGIN Oj
                           user     system      total        real
Plain                      8.310956   0.346334   8.657290 (  1.903706)
Previously gzipped         0.849619   0.016991   0.866610 (  0.225177)
Gzipped/forced copy        0.753260   0.026104   0.779364 (  0.228311)
Plain                      0.540689   0.011607   0.552296 (  0.206072)
Previously gzipped         0.417603   0.008346   0.425949 (  0.195774)
Gzipped/forced copy        0.396726   0.013893   0.410619 (  0.206867)
Plain                      0.400436   0.010877   0.411313 (  0.218068)
Previously gzipped         0.421178   0.009404   0.430582 (  0.204174)
Gzipped/forced copy        0.34264

My original data was just a file on disk, gzipped with gzip, so I don't think this has anything to do with Zlib::GzipWriter -- that's just there to make it a self-contained bug reproduction. I'm assuming GzipReader is hanging on to the string in some way that makes things nasty?

-Bill-

chrisseaton commented 3 years ago

Can you try putting Truffle::Ropes.flatten_rope(data) on your data after you get it from gzip and before you use it?

billdueber commented 3 years ago

Adding previously_gzipped_data = Truffle::Ropes.flatten_rope(previously_gzipped_data) does the trick (although it doesn't help me for the moment since the calls are locked up in a gem). Hopefully it'll help point you in the right direction.
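For scripts that must also run on CRuby, the workaround call can be guarded (a hedged sketch of my own; Truffle::Ropes is TruffleRuby-internal and absent on other implementations):

```ruby
require "json"

data = '{"answer": 42}'  # stand-in for the string read via Zlib::GzipReader

# Flatten only when the private TruffleRuby debug API is actually available;
# on CRuby/JRuby the guard is false and the string is left untouched.
if defined?(Truffle::Ropes) && Truffle::Ropes.respond_to?(:flatten_rope)
  data = Truffle::Ropes.flatten_rope(data)
end

parsed = JSON.parse(data)
```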

eregon commented 3 years ago

Thank you for the great report, it looks like pure-Ruby JSON parsing is slower with the String from Zlib::GzipReader.

Truffle::Ropes.flatten_rope() seems a good hint that this is an issue with Ropes. Of course, that's just helpful for debugging; it should work fine without any explicit call (Truffle::Ropes.flatten_rope is a private debug-like API, just like everything else under Truffle::).

Maybe it has many ConcatRopes and the read accesses in JSON don't force materializing the byte[]? Oj likely uses RSTRING_PTR(), which forces flattening the Rope. @norswap Could you investigate?

@billdueber To make sure the pure-Ruby version of json, from stdlib, is used, could you show gem list | grep json on TruffleRuby? (it should be json (default: 2.3.0)) If the json gem is installed, then it will be used as a C extension and preferred to the pure-Ruby version, which is unfortunately rather confusing but that's the semantics of default gems.
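The same check can be done from inside Ruby (a hedged sketch; `Gem::Specification#default_gem?` needs a reasonably recent RubyGems):

```ruby
# Resolve the json gem spec the runtime would use and report whether it is
# the bundled default gem or a user-installed one (which brings the C ext).
spec = Gem::Specification.find_by_name("json")
puts "#{spec.name} #{spec.version} (default gem: #{spec.default_gem?})"
```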

eregon commented 3 years ago

It's simpler than that, actually:

p Truffle::CExt.native_string?(previously_gzipped_data) # => true

forced_copy = previously_gzipped_data + " "
p Truffle::CExt.native_string?(forced_copy) # => false

json = JSON.parse(previously_gzipped_data)
p Truffle::CExt.native_string?(previously_gzipped_data) # => true

So the issue, when we use previously_gzipped_data, is that we use a NativeRope instead of a ManagedRope. I would expect the String to migrate from NativeRope to ManagedRope only once, but that doesn't seem to be the case, since it's still native even after JSON.parse(previously_gzipped_data).

eregon commented 3 years ago

Most likely involved, this force_encoding in JSON: https://github.com/oracle/truffleruby/blob/e020c89a8368ef2e1b996904f7becf65b375a47e/lib/json/lib/json/pure/parser.rb#L140 previously_gzipped_data.encoding is UTF-8 (maybe a bit surprising? But MRI behaves the same).

And force_encoding actually returns a NativeRope when the source is a NativeRope. The .encode above the force_encoding also returns a different String, so the original String is never modified and stays native.

Then this String is passed to StringScanner, and StringScanner#scan is used which will run Regexps against it. Regexps use a RopeNodes.BytesNode, which for a NativeRope will allocate a byte[] and copy from native memory every time, which seems very likely to be the performance issue. Maybe BytesNode should take a String and not a Rope, so it could transition the String to managed if it's native only once, and store the ManagedRope back in the String. Also BytesNode seems a good indication of "read-heavy workload coming", so it seems a good place to do that.
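A simplified illustration of that code path (a hypothetical sketch, not the actual parser; the real work happens in lib/json/lib/json/pure/parser.rb):

```ruby
require "strscan"

# The pure-Ruby parser ends up doing roughly this: force_encoding returns the
# same String object (so a native-backed String stays native), then a
# StringScanner repeatedly runs Regexps over it. With a NativeRope, each of
# those matches copied the bytes out of native memory again.
source  = '[1, 2]'.dup.force_encoding(Encoding::UTF_8)
scanner = StringScanner.new(source)

tokens = []
tokens << scanner.scan(/\[/)    # each scan runs a Regexp over the string's bytes
tokens << scanner.scan(/\d+/)
tokens << scanner.scan(/,\s*/)
tokens << scanner.scan(/\d+/)
tokens << scanner.scan(/\]/)
```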

billdueber commented 3 years ago

JSON gem (almost) just as you expected -- I apparently have 2.1.0

✦ ❯ gem list | grep json
json (default: 2.1.0)

eregon commented 3 years ago

Right, that's expected for TruffleRuby 20.3.0, thanks for confirming.

billdueber commented 1 year ago

Just pinging on this to see if it's on anyone's radar. I've updated the benchmarking code to explicitly assert that calling Truffle::Debug.flatten_string (nee Truffle::Ropes.flatten_rope) addresses the issue (it does, as expected).

Removed Oj because it makes the "x slower" comparison so hard to read, but it's about 10x faster than JSON and doesn't suffer from the gzip slowdown.

tl;dr is that on my hardware gzdata is about 30x slower on the smaller (array-of-100-arrays) json string, and about 400x slower on the larger json string.


truffleruby 22.3.0, like ruby 3.0.3, GraalVM CE Native [x86_64-darwin]
Base unit is an array of 20 integers

JSON-decode a string encoding an array of 100/1000 of those base units.

Warming up --------------------------------------
      JSON 100 plain    19.000  i/100ms
     JSON 100 gzdata     1.000  i/100ms
     JSON 1000 plain     2.000  i/100ms
    JSON 1000 gzdata     1.000  i/100ms
Calculating -------------------------------------
      JSON 100 plain    722.230  (±15.2%) i/s -      7.030k in  10.011312s
     JSON 100 gzdata     26.395  (±34.1%) i/s -    202.000  in  10.023596s
     JSON 1000 plain     73.287  (±13.6%) i/s -    716.000  in  10.003276s
    JSON 1000 gzdata      0.166  (± 0.0%) i/s -      2.000  in  12.601048s

Comparison:
      JSON 100 plain:      722.2 i/s
     JSON 1000 plain:       73.3 i/s - 9.85x  (± 0.00) slower
     JSON 100 gzdata:       26.4 i/s - 27.36x  (± 0.00) slower
    JSON 1000 gzdata:        0.2 i/s - 4355.93x  (± 0.00) slower
eregon commented 1 year ago

Thank you, we should indeed look at this again, especially given the migration to TruffleString, which should help with this. Could you share your new benchmark file? The gist seems to contain the older benchmark (e.g. using Oj and not using benchmark/ips), but the newer benchmark seems much simpler and uses benchmark/ips, so I'd gladly use that to investigate. The older benchmark file might still be needed to generate the files, though.

billdueber commented 1 year ago

I threw up another gist with a less-organically-written benchmark script that strictly compares one set of plain data vs one set of gzipped data. https://gist.github.com/billdueber/87b4f100c4a2d5d1c756470e09615857

Sample output:

✦ ❯ ruby slow_gzip_bench.rb 1000 10 10

truffleruby 22.3.0, like ruby 3.0.3, GraalVM CE Native [x86_64-darwin]
Base unit is an array of 20 integers
Rows: 1000; Warmup seconds: 10; run test for 10 seconds

Warming up --------------------------------------
               plain     7.000  i/100ms
              gzdata     1.000  i/100ms
Calculating -------------------------------------
               plain     92.525  (±10.8%) i/s -    924.000  in  10.134733s
              gzdata      0.176  (± 0.0%) i/s -      2.000  in  11.384485s

Comparison:
               plain:       92.5 i/s
              gzdata:        0.2 i/s - 526.66x  (± 0.00) slower
eregon commented 1 year ago

Thank you, I can reproduce it with truffleruby-dev. And adding:

if IS_TRUFFLE
  FILEDATA["gzdata"] = Truffle::Debug.flatten_string(FILEDATA["gzdata"])
end

makes them both the same speed.

billdueber commented 3 months ago

Just checking in again every year-and-a-half or so to see if there's any progress or a "won't-fix" on this. Still using the simple benchmark from https://gist.github.com/billdueber/87b4f100c4a2d5d1c756470e09615857

truffleruby 24.0.1, like ruby 3.2.2, Oracle GraalVM JVM [aarch64-darwin]
Base unit is an array of 20 integers
Rows: 300; Warmup seconds: 5; run test for 5 seconds

Warming up --------------------------------------
               plain    96.000 i/100ms
              gzdata     1.000 i/100ms
Calculating -------------------------------------
               plain      1.174k (± 4.5%) i/s -      5.856k in   4.999764s
              gzdata     15.517 (±12.9%) i/s -     77.000 in   5.026993s

Comparison:
               plain:     1174.2 i/s
              gzdata:       15.5 i/s - 75.67x  slower

Thanks!

eregon commented 3 months ago

One workaround/solution for this is to use the json gem's native extension, e.g. via gem install json or by including json in the Gemfile. I'm using the latest JVM EA build with ruby-build truffleruby+graalvm-dev.

$ ruby slow_gzip_bench.rb
Base unit is an array of 20 integers
Rows: 300; Warmup seconds: 5; run test for 5 seconds

truffleruby 24.2.0-dev-7484fe29, like ruby 3.2.4, Oracle GraalVM JVM [x86_64-linux]
Warming up --------------------------------------
               plain    29.000 i/100ms
              gzdata     1.000 i/100ms
Calculating -------------------------------------
               plain    505.279 (± 6.5%) i/s    (1.98 ms/i) -      2.523k in   5.032136s
              gzdata      8.888 (±22.5%) i/s  (112.51 ms/i) -     42.000 in   5.024406s

Comparison:
               plain:      505.3 i/s
              gzdata:        8.9 i/s - 56.85x  slower

So this performance issue still exists, although these days it seems to be due to the SwitchEncodingNode in MatchInRegionTRegexNode (vs BytesNode previously).

After gem install json:

$ ruby slow_gzip_bench.rb                       
Base unit is an array of 20 integers
Rows: 300; Warmup seconds: 5; run test for 5 seconds

truffleruby 24.2.0-dev-7484fe29, like ruby 3.2.4, Oracle GraalVM JVM [x86_64-linux]
Warming up --------------------------------------
               plain    10.000 i/100ms
              gzdata    10.000 i/100ms
Calculating -------------------------------------
               plain    107.540 (± 0.9%) i/s    (9.30 ms/i) -    540.000 in   5.021937s
              gzdata    106.373 (± 2.8%) i/s    (9.40 ms/i) -    540.000 in   5.080027s

Comparison:
               plain:      107.5 i/s
              gzdata:      106.4 i/s - same-ish: difference falls within error

Same speed for plain & gzdata, but much slower than the pure-Ruby JSON.
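As an aside, which parser implementation got loaded can be checked from Ruby itself (JSON.parser is part of the json gem's public API):

```ruby
require "json"

# JSON.parser names the parser class in use: JSON::Ext::Parser when the
# C extension is loaded, JSON::Pure::Parser for the pure-Ruby fallback.
puts JSON.parser
```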

But we can also use the new --cexts-panama option to run C extensions faster:

(without installing the json gem beforehand)
$ ruby --experimental-options --cexts-panama slow_gzip_bench.rb
Base unit is an array of 20 integers
Rows: 300; Warmup seconds: 5; run test for 5 seconds

truffleruby 24.2.0-dev-7484fe29, like ruby 3.2.4, Oracle GraalVM JVM [x86_64-linux]
Warming up --------------------------------------
               plain    31.000 i/100ms
              gzdata    31.000 i/100ms
Calculating -------------------------------------
               plain    319.096 (± 1.6%) i/s    (3.13 ms/i) -      1.612k in   5.052888s
              gzdata    321.968 (± 0.9%) i/s    (3.11 ms/i) -      1.612k in   5.007179s

Comparison:
              gzdata:      322.0 i/s
               plain:      319.1 i/s - same-ish: difference falls within error

Same speed for plain & gzdata, and close to the pure-Ruby JSON.

In general we are working on improving JSON speed as it's been identified as significantly slower than it could be (cc @andrykonchin).

eregon commented 3 months ago

Pure-Ruby JSON and with a fix to avoid repeated native->managed conversions:

$ jt -u jvm-ee ruby slow_gzip_bench.rb
Base unit is an array of 20 integers
Rows: 300; Warmup seconds: 5; run test for 5 seconds

truffleruby 24.2.0-dev-001cccc2*, like ruby 3.2.4, Oracle GraalVM JVM [x86_64-linux]
Warming up --------------------------------------
               plain    17.000 i/100ms
              gzdata    49.000 i/100ms
Calculating -------------------------------------
               plain    488.617 (± 9.6%) i/s    (2.05 ms/i) -      2.414k in   5.027094s
              gzdata    484.828 (±10.9%) i/s    (2.06 ms/i) -      2.401k in   5.066445s

Comparison:
               plain:      488.6 i/s
              gzdata:      484.8 i/s - same-ish: difference falls within error

So that solves this issue; I will put up a PR for it.