oracle / truffleruby

A high performance implementation of the Ruby programming language, built on GraalVM.
https://www.graalvm.org/ruby/

Marshal.load of Hashes slow, especially glacial when vm cold #2085

Open LillianZ opened 3 years ago

LillianZ commented 3 years ago

Loading the Hash of Hash of Hash below takes 0.075s on MRI, 0.14s at TruffleRuby peak, ... and 3s the first time it is run (21x slower than warm).

This is very unfortunate, because in most applications we would expect to use Marshal.load only once, at the beginning.

require 'benchmark'
alphabet = "A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z".split(", ")
symalphabet = [:AA, :AB, :AC, :AD, :AE, :AF, :AG, :AH, :AI, :AJ, 
 :BA, :BB, :BC, :BD, :BE, :BF, :BG, :BH, :BI, :BJ, 
 :CA, :CB, :CC, :CD, :CE, :CF, :CG, :CH, :CI, :CJ, 
 :DA, :DB, :DC, :DD, :DE, :DF, :DG, :DH, :DI, :DJ, 
 :EA, :EB, :EC, :ED, :EE, :EF, :EG, :EH, :EI, :EJ, 
 :FA, :FB, :FC, :FD, :FE, :FF, :FG, :FH, :FI, :FJ, 
 :GA, :GB, :GC, :GD, :GE, :GF, :GG, :GH, :GI, :GJ, 
 :HA, :HB, :HC, :HD, :HE, :HF, :HG, :HH, :HI, :HJ, 
 :IA, :IB, :IC, :ID, :IE, :IF, :IG, :IH, :II, :IJ, 
 :JA, :JB, :JC, :JD, :JE, :JF, :JG, :JH, :JI, :JJ, 
]
bigh = Hash.new
alphabet.each do |letter1|
  alphabet.each do |letter2|
    h = Hash.new
    symalphabet.each do |symbol|
      h[symbol] = {:a => "String One", :b => "String Two", :c => nil}
    end
    bigh[letter1+letter2] = h
  end
end
str = Marshal.dump(bigh)
puts Benchmark.measure {Marshal.load(str)}
chrisseaton commented 3 years ago

I added Benchmark.ips to the end of this so we can see both cold and warmed-up performance (maybe bips should print the first iteration...)

require 'benchmark'
require 'benchmark/ips'
alphabet = "A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z".split(", ")
symalphabet = [:AA, :AB, :AC, :AD, :AE, :AF, :AG, :AH, :AI, :AJ, 
 :BA, :BB, :BC, :BD, :BE, :BF, :BG, :BH, :BI, :BJ, 
 :CA, :CB, :CC, :CD, :CE, :CF, :CG, :CH, :CI, :CJ, 
 :DA, :DB, :DC, :DD, :DE, :DF, :DG, :DH, :DI, :DJ, 
 :EA, :EB, :EC, :ED, :EE, :EF, :EG, :EH, :EI, :EJ, 
 :FA, :FB, :FC, :FD, :FE, :FF, :FG, :FH, :FI, :FJ, 
 :GA, :GB, :GC, :GD, :GE, :GF, :GG, :GH, :GI, :GJ, 
 :HA, :HB, :HC, :HD, :HE, :HF, :HG, :HH, :HI, :HJ, 
 :IA, :IB, :IC, :ID, :IE, :IF, :IG, :IH, :II, :IJ, 
 :JA, :JB, :JC, :JD, :JE, :JF, :JG, :JH, :JI, :JJ, 
]
bigh = Hash.new
alphabet.each do |letter1|
  alphabet.each do |letter2|
    h = Hash.new
    symalphabet.each do |symbol|
      h[symbol] = {:a => "String One", :b => "String Two", :c => nil}
    end
    bigh[letter1+letter2] = h
  end
end
str = Marshal.dump(bigh)
puts Benchmark.measure {Marshal.load(str)}

Benchmark.ips do |x|
  x.warmup = 5
  x.report('Marshal.load') { Marshal.load(str) }
end
ruby 2.7.1p83 (2020-03-31 revision a0c7c23c9c) [x86_64-darwin19]
  0.069405   0.008819   0.078224 (  0.078383)
Warming up --------------------------------------
        Marshal.load     1.000  i/100ms
Calculating -------------------------------------
        Marshal.load     17.458  (±11.5%) i/s -     87.000  in   5.033390s
truffleruby (Shopify) 20.3.0-dev-e574e0f3, like ruby 2.6.6, GraalVM CE Native [x86_64-darwin]
 10.043862   1.585444  11.629306 (  5.489669)
Warming up --------------------------------------
        Marshal.load     1.000  i/100ms
Calculating -------------------------------------
        Marshal.load      0.945  (± 0.0%) i/s -      5.000  in   5.452409s
             Cold        Hot
MRI          0.078383s   17.458 i/s
TruffleRuby  5.489669s   0.945 i/s

So this is bad on both cold performance and hot performance. The thermometer confirms we are hot during the measurement. What we really care about at the moment is the cold performance, though - we load a Marshal blob on startup and it's painfully slow.

LillianZ commented 3 years ago

The example Marshalled string has ~3M characters; the Marshalled string we load on startup has ~17M characters. It is indeed slow.

eregon commented 3 years ago

I made a variant of the script where the result of Marshal.dump is saved to a file. That way, we don't call Marshal.dump during the benchmark, which I guess is also the case for the application's startup? https://gist.github.com/eregon/e523df1b607bd152a9eef980fd936b8c
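Roughly, the split looks like this (a minimal sketch with a made-up file name; see the gist for the real script):

# dump_data.rb - run once, outside the benchmark:
File.binwrite('bigh.marshal', Marshal.dump(bigh))

# bench_marshal_load.rb - measures only the load:
require 'benchmark'
str = File.binread('bigh.marshal')
puts Benchmark.measure { Marshal.load(str) }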

I see first-iteration times of around 2 seconds, which is slow but probably not that noticeable during startup. It would be nice to have a workload of about the size of the real data, so we can get a feeling for how long it takes and how much we gain when optimizing.

Profiling (see the gist) reveals that most of the time is spent in Marshal::State#construct, which is a somewhat large method dispatching via a big case expression; in second place is Integer#===, most likely called by that big case expression.

One obvious thing here is that we could AST-inline Integer#=== to avoid extra boxing, call overhead in the interpreter, etc.

Then there are probably better ways to optimize that big case expression, since all the values are Integers - yet the values are not contiguous. IIRC there is an optimization in Graal for turning if e == SOME_INT1 else if e == SOME_INT2 else if ... into a Java switch; we should check if it triggers here. Maybe a Ruby Array of Procs would be better for dispatch: that way we likely wouldn't inline anything over that call, but we'd have a clear place where to stop and start inlining. We might want to treat special cases like true/false/nil specially.
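To illustrate, a hedged sketch of what the Array-of-Procs dispatch could look like (the tag values and handlers here are made up, not the real Marshal format or the actual marshal.rb code):

# Hypothetical tag bytes - illustrative values only:
TYPE_NIL   = 48
TYPE_TRUE  = 84
TYPE_FALSE = 70

# Dispatch table indexed by tag byte; each slot holds a Proc, nil for
# unknown tags. One bounded call site replaces the long if/elsif chain.
DISPATCH = Array.new(256)
DISPATCH[TYPE_NIL]   = -> { nil }
DISPATCH[TYPE_TRUE]  = -> { true }
DISPATCH[TYPE_FALSE] = -> { false }

def construct(tag)
  handler = DISPATCH[tag]
  raise ArgumentError, "unknown tag #{tag}" unless handler
  handler.call
end

p construct(TYPE_TRUE)  # => true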

chrisseaton commented 3 years ago

That way, we don't call Marshal.dump during the benchmark, which I guess is also the case for the application's startup?

Yes, it comes from a file. There are some other steps involved - we'll open separate issues for those too.

petenorth commented 3 years ago

A lot of time is spent in Marshal::State#construct for the Rails RSpec tests of an app I've been testing with TruffleRuby.

gogainda commented 3 years ago

Sorry for my ignorance, but doesn't Marshal::State#construct have O(n) performance here? Would it be better to have some map-like implementation?

eregon commented 3 years ago

Yes, case in general is O(n) in Ruby, because semantically it's just if when1 === expr elsif when2 === expr elsif .... But I think CRuby optimizes the case where it's all when SOME_INTEGER to use a Hash internally, or something like that. A Hash has its own costs, though.
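Concretely, as a small self-contained example of those semantics:

tag = 34

# This case expression...
r1 = case tag
     when 48 then :nil_type
     when 34 then :string_type
     end

# ...means, semantically, a linear chain of #=== calls:
r2 = if 48 === tag
       :nil_type
     elsif 34 === tag
       :string_type
     end

p r1 == r2  # => true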

It would be cool to see if we can get that Graal optimization to kick in and transform it into essentially a Java switch, which uses various strategies depending on the when values. Yet most of the time in construct in the profiles is spent in the interpreter, so whatever we do needs to be efficient in the interpreter too.

chrisseaton commented 3 years ago

It would be cool to see if we can get that Graal optimization to kick in

But remember it's cold for quite a bit of this run.

petenorth commented 3 years ago

Could the JRuby implementation be used?

https://github.com/jruby/jruby/blob/de19365b3f95fbaacd3f45267bde14c09f469554/core/src/main/java/org/jruby/runtime/marshal/UnmarshalStream.java#L73

Java code like this would be AOT compiled into the TruffleRuby native image executable, no?

Or is this against an overarching design ethos?

eregon commented 3 years ago

Potentially we could rewrite Marshal in Java; that's a possibility. We cannot just "reuse" JRuby's implementation, because JRuby uses different representations than TruffleRuby for basically every Ruby object/value. We could also make only construct a Java @CoreMethod, if the goal is to use Java's switch. Java code is generally beneficial for startup/warmup performance, but it is typically much higher effort to maintain than Ruby code and often extremely verbose, so it would need to be worth it.

So yes, that's one possibility, but I think we should explore a few other things first (notably AST-inlining Integer#===). If we end up optimizing Ruby's case, it will apply not only to Marshal but to many other case expressions in Ruby code.

eregon commented 3 years ago

As a side note, for this use case, if the marshalled objects are rather simple, it might be feasible to store them as part of pre-initialization when building the TruffleRuby Native Image. Then, as soon as TruffleRuby starts up, the objects would already be in the heap and there would be no deserialization at all. Speeding up Marshal.load is worthwhile on its own, but that sounds like an interesting optimization for application startup.

eregon commented 3 years ago

It seems quite a bit faster after AST-inlining Integer#===: 686064726bed7b800f7698187b3db75a69c8bfbb

Before (06662f0c):

$ precise-time 10 `which ruby` bench_marshal_load.rb
2146 [1956-2356] ms

After (929d5279):

$ precise-time 10 `which ruby` bench_marshal_load.rb
1735 [1556-1901] ms

The format is average [min-max].

eregon commented 3 years ago

If we use --engine.MultiTier we get down to 1 second (1066 [1058-1076] ms). Sampling without MultiTier:

$ ruby --experimental-options --cpusampler --cpusampler.SampleInternal bench_marshal_load.rb
 Name                                                         |      Total Time     |  Opt % ||       Self Time     |  Opt % | Location             
----------------------------------------------------------------------------------------------------------------------------------------------------
 Marshal::State#construct                                     |       1073ms  96.2% |   0.0% ||        337ms  30.2% |   0.9% | resource:/truffleruby/core/marshal.rb~561-654:14615-17159 
 Marshal::State#add_object                                    |        207ms  18.6% |   0.0% ||        119ms  10.7% |   0.0% | resource:/truffleruby/core/marshal.rb~549-553:14384-14494 
 block in Marshal::State#construct_hash                       |       1073ms  96.2% |   0.0% ||         98ms   8.8% |  24.5% | resource:/truffleruby/core/marshal.rb~760:19325-19356 
 Integer#times                                                |       1073ms  96.2% |   0.0% ||         73ms   6.5% |  35.6% | resource:/truffleruby/core/integer.rb~135-144:4106-4252 
 Marshal::State#construct_string                              |        193ms  17.3% |  18.7% ||         67ms   6.0% |  53.7% | resource:/truffleruby/core/marshal.rb~229-252:7053-7724 

Sampling with MultiTier:

$ ruby --experimental-options --engine.MultiTier --cpusampler --cpusampler.SampleInternal bench_marshal_load.rb
 Name                                                         |      Total Time     |  Opt % ||       Self Time     |  Opt % | Location             
----------------------------------------------------------------------------------------------------------------------------------------------------
 Marshal::State#construct                                     |        669ms  94.2% |   0.0% ||        194ms  27.3% |  16.5% | resource:/truffleruby/core/marshal.rb~561-654:14615-17159 
 Marshal::StringState#consume_byte                            |         65ms   9.2% |  58.5% ||         53ms   7.5% |  54.7% | resource:/truffleruby/core/marshal.rb~1262-1267:31843-32023 
 block in Marshal::State#construct_hash                       |        668ms  94.1% |  35.2% ||         49ms   6.9% |  38.8% | resource:/truffleruby/core/marshal.rb~760:19325-19356 
 Integer#times                                                |        668ms  94.1% |   0.0% ||         40ms   5.6% |  50.0% | resource:/truffleruby/core/integer.rb~135-144:4106-4252 
 Marshal::State#get_byte_sequence                             |         83ms  11.7% |  77.1% ||         35ms   4.9% |  82.9% | resource:/truffleruby/core/marshal.rb~945-948:23887-23969 
norswap commented 3 years ago

I'm going to take a stab at this. Is the symalphabet example from above representative of the actual 17M workload, or does that have a more varied structure?

LillianZ commented 3 years ago

Thank you for looking into this!

Essentially that structure, except that I wasn't very faithful to the sizes of the hashes, and sometimes, instead of the nil in h[symbol] = {:a => "String One", :b => "String Two", :c => nil}, we have an array (of any size) of hashes of size 2, as in h[symbol] = {:a => "String One", :b => "String Two", :c => [{:a=>"StringA", :b=>"StringB"}, {:a=>"StringA", :b=>"StringB"}, ... ]}

So it's a Hash (of size 806) of Hash (of size 242) of Hash (of size 3) of an Array (of any size) of a Hash (of size 2) - or nil instead of the array.

I'm not sure if the key being a symbol or string makes a difference, but it's a string in the Hash of size 242 and a symbol otherwise. (So I actually use the string key in the wrong hash in my example.) Also, not sure if it makes a difference, but half the strings are unique, and the other half each occur 806 times.
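To make that concrete, a rough sketch of the shape with tiny illustrative sizes (the real counts are 806/242/3, with key types as I described above):

pair = { :a => "StringA", :b => "StringB" }      # Hash of size 2
leaf = { :a => "String One", :b => "String Two",
         :c => [pair, pair] }                    # Array of any size, or :c => nil
mid  = { "AA" => leaf, "AB" => leaf }            # string keys; really size 242
bigh = { :AA => mid, :AB => mid }                # symbol keys; really size 806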

norswap commented 3 years ago

I was able to confirm that the case statement does indeed get optimized into a switch once compilation hits State#construct. The issue is that this never happens within a single load call, not even with a 17M dump.

One avenue is to improve interpreter performance; I will try to replace the case with a Java switch to see how that affects things. The second avenue is to make sure the method gets compiled. Here things may be negatively affected by splitting, as construct splits a lot. We do have a project to review splitting policies anyway.

eregon commented 3 years ago

Can you find out how many times construct is called, via --cputracer? It should be compiled rather early. The options --engine.CompilerThreads=-1 and --engine.MultiTier might help if the compile queue is too busy. What's the total time for one load with both of these options on the 17MB file?
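For example, an invocation along these lines (illustrative; adjust the script name to your benchmark):

$ ruby --experimental-options --cputracer --engine.CompilerThreads=-1 --engine.MultiTier bench_marshal_load.rb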

Regarding splitting, can you try never_split on construct as an experiment? Finding out why it splits seems interesting too.
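Something like this might work for the experiment (I'm assuming Truffle::Graal.never_split takes a method object like its always_split counterpart - worth double-checking the exact signature in the sources):

Truffle::Graal.never_split(Marshal::State.instance_method(:construct))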

norswap commented 3 years ago

Well... only 500 million times or so :D And it is indeed compiled for 99.9% of calls.

However, there seems to be a snag: I get an extremely limited set of graph dumps, none of which include construct (confirmed via --engine.TraceInlining). Namely, I only get (with jvm-ce):

TruffleHotSpotCompilation-11356[Symbol#==].bgv
TruffleHotSpotCompilation-11486[IO::InternalBuffer#unused].bgv
TruffleHotSpotCompilation-13081[IO::InternalBuffer#size].bgv
TruffleHotSpotCompilation-13409[IO#ensure_open].bgv
TruffleHotSpotCompilation-13597[IO::InternalBuffer#empty?].bgv
TruffleHotSpotCompilation-13647[IO::InternalBuffer#reset!].bgv
TruffleHotSpotCompilation-13702[IO::InternalBuffer#exhausted?].bgv
TruffleHotSpotCompilation-13772[IO::InternalBuffer#fill_from].bgv
TruffleHotSpotCompilation-13847[Truffle::System.synchronized_<split-2fe6a77b>].bgv
TruffleHotSpotCompilation-13973[Truffle::POSIX.read_string_at_least_one_byte].bgv

Brandon ran the same command and obtained a much longer list of dumps. I'll try to find out what's wrong.

I also conducted the experiment of replacing the case with a primitive that calls a switch:

         case (user system total (real))             switch (user system total (real))
jvm-ce   24.677485 0.837709 25.515194 ( 6.278047)    21.656876 0.895608 22.552484 ( 4.990113)
native   14.380615 1.221126 15.601741 ( 8.914388)    13.642103 0.971698 14.613801 ( 8.198556)

Truffle::Graal::never_split makes things faster, and erases the benefits of switch:

         case (user system total (real))             switch (user system total (real))
jvm-ce   21.302149 0.852032 22.154181 ( 4.922792)    18.417463 0.732743 19.150206 ( 4.921437)
native   10.170745 0.783382 10.954127 ( 6.587971)    12.047677 0.732863 12.780540 ( 7.146432)
norswap commented 3 years ago

Updated version of the script to generate a ~17M dump that is closer to actual input: https://gist.github.com/norswap/665cba2929366ca6b6e4431d69931c8a (used in measurements above).

Another observation: disabling splitting altogether seems to result in a mild (~20%) slowdown.