**LillianZ** opened this issue 3 years ago
I added `Benchmark.ips` to the end of this so we can see cold and warmup (maybe bips should print first-iteration...):
```ruby
require 'benchmark'
require 'benchmark/ips'

alphabet = "A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z".split(", ")

symalphabet = [:AA, :AB, :AC, :AD, :AE, :AF, :AG, :AH, :AI, :AJ,
               :BA, :BB, :BC, :BD, :BE, :BF, :BG, :BH, :BI, :BJ,
               :CA, :CB, :CC, :CD, :CE, :CF, :CG, :CH, :CI, :CJ,
               :DA, :DB, :DC, :DD, :DE, :DF, :DG, :DH, :DI, :DJ,
               :EA, :EB, :EC, :ED, :EE, :EF, :EG, :EH, :EI, :EJ,
               :FA, :FB, :FC, :FD, :FE, :FF, :FG, :FH, :FI, :FJ,
               :GA, :GB, :GC, :GD, :GE, :GF, :GG, :GH, :GI, :GJ,
               :HA, :HB, :HC, :HD, :HE, :HF, :HG, :HH, :HI, :HJ,
               :IA, :IB, :IC, :ID, :IE, :IF, :IG, :IH, :II, :IJ,
               :JA, :JB, :JC, :JD, :JE, :JF, :JG, :JH, :JI, :JJ]

bigh = Hash.new
alphabet.each do |letter1|
  alphabet.each do |letter2|
    h = Hash.new
    symalphabet.each do |symbol|
      h[symbol] = {:a => "String One", :b => "String Two", :c => nil}
    end
    bigh[letter1 + letter2] = h
  end
end

str = Marshal.dump(bigh)
puts Benchmark.measure { Marshal.load(str) }

Benchmark.ips do |x|
  x.warmup = 5
  x.report('Marshal.load') { Marshal.load(str) }
end
```
```
ruby 2.7.1p83 (2020-03-31 revision a0c7c23c9c) [x86_64-darwin19]

  0.069405   0.008819   0.078224 (  0.078383)
Warming up --------------------------------------
        Marshal.load     1.000 i/100ms
Calculating -------------------------------------
        Marshal.load     17.458 (±11.5%) i/s -     87.000 in   5.033390s
```

```
truffleruby (Shopify) 20.3.0-dev-e574e0f3, like ruby 2.6.6, GraalVM CE Native [x86_64-darwin]

 10.043862   1.585444  11.629306 (  5.489669)
Warming up --------------------------------------
        Marshal.load     1.000 i/100ms
Calculating -------------------------------------
        Marshal.load      0.945 (± 0.0%) i/s -      5.000 in   5.452409s
```
|             | Cold       | Hot        |
|-------------|------------|------------|
| MRI         | 0.078383 s | 17.458 ips |
| TruffleRuby | 5.489669 s | 0.945 ips  |
So this is bad on both cold and hot performance. Thermometer confirms it is hot during measurement. What we really care about at the moment is the cold performance, though - we load a Marshal blob on startup and it's painfully slow.
The example Marshalled string has ~3M characters; the Marshalled string we load on startup has ~17M characters. It is indeed slow.
I made a variant of the script where the result of `Marshal.dump` is saved to a file. That way, we don't call `Marshal.dump` during the benchmark, which I guess is also the case for the application's startup?
https://gist.github.com/eregon/e523df1b607bd152a9eef980fd936b8c
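Something like this minimal sketch of the variant (the file name and placeholder data are illustrative; see the gist for the real version):

```ruby
require 'benchmark'

# Dump once to a file so the benchmark measures only deserialization.
unless File.exist?("marshal_blob.bin")
  data = { :a => "String One", :b => "String Two", :c => nil }
  File.binwrite("marshal_blob.bin", Marshal.dump(data))
end

blob = File.binread("marshal_blob.bin")
puts Benchmark.measure { Marshal.load(blob) }  # cold, first-load time
```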
I see first-iteration times of around 2 seconds, which is slow but probably not that noticeable during startup. It would be nice to have something around the size of the real data, so we can get a feeling of how long it takes and how much we gain on it when optimizing.
Profiling (see the gist) reveals most of the time is spent in `Marshal::State#construct`, which is a somewhat large method dispatching via a big `case` expression; second place is `Integer#===`, most likely from that big `case` expression.
One obvious thing here is we could AST-inline `Integer#===` to avoid extra boxing, call overhead in the interpreter, etc.
Then there are probably better ways to optimize that big `case` expression, since all the values are Integers, yet not contiguous. IIRC there is an optimization in Graal for turning `if e == SOME_INT1; elsif e == SOME_INT2; ...` into a Java switch; we should check if it triggers here.
Maybe a Ruby Array of Procs would be better for dispatch. That way we likely wouldn't inline anything over that call, but then we'd have a clear place where to stop and start inlining. We might want to treat special cases like true/false/nil specially.
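A minimal sketch of that dispatch idea, assuming a table indexed by the Marshal type byte (the names here are hypothetical, not TruffleRuby's actual code):

```ruby
# Dispatch table indexed by the Marshal type byte, replacing the big
# case expression; only a few tags shown ('0' = nil, 'T' = true,
# 'F' = false in the Marshal format).
CONSTRUCTORS = Array.new(256)
CONSTRUCTORS["0".ord] = -> (state) { nil }
CONSTRUCTORS["T".ord] = -> (state) { true }
CONSTRUCTORS["F".ord] = -> (state) { false }
# ... one entry per remaining type byte ...

def construct(state, type_byte)
  constructor = CONSTRUCTORS[type_byte]
  raise ArgumentError, "unknown type byte #{type_byte.chr}" unless constructor
  constructor.call(state)
end
```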
> That way, we don't call Marshal.dump during the benchmark, which I guess is also the case for the application's startup?

Yes, it comes from a file. There are some other steps involved - we'll open separate issues for those too.
A lot of time is spent in `Marshal::State#construct` for the Rails RSpec tests of an app I've been testing with TruffleRuby.
Sorry for my ignorance, but doesn't `Marshal::State#construct` have O(n) perf here? Would it be better to have some map-like implementation?
Yes, `case` in general is O(n) in Ruby because semantically it's just `if when1 === expr; elsif when2 === expr; ...`. But I think CRuby optimizes the all-`when SOME_INTEGER` case to use a Hash internally or so. A Hash has its own costs though.
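To illustrate the semantics (a generic example, not TruffleRuby's source):

```ruby
# This case expression ...
result = case byte
         when 48 then :nil_tag
         when 84 then :true_tag
         end

# ... is semantically equivalent to this chain of === calls,
# which is O(n) in the number of when clauses:
result = if 48 === byte
           :nil_tag
         elsif 84 === byte
           :true_tag
         end
```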
It would be cool to see if we can get that Graal optimization to kick in and transform it essentially into a Java switch, which uses various strategies depending on the `when` values. Yet most of the time in `construct` in the profiles is in the interpreter, so whatever we do needs to be efficient in the interpreter too.
> It would be cool to see if we can get that Graal optimization to kick in

But remember it's cold for quite a bit of this run.
Could the JRuby implementation be used? Java code like this would be AOT-compiled into the TruffleRuby native image executable, no? Or is this against an overarching design ethos?
Potentially we could rewrite Marshal in Java, that's a possibility. We cannot just "reuse" JRuby's implementation because JRuby uses different representations than TruffleRuby for basically every Ruby object/value.
We could also have only `construct` as a Java `@CoreMethod` if the goal is to use Java's `switch`. Java code is typically much harder/higher-effort to maintain than Ruby code, so it would need to be worth it. Java code is generally beneficial for startup/warmup performance, but it carries a high maintenance cost and is often extremely verbose.
So yes, that's one possibility, but I think we should explore a few other things first (notably AST-inlining `Integer#===`). If we end up optimizing Ruby's `case`, it will apply not only to Marshal but to many other `case` expressions in Ruby code.
As a side note, for this use case, if the marshalled objects are rather simple, it might be feasible to store them as part of pre-initialization when building the TruffleRuby Native Image. Then as soon as TruffleRuby starts up, the objects would already be in the heap, and there would be no deserialization at all. Speeding up Marshal.load is worth it on its own, but that sounds like an interesting optimization for application startup.
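Conceptually something like this hedged sketch (the module and file name are hypothetical; TruffleRuby's actual pre-initialization hooks are internal):

```ruby
# Hypothetical illustration of the pre-initialization idea: if this
# constant were initialized at Native Image build time, the deserialized
# object graph would be baked into the image heap, and startup would
# reference it directly with no Marshal.load at runtime.
module AppData
  PRELOADED = Marshal.load(File.binread("data.marshal"))
end
```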
It seems quite a bit faster after AST-inlining `Integer#===`: 686064726bed7b800f7698187b3db75a69c8bfbb
Before (06662f0c):

```
$ precise-time 10 `which ruby` bench_marshal_load.rb
2146 [1956-2356] ms
```

After (929d5279):

```
$ precise-time 10 `which ruby` bench_marshal_load.rb
1735 [1556-1901] ms
```

The format is `average [min-max]`.

If we use `--engine.MultiTier` we get down to 1 second (`1066 [1058-1076] ms`).
Sampling without MultiTier:

```
$ ruby --experimental-options --cpusampler --cpusampler.SampleInternal bench_marshal_load.rb
 Name                                   | Total Time   | Opt % || Self Time   | Opt % | Location
----------------------------------------------------------------------------------------------------------------------------------------------------
 Marshal::State#construct               | 1073ms 96.2% |  0.0% || 337ms 30.2% |  0.9% | resource:/truffleruby/core/marshal.rb~561-654:14615-17159
 Marshal::State#add_object              |  207ms 18.6% |  0.0% || 119ms 10.7% |  0.0% | resource:/truffleruby/core/marshal.rb~549-553:14384-14494
 block in Marshal::State#construct_hash | 1073ms 96.2% |  0.0% ||  98ms  8.8% | 24.5% | resource:/truffleruby/core/marshal.rb~760:19325-19356
 Integer#times                          | 1073ms 96.2% |  0.0% ||  73ms  6.5% | 35.6% | resource:/truffleruby/core/integer.rb~135-144:4106-4252
 Marshal::State#construct_string        |  193ms 17.3% | 18.7% ||  67ms  6.0% | 53.7% | resource:/truffleruby/core/marshal.rb~229-252:7053-7724
```
Sampling with MultiTier:

```
$ ruby --experimental-options --engine.MultiTier --cpusampler --cpusampler.SampleInternal bench_marshal_load.rb
 Name                                   | Total Time   | Opt % || Self Time   | Opt % | Location
----------------------------------------------------------------------------------------------------------------------------------------------------
 Marshal::State#construct               |  669ms 94.2% |  0.0% || 194ms 27.3% | 16.5% | resource:/truffleruby/core/marshal.rb~561-654:14615-17159
 Marshal::StringState#consume_byte      |   65ms  9.2% | 58.5% ||  53ms  7.5% | 54.7% | resource:/truffleruby/core/marshal.rb~1262-1267:31843-32023
 block in Marshal::State#construct_hash |  668ms 94.1% | 35.2% ||  49ms  6.9% | 38.8% | resource:/truffleruby/core/marshal.rb~760:19325-19356
 Integer#times                          |  668ms 94.1% |  0.0% ||  40ms  5.6% | 50.0% | resource:/truffleruby/core/integer.rb~135-144:4106-4252
 Marshal::State#get_byte_sequence       |   83ms 11.7% | 77.1% ||  35ms  4.9% | 82.9% | resource:/truffleruby/core/marshal.rb~945-948:23887-23969
```
I'm going to take a stab at this. Is the `symalphabet` example from above representative of the actual 17M workload, or does that have a more varied structure?
Thank you for looking into this!
Essentially that structure, except I wasn't very faithful to the size of the hashes, and sometimes instead of the `nil` in

```ruby
h[symbol] = {:a => "String One", :b => "String Two", :c => nil}
```

we have an array (of any size) of hashes of size 2, as in

```ruby
h[symbol] = {:a => "String One", :b => "String Two", :c => [{:a=>"StringA", :b=>"StringB"}, {:a=>"StringA", :b=>"StringB"}, ... ]}
```

So it's a Hash (of size 806) of Hash (of size 242) of Hash (of size 3) of an Array (of any size) of a Hash (of size 2). Or nil instead of the array.

I'm not sure if the key being a symbol or string makes a difference, but it's a string in the Hash of size 242, and a symbol otherwise. (So I actually use the string key in the wrong hash in my example.) Also, not sure if it makes a difference, but half the strings are unique, and the other half each occur 806 times.
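A hedged generator sketch matching that description (array sizes, string contents, and key names are illustrative, not the real data):

```ruby
# Hash(806, Symbol keys) of Hash(242, String keys) of Hash(3, Symbol keys),
# where :c is nil or an Array (of any size) of Hash(2).
outer = {}
806.times do |i|
  middle = {}
  242.times do |j|
    list = rand < 0.5 ? nil : Array.new(rand(1..10)) { { :a => "StringA", :b => "StringB" } }
    middle["key-#{i}-#{j}"] = { :a => "String One", :b => "String Two", :c => list }
  end
  outer[:"outer#{i}"] = middle
end

blob = Marshal.dump(outer)
puts "dump size: #{blob.bytesize} bytes"
```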
I was able to confirm that the `case` statement does indeed get optimized into a `switch` when compilation hits `State#construct`. The issue is that this never happens for a single `load` call, not even with a 17M dump.
One avenue is to improve interpreter performance; I will try to replace the `case` with a Java `switch` to see how that affects things. The second avenue is to make sure the method gets compiled. Here things may be negatively affected by splitting, as `construct` splits a lot. We do have the project to review splitting policies anyway.
Can you find how many times `construct` is called via `--cputracer`? It should be compiled rather early. The options `--engine.CompilerThreads=-1` and `--engine.MultiTier` might help if the compile queue is too busy. What's the total time for 1 load with both of these options on the 17MB file?
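For instance, an assumed invocation combining those flags with the benchmark script from earlier:

```
$ ruby --experimental-options --cputracer --engine.CompilerThreads=-1 --engine.MultiTier bench_marshal_load.rb
```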
Regarding splitting, can you try `never_split` on `construct` as an experiment? Finding out why it splits seems interesting too.
Well... only 500 million times or so :D And it is indeed compiled for 99.9% of calls.
However, there seems to be a snag: I get an extremely limited set of graph dumps, none of which include `construct` (confirmed via `--engine.TraceInlining`). Namely, I only get (with `jvm-ce`):

```
TruffleHotSpotCompilation-11356[Symbol#==].bgv
TruffleHotSpotCompilation-11486[IO::InternalBuffer#unused].bgv
TruffleHotSpotCompilation-13081[IO::InternalBuffer#size].bgv
TruffleHotSpotCompilation-13409[IO#ensure_open].bgv
TruffleHotSpotCompilation-13597[IO::InternalBuffer#empty?].bgv
TruffleHotSpotCompilation-13647[IO::InternalBuffer#reset!].bgv
TruffleHotSpotCompilation-13702[IO::InternalBuffer#exhausted?].bgv
TruffleHotSpotCompilation-13772[IO::InternalBuffer#fill_from].bgv
TruffleHotSpotCompilation-13847[Truffle::System.synchronized_<split-2fe6a77b>].bgv
TruffleHotSpotCompilation-13973[Truffle::POSIX.read_string_at_least_one_byte].bgv
```
Brandon ran the same command and obtained a much longer list of dumps. I'll try to find what's wrong.
I also conducted the experiment of replacing the `case` with a primitive that calls a `switch`:

|        | case | switch |
|--------|------|--------|
| jvm-ce | `24.677485 0.837709 25.515194 ( 6.278047)` | `21.656876 0.895608 22.552484 ( 4.990113)` |
| native | `14.380615 1.221126 15.601741 ( 8.914388)` | `13.642103 0.971698 14.613801 ( 8.198556)` |
The format is `user system total (real)`.

`switch` is consistently faster, and the reduction is consistently greater on `jvm-ce`. The `switch` implementation doesn't profile its branches except the default error case; this might actually be an advantage.

`Truffle::Graal::never_split` makes things faster, and erases the benefits of `switch`:
|        | case | switch |
|--------|------|--------|
| jvm-ce | `21.302149 0.852032 22.154181 ( 4.922792)` | `18.417463 0.732743 19.150206 ( 4.921437)` |
| native | `10.170745 0.783382 10.954127 ( 6.587971)` | `12.047677 0.732863 12.780540 ( 7.146432)` |
Updated version of the script to generate a ~17M dump that is closer to actual input: https://gist.github.com/norswap/665cba2929366ca6b6e4431d69931c8a (used in measurements above).
Other: disabling splitting altogether seems to result in a mild (~20%) slowdown.
Loading the Hash of Hash of Hash below takes 0.075s on MRI, 0.14s at TruffleRuby peak, ... and 3s the first time it is run (21x slower than warm). This is very unfortunate because in most applications we would only expect to use `Marshal.load` once, at the beginning.