oracle / truffleruby

A high performance implementation of the Ruby programming language, built on GraalVM.
https://www.graalvm.org/ruby/
Other
3k stars 183 forks source link

Bundler is slow on TruffleRuby #1398

Closed eregon closed 5 months ago

eregon commented 6 years ago

Currently running Bundler on TruffleRuby is very slow.

We need to profile, figure out the bottlenecks and optimize them so we get around the same speed as MRI when installing gems via Bundler.

gem install on TruffleRuby is also not fast for installing gems, there are probably common bottlenecks between the two.

logicminds commented 6 years ago

Running just the gem command by itself takes 2700 ms on my system under truffleruby.

By comparison the gem command from MRI takes 100 ms.

eregon commented 6 years ago

With 1.0.0-rc5, on my system, gem takes around 0.768s (vs 0.091s for MRI 2.5.1). I'm trying to investigate where the time is spent.

eregon commented 6 years ago

It seems the time of gem or gem help depends on the number of gems installed, so that might explain much larger time for you if you have many gems installed.

eregon commented 6 years ago

I extended the time metrics we have to detail how time is spent in require, and created a FlameGraph from that (click the image for the interactive version): time_metrics_flamegraph

Some interesting potentials for optimization there:

chrisseaton commented 6 years ago

Loading the C-ext support is long, due to parsing cext.rb

Didn't we used to have cext.rb as part of core? And in there it would now be parsed during image generation.

logicminds commented 6 years ago

Looks like the rc5 did improve the load time.

root@a8ae5af17715:/app# time gem
RubyGems is a sophisticated package manager for Ruby.  This is a
basic help message containing pointers to more information.

    (removed content)
    http://guides.rubygems.org

real    0m1.356s
user    0m1.760s
sys     0m0.220s

This is much improved over previous version. Note: I have 37 gems installed.

Let me know command or tool to use to produce the graph.

You can use my container image: docker run -ti --rm logicminds/truffleruby bash (although this have no gems installed) Note: 1.5Gb image size

eregon commented 6 years ago

Didn't we used to have cext.rb as part of core? And in there it would now be parsed during image generation.

@chrisseaton We did, but it was suboptimal on JVM. I think we can just load it during context pre-initialization with require, that should be easy.

Note that on JVM the flamegraph for gem help looks very different, and there is much more time spent loading ruby.su (444ms) and not much parsing cext.rb (5ms):

time_metrics_flamegraph

I'd guess this is due to having to classload Sulong for ruby.su, and OTOH having a warmed-up parser (and so avoiding to parse cext.rb might just move the problem around, but we should try anyway).

@logicminds Good to hear :) I'm currently integrating the generation of the flamegraph to master. Unfortunately, I needed some modifications to metrics so measuring with 1.0.0-rc5 won't work.

eregon commented 6 years ago

I tried require "truffle/cext" during pre-initialization, but it doesn't help (the long step then becomes loading ruby.su): https://cdn.rawgit.com/eregon/243fa85e06c1597d165610b9c553e8e4/raw/8ed12db2c474566feba63e0d87b8141cc1999acc/time_metrics_flamegraph.svg

So this reveals the metrics above don't visualize GC, which happens right during loading C-ext support:

$ ruby --native.XX:+PrintGC --native.Dtruffleruby.metrics.time=true -Xmetrics.time.parsing.file=true -Xmetrics.time.require=true -S gem help
...
before-load-cext-support 1533485020198
before-load-cext-ruby.su 1533485020199
[Incremental GC (CollectOnAllocation.Sometimes) 263853K->44119K, 0.1217002 secs]
after-load-cext-ruby.su 1533485020370
after-load-cext-support 1533485020371
...
after-require-rubygems/commands/help_command 1533485020603
[Full GC (CollectOnAllocation.Sometimes) 305781K->70088K, 0.2343588 secs]
RubyGems is a sophisticated package manager for Ruby.  This is a
basic help message containing pointers to more information.
...

We need to investigate why the GC takes so long here.

Making the young generation large enough to avoid GCs reduces the time quite a bit:

$ time bin/native-ruby -S gem help
1.26s user 0.10s system 133% cpu 1.024 total

$ time bin/native-ruby --native.XX:+PrintGC --native.Xmn1g -S gem help
0.94s user 0.15s system 148% cpu 0.732 total
eregon commented 6 years ago

And here is a flamegraph with no GC:

TRUFFLERUBY_METRICS_REPS=1 TRUFFLERUBY_METRICS_MIN_TIME=0.003 jt metrics time --native --flamegraph --native.Xmn1g -Xmetrics.time.parsing.file=true -Xmetrics.time.require=true -S gem help
Updated command:
TRUFFLERUBY_METRICS_REPS=1 jt -u native metrics time --flamegraph --vm.Xmn1g --experimental-options --metrics-time-parsing-file --metrics-time-require -S gem help

time_metrics_flamegraph1

eregon commented 6 years ago

I tweaked the script to also include GC in the FlameGraph. Furthermore, I generated a FlameChart, a variant where the horizontal axis is time, so this shows what happens over time (left to right) instead of sorting alphabetically. screenshot from 2018-08-06 21-49-34

Here we can search for GC and see it takes 32.1% of the time, which we should be able to improve upon.

logicminds commented 6 years ago

What GC are you using? G1?

eregon commented 6 years ago

@logicminds It's the GC of SubstateVM, not a JVM GC. It is not yet as efficient as JVM GCs.

logicminds commented 6 years ago

btw, the flame graph looks like a ship.

eregon commented 6 years ago

I filed an issue for the SubstrateVM team regarding the slow GC here (tracked internally as GR-11184).

thbar commented 5 years ago

I did more testing on TravisCI. bundle install took 239 secondes there on a simple case. I seem to recall "Fetching gem metadata" only appeared after a long while.

Logs available at https://travis-ci.org/thbar/kiba/jobs/424857239

eregon commented 5 years ago

@thbar Thanks, that's a small Gemfile and should make it easy to reproduce.

thbar commented 5 years ago

@eregon you welcome! Yes (this is why I added my comment), and CI instances are great "constrained resources" testbeds :smile:.

eregon commented 5 years ago

Some news on this front: running bundle install -j1 for a small Gemfile in TravisCI with some profiling options reveals one pretty obvious bottleneck:

screenshot from 2018-09-06 10-49-27

So we'll look at optimizing rb_str_set_len now.

deepj commented 5 years ago

@eregon here one experience https://github.com/jeremyevans/sequel/pull/1576#issuecomment-442615481. Due to this experience running Sequel specs on TruffleRuby on Travis CI has been removed.

deepj commented 5 years ago

@eregon @chrisseaton @nirvdrum If I could give an advice, improving performance of rubygems and bundler would be a high priority. After experiences with https://github.com/jeremyevans/sequel/pull/1576#issuecomment-442615481 and now with a try to run rom specs on Travis under TruffleRuby (it took over 21 minutes before running a first spec). See https://travis-ci.org/rom-rb/rom/jobs/462045524

This seems to be unacceptable for running tests/specs of various gems on the current underlying OSS CI infrastructure.

I've decided for now to stop adding TruffleRuby to Travis matrix of my focused gems due to the above mentioned issues.

My advice here is to focus on acceptable performance of bundler and rubygems since they are a underlying basis of everything in Ruby ecosystems.

eregon commented 5 years ago

I looked at this again, and while locally bundle install takes around 30s, it takes around 300s in TravisCI. The reason seems to be that downloading the Bundler index is very slow, an this can be reproduced locally with:

git clone https://github.com/ruby/mspec.git
cd mspec

rm -rf ~/.bundle vendor
bundle install --verbose -j1 --deployment

Profiling that shows a similar picture as above: rb_str_set_len called from rb_gzreader_read seems to be really slow, and I'll dig into that.

Also, I found out that Bundler seems to request information about many more gems than what seems needed: https://github.com/bundler/bundler/issues/6914

eregon commented 5 years ago

What causes rb_str_set_len to be called so many times is that Bundler downloads the list of all gem versions, every time in TravisCI (because it's not cached under ~/.bundle): https://github.com/bundler/bundler/issues/6915

eregon commented 5 years ago

I optimized rb_str_set_len in 70b870069ff95bc515812f1958571a68d452e027, which should make bundle install significantly faster in CI, because it is called thousands of times, in the inner loop of zlib decoding. Locally, I could measure a reduction from around 3min40s to 30-40s.

deepj commented 5 years ago

@eregon I can’t confirm so speed up of ‘bundle install`. Honestly, the speed is almost same on my machine like in the previous RC11.

eregon commented 5 years ago

@deepj Yes, it's not expected to be really faster locally yet, but on CI (at least TravisCI), where the list of all gem versions is downloaded every time, there should be a noticeable gain, particularly for smaller Gemfile.lock.

deepj commented 5 years ago

I always tried it from the scratch (all gems was removed before it)

mojavelinux commented 5 years ago

@deepj Same here. It hasn't really gotten any better.

Another part of the problem seems to be that RC12 takes a few tries to successfully install all the gems. And I am not talking about large applications here. I mostly work on text-processing utilities and even those take a very long time for Bundler to set up under truffleruby.

For example, kramdown-asciidoc has about 50 gem dependencies for development. Starting from scratch (which, to be honest, is quite common with Bundler), it fails the first time after 2m15s. It then succeeds the second time after 45s. So we're taking almost 3 minutes to get gems installed. (Compare that to 5s for CRuby). That's going to be a big barrier to adoption.

deepj commented 5 years ago

@eregon even bundle update is extremely slow. All gems are "cached"

$ time bundle update concurrent-ruby
Fetching gem metadata from https://rubygems.org/.........
Fetching gem metadata from https://rubygems.org/..........
Fetching gem metadata from https://rubygems.org/..........
Resolving dependencies..................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Bundler could not find compatible versions for gem "concurrent-ruby":
  In Gemfile:
    rails was resolved to 5.2.2, which depends on
      activesupport (= 5.2.2) was resolved to 5.2.2, which depends on
        concurrent-ruby (>= 1.1)

    backburner was resolved to 1.4.1, which depends on
      concurrent-ruby (~> 1.0.1)
      256.45 real       467.35 user        66.21 sys

Ignore what I tried but I'd used Rails' Gemfile

deepj commented 5 years ago

@eregon just for an interest

https://travis-ci.org/dry-rb/dry-system/jobs/503018338

$ gem update --system 101.00s $ bundle install --without tools --path=${BUNDLE_PATH:-vendor/bundle} 216.09s

Very very slow :(

chrisseaton commented 5 years ago

It is very slow at the moment, yes. We've focused on peak performance and compatibility and not worked on cold performance in the same way. We did tackle the absolute startup time issue, but after those first hundred milliseconds it's then a different problem.

I don't think we're likely to get a fix for bundle performance in the short term, I'm afraid. We do know it's frustrating, but it's not as much of an absolute blocker as other issues.

ThomasKoppensteiner commented 4 years ago

Can someone please give an update on this issue?

chrisseaton commented 4 years ago

I can provide an update of how I see Bundler performance. For a production application at Shopify, bundle install takes 270s (wall-clock) on MRI, and 620s on TruffleRuby in the native configuration. That's 3x slower which is not great, I agree, but I don't see the prohibitive performance that we had in the past. And - more importantly for me - bundle install now works on far more gems. Do you see something else using Bundler?

mojavelinux commented 4 years ago

The 3x slower number is not universal. The dependencies for the Asciidoctor gem take considerably longer to install. 13s for CRuby vs ~300s for TruffleRuby. You can track the difference in this CI job: https://travis-ci.org/asciidoctor/asciidoctor/jobs/623359268

ThomasKoppensteiner commented 4 years ago

I'm also using Travis-CI for testing. Without caching bundle install --jobs=3 --retry=3 takes for

             2.4.9:  48.98s
             2.5.7:  40.88s
             2.6.5:  50.48s
    jruby-9.1.17.0:  40.07s
     jruby-9.2.9.0:  40.22s
truffleruby-19.3.0: 543.38s

TruffleRuby is in this case at least 10x slower than other ruby versions.

CI job: https://travis-ci.org/ThomasKoppensteiner/sidekiq-undertaker/builds/625889922

@mojavelinux 13s sounds fast for me. Did you enable caching?

eregon commented 4 years ago

Thanks for the extra data points, we definitely want to improve the situation there.

It seems the relative slowdown varies a lot depending on the number of gems using C extensions. Gems with C extensions seem to install roughly as fast on TruffleRuby as most of the time is spent compiling the C extension. On the other hand we still seem to be much slower for installing pure-Ruby gems, and for fetching/downloading gems.

Caching JIT'ed code seems a general way to address startup/warmup issues (#1835). However I think there are still several bottlenecks we could improve on significantly for bundle install independently.

eregon commented 4 years ago

An update for this: I optimized bundle install on a small Gemfile with Gemfile.lock to take 19s instead of 84s previously in 50a76f32b8187207625e0e6014816e97eafd0c17, that's a nice 4.42x speedup.

Many optimizations there, including optimizing RSTRING_PTR() accesses, using Mode=latency to make better use of compilation for this kind of workload and tweaking the GC heap size with Native Image.

Most of the execution during that time happens in the interpreter and not compiled code, so we'll want to investigate if we can do something about that (e.g., if it deoptimizes where it shouldn't need to).

That's still slower than MRI but I think it's significant progress and it fixes the regression we had which was mostly due to slower RSTRING_PTR() accesses.

chrisseaton commented 4 years ago

Thanks for this important work, @eregon. This halves time to bundle install a major production application at Shopify.

You can see in this trace (blue is before, green is after) that we're compiling more methods more quickly now, with fewer cancelled methods. More methods compiled may be due to more deoptimisations. Less methods are cancelled, and we do now manage to drain the compilation queue before the application finishes!

We also now reach a steady state of optimisation 'temperature'.

It does introduce a memory spike from around 1 GB up to around 10 GB for a few seconds.

figure_1

Can you share anything about your methodology for making these changes?

eregon commented 4 years ago

That's a great view, thanks for sharing.

Re methodology I mostly used jt profile on native image builds, which produces a flamegraph. I used @aardvark179's work in progress for a flamegraph also showing an histogram and with coloring to identity Ruby/C methods and compiled/interpreted methods. Then I basically looked at the top self times in that histogram and figured what I could do to speed that method. I also used jt metrics time to get an idea of how much time is spent in GC and based on that tweaked the young generation size when running as a Native Image.

I found --engine.Mode=latency based on Graal.js' npm using it. The main change it does is disabling splitting and inlining, which for such a workload seems very beneficial to get more compiled faster.

If anyone wants to profile it more, here are the commands I used:

Between each run:
$ rm -rf gem_home 
and make sure to set GEM_HOME to never install globally (otherwise rebuild)

Run with GC info for SVM, bigger young gen and optimizing for latency:
$ GEM_HOME=$PWD/gem_home jt -u native ruby --vm.Xmn1g --vm.XX:+PrintGCSummary --experimental-options --engine.Mode=latency -S bundle install -j1

CPUProfile FlameGraph:
GEM_HOME=$PWD/gem_home jt -u native profile -S bundle install -j1

Metrics FlameGraph (visualizes GC pauses and require time):
$ TRUFFLERUBY_METRICS_REPS=1 GEM_HOME=$PWD/gem_home jt -u native metrics time --flamegraph --vm.Xmn1g --experimental-options --metrics-time-parsing-file --metrics-time-require -S bundle install -j1
petenorth commented 4 years ago

@chrisseaton did you see the halving of the bundle install times after merging @eregon changes i.e. RSTRING_PTR() optimizations and --engine.Mode=latency or just by updating the bundle executable to run with the --engine.Mode=latency ?

The reason I ask is that on 19.3.0.2 simply by updating the /opt/graalvm-ce-java8-19.3.0.2/bin/bundle to run as

exec "$(dirname $SELF_PATH)/ruby" "--experimental-options" "--engine.Mode=latency" "$SELF_PATH" "$@"

halves the bundle install time for a Rails 5.2.3 bundle install(gem 'rails', '~> 5.2.1')

It would be great if we could expect further speed up with 20.0.0 once the RSTRING_PTR() optimizations are included?

chrisseaton commented 4 years ago

My result is from merging with master a couple of days ago, so the latency mode and string pointer changes.

eregon commented 4 years ago

@petenorth the optimization/bug fix for RSTRING_PTR() will be in 20.0, however --engine.Mode=latency and other optimizations will not because the feature freeze for 20.0 was already some time ago and those optimizations do not fix compatibility bugs. We're getting very close to providing nightly builds which should make it easy to try the latest changes (#1483).

petenorth commented 4 years ago

@eregon OK, I can easily add them to 20.0 locally once released and then see how a bundle install for rails compares:

version modifications timing
19.3.0.2 6 mins 13 secs
19.3.0.2 latency mode 3 mins 40 secs
19.3.0.2 latency mode and heap size 2 mins 25 secs
MRI 2.5.6 54 secs
20.0 4 mins 53 secs
20.0 latency mode 3 mins 6 secs
20.0 latency mode and heap size 3 mins 14 secs
20.1.0-dev 2 mins 10 secs

I'd previously only included latency mode to see the halving, including the heap size change results in a reduction by almost a factor 3. The biggest contribution to MRI's time is when it has to install gems with native extensions, more so than for TruffleRuby.

Generally on the nightly truffleruby head builds running via a GitHub workflow I see c. 1 min 50 secs compared to MRI taking 50 secs.

petenorth commented 4 years ago

@petenorth the optimization/bug fix for RSTRING_PTR() will be in 20.0, however --engine.Mode=latency and other optimizations will not because the feature freeze for 20.0 was already some time ago and those optimizations do not fix compatibility bugs. We're getting very close to providing nightly builds which should make it easy to try the latest changes (#1483).

Do you mean 20.1 for both as per the change log ?

eregon commented 4 years ago

@petenorth The RSTRING_PTR() optimization/bug fix was backported to the 20.0 release branch. The rest not as it's too big/risky to backport.

deepj commented 4 years ago

@eregon I noticed bundler/rubygems generates a lot of exceptions

See https://gist.github.com/deepj/9a6dccfadd621780242a698274f7396a

This was generated using RUBYOPT=-d ruby bug.rb where bug.rb comes from #1895

eregon commented 4 years ago

@deepj Good catch! Could you run the same on MRI and post the output to compare?

deepj commented 4 years ago

@eregon much less exceptions https://gist.github.com/deepj/cb29e92d8eba43129ed107c04adf3d83

eregon commented 4 years ago

@chrisseaton Could you try again your bundle install with latest truffleruby master?

1893 has been fixed and I think that was causing the memory spike above.

eregon commented 4 years ago

@deepj Regarding exceptions, I extracted it to #1906.

chrisseaton commented 4 years ago

Heap spike has gone.

image