Closed eregon closed 6 months ago
Running just the gem
command by itself takes 2700 ms on my system under truffleruby.
By comparison the gem command from MRI takes 100 ms.
With 1.0.0-rc5, on my system, gem
takes around 0.768s (vs 0.091s for MRI 2.5.1).
I'm trying to investigate where the time is spent.
It seems the time of gem
or gem help
depends on the number of gems installed, so that might explain much larger time for you if you have many gems installed.
I extended the time metrics we have to detail how time is spent in require
, and created a FlameGraph from that (click the image for the interactive version):
Some interesting potentials for optimization there:
cext.rb
(125ms) and loading ruby.su
(44ms).uri/common.rb
(when require-ing it) is quite long (35ms)--cpusampler
and create another FrameGraph at the level of method calls.Loading the C-ext support is long, due to parsing
cext.rb
Didn't we used to have cext.rb
as part of core? And in there it would now be parsed during image generation.
Looks like the rc5 did improve the load time.
root@a8ae5af17715:/app# time gem
RubyGems is a sophisticated package manager for Ruby. This is a
basic help message containing pointers to more information.
(removed content)
http://guides.rubygems.org
real 0m1.356s
user 0m1.760s
sys 0m0.220s
This is much improved over previous version. Note: I have 37 gems installed.
Let me know command or tool to use to produce the graph.
You can use my container image: docker run -ti --rm logicminds/truffleruby bash
(although this have no gems installed) Note: 1.5Gb image size
Didn't we used to have
cext.rb
as part of core? And in there it would now be parsed during image generation.
@chrisseaton We did, but it was suboptimal on JVM.
I think we can just load it during context pre-initialization with require
, that should be easy.
Note that on JVM the flamegraph for gem help
looks very different, and there is much more time spent loading ruby.su
(444ms) and not much parsing cext.rb
(5ms):
I'd guess this is due to having to classload Sulong for ruby.su
, and OTOH having a warmed-up parser (and so avoiding to parse cext.rb
might just move the problem around, but we should try anyway).
@logicminds Good to hear :)
I'm currently integrating the generation of the flamegraph to master
.
Unfortunately, I needed some modifications to metrics so measuring with 1.0.0-rc5 won't work.
I tried require "truffle/cext"
during pre-initialization, but it doesn't help (the long step then becomes loading ruby.su
): https://cdn.rawgit.com/eregon/243fa85e06c1597d165610b9c553e8e4/raw/8ed12db2c474566feba63e0d87b8141cc1999acc/time_metrics_flamegraph.svg
So this reveals the metrics above don't visualize GC, which happens right during loading C-ext support:
$ ruby --native.XX:+PrintGC --native.Dtruffleruby.metrics.time=true -Xmetrics.time.parsing.file=true -Xmetrics.time.require=true -S gem help
...
before-load-cext-support 1533485020198
before-load-cext-ruby.su 1533485020199
[Incremental GC (CollectOnAllocation.Sometimes) 263853K->44119K, 0.1217002 secs]
after-load-cext-ruby.su 1533485020370
after-load-cext-support 1533485020371
...
after-require-rubygems/commands/help_command 1533485020603
[Full GC (CollectOnAllocation.Sometimes) 305781K->70088K, 0.2343588 secs]
RubyGems is a sophisticated package manager for Ruby. This is a
basic help message containing pointers to more information.
...
We need to investigate why the GC takes so long here.
Making the young generation large enough to avoid GCs reduces the time quite a bit:
$ time bin/native-ruby -S gem help
1.26s user 0.10s system 133% cpu 1.024 total
$ time bin/native-ruby --native.XX:+PrintGC --native.Xmn1g -S gem help
0.94s user 0.15s system 148% cpu 0.732 total
And here is a flamegraph with no GC:
TRUFFLERUBY_METRICS_REPS=1 TRUFFLERUBY_METRICS_MIN_TIME=0.003 jt metrics time --native --flamegraph --native.Xmn1g -Xmetrics.time.parsing.file=true -Xmetrics.time.require=true -S gem help
Updated command:
TRUFFLERUBY_METRICS_REPS=1 jt -u native metrics time --flamegraph --vm.Xmn1g --experimental-options --metrics-time-parsing-file --metrics-time-require -S gem help
I tweaked the script to also include GC in the FlameGraph. Furthermore, I generated a FlameChart, a variant where the horizontal axis is time, so this shows what happens over time (left to right) instead of sorting alphabetically.
Here we can search for GC
and see it takes 32.1% of the time, which we should be able to improve upon.
What GC are you using? G1?
@logicminds It's the GC of SubstateVM, not a JVM GC. It is not yet as efficient as JVM GCs.
btw, the flame graph looks like a ship.
I filed an issue for the SubstrateVM team regarding the slow GC here (tracked internally as GR-11184).
I did more testing on TravisCI. bundle install
took 239 secondes there on a simple case. I seem to recall "Fetching gem metadata" only appeared after a long while.
Logs available at https://travis-ci.org/thbar/kiba/jobs/424857239
@thbar Thanks, that's a small Gemfile and should make it easy to reproduce.
@eregon you welcome! Yes (this is why I added my comment), and CI instances are great "constrained resources" testbeds :smile:.
Some news on this front: running bundle install -j1
for a small Gemfile in TravisCI with some profiling options reveals one pretty obvious bottleneck:
So we'll look at optimizing rb_str_set_len
now.
@eregon here one experience https://github.com/jeremyevans/sequel/pull/1576#issuecomment-442615481. Due to this experience running Sequel specs on TruffleRuby on Travis CI has been removed.
@eregon @chrisseaton @nirvdrum If I could give an advice, improving performance of rubygems
and bundler
would be a high priority. After experiences with https://github.com/jeremyevans/sequel/pull/1576#issuecomment-442615481 and now with a try to run rom
specs on Travis under TruffleRuby (it took over 21 minutes before running a first spec). See https://travis-ci.org/rom-rb/rom/jobs/462045524
This seems to be unacceptable for running tests/specs of various gems on the current underlying OSS CI infrastructure.
I've decided for now to stop adding TruffleRuby to Travis matrix of my focused gems due to the above mentioned issues.
My advice here is to focus on acceptable performance of bundler
and rubygems
since they are a underlying basis of everything in Ruby ecosystems.
I looked at this again, and while locally bundle install
takes around 30s, it takes around 300s in TravisCI.
The reason seems to be that downloading the Bundler index is very slow, an this can be reproduced locally with:
git clone https://github.com/ruby/mspec.git
cd mspec
rm -rf ~/.bundle vendor
bundle install --verbose -j1 --deployment
Profiling that shows a similar picture as above: rb_str_set_len
called from rb_gzreader_read
seems to be really slow, and I'll dig into that.
Also, I found out that Bundler seems to request information about many more gems than what seems needed: https://github.com/bundler/bundler/issues/6914
What causes rb_str_set_len
to be called so many times is that Bundler downloads the list of all gem versions, every time in TravisCI (because it's not cached under ~/.bundle
): https://github.com/bundler/bundler/issues/6915
I optimized rb_str_set_len
in 70b870069ff95bc515812f1958571a68d452e027, which should make bundle install
significantly faster in CI, because it is called thousands of times, in the inner loop of zlib
decoding.
Locally, I could measure a reduction from around 3min40s to 30-40s.
@eregon I can’t confirm so speed up of ‘bundle install`. Honestly, the speed is almost same on my machine like in the previous RC11.
@deepj Yes, it's not expected to be really faster locally yet, but on CI (at least TravisCI), where the list of all gem versions is downloaded every time, there should be a noticeable gain, particularly for smaller Gemfile.lock.
I always tried it from the scratch (all gems was removed before it)
@deepj Same here. It hasn't really gotten any better.
Another part of the problem seems to be that RC12 takes a few tries to successfully install all the gems. And I am not talking about large applications here. I mostly work on text-processing utilities and even those take a very long time for Bundler to set up under truffleruby.
For example, kramdown-asciidoc has about 50 gem dependencies for development. Starting from scratch (which, to be honest, is quite common with Bundler), it fails the first time after 2m15s. It then succeeds the second time after 45s. So we're taking almost 3 minutes to get gems installed. (Compare that to 5s for CRuby). That's going to be a big barrier to adoption.
@eregon even bundle update
is extremely slow. All gems are "cached"
$ time bundle update concurrent-ruby
Fetching gem metadata from https://rubygems.org/.........
Fetching gem metadata from https://rubygems.org/..........
Fetching gem metadata from https://rubygems.org/..........
Resolving dependencies..................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Bundler could not find compatible versions for gem "concurrent-ruby":
In Gemfile:
rails was resolved to 5.2.2, which depends on
activesupport (= 5.2.2) was resolved to 5.2.2, which depends on
concurrent-ruby (>= 1.1)
backburner was resolved to 1.4.1, which depends on
concurrent-ruby (~> 1.0.1)
256.45 real 467.35 user 66.21 sys
Ignore what I tried but I'd used Rails' Gemfile
@eregon just for an interest
https://travis-ci.org/dry-rb/dry-system/jobs/503018338
$ gem update --system 101.00s $ bundle install --without tools --path=${BUNDLE_PATH:-vendor/bundle} 216.09s
Very very slow :(
It is very slow at the moment, yes. We've focused on peak performance and compatibility and not worked on cold performance in the same way. We did tackle the absolute startup time issue, but after those first hundred milliseconds it's then a different problem.
I don't think we're likely to get a fix for bundle performance in the short term, I'm afraid. We do know it's frustrating, but it's not as much of an absolute blocker as other issues.
Can someone please give an update on this issue?
I can provide an update of how I see Bundler performance. For a production application at Shopify, bundle install
takes 270s (wall-clock) on MRI, and 620s on TruffleRuby in the native configuration. That's 3x slower which is not great, I agree, but I don't see the prohibitive performance that we had in the past. And - more importantly for me - bundle install
now works on far more gems. Do you see something else using Bundler?
The 3x slower number is not universal. The dependencies for the Asciidoctor gem take considerably longer to install. 13s for CRuby vs ~300s for TruffleRuby. You can track the difference in this CI job: https://travis-ci.org/asciidoctor/asciidoctor/jobs/623359268
I'm also using Travis-CI for testing. Without caching bundle install --jobs=3 --retry=3
takes for
2.4.9: 48.98s
2.5.7: 40.88s
2.6.5: 50.48s
jruby-9.1.17.0: 40.07s
jruby-9.2.9.0: 40.22s
truffleruby-19.3.0: 543.38s
TruffleRuby is in this case at least 10x slower than other ruby versions.
CI job: https://travis-ci.org/ThomasKoppensteiner/sidekiq-undertaker/builds/625889922
@mojavelinux 13s sounds fast for me. Did you enable caching?
Thanks for the extra data points, we definitely want to improve the situation there.
It seems the relative slowdown varies a lot depending on the number of gems using C extensions. Gems with C extensions seem to install roughly as fast on TruffleRuby as most of the time is spent compiling the C extension. On the other hand we still seem to be much slower for installing pure-Ruby gems, and for fetching/downloading gems.
Caching JIT'ed code seems a general way to address startup/warmup issues (#1835).
However I think there are still several bottlenecks we could improve on significantly for bundle install
independently.
An update for this:
I optimized bundle install
on a small Gemfile with Gemfile.lock to take 19s instead of 84s previously in 50a76f32b8187207625e0e6014816e97eafd0c17, that's a nice 4.42x speedup.
Many optimizations there, including optimizing RSTRING_PTR()
accesses, using Mode=latency to make better use of compilation for this kind of workload and tweaking the GC heap size with Native Image.
Most of the execution during that time happens in the interpreter and not compiled code, so we'll want to investigate if we can do something about that (e.g., if it deoptimizes where it shouldn't need to).
That's still slower than MRI but I think it's significant progress and it fixes the regression we had which was mostly due to slower RSTRING_PTR()
accesses.
Thanks for this important work, @eregon. This halves time to bundle install
a major production application at Shopify.
You can see in this trace (blue is before, green is after) that we're compiling more methods more quickly now, with fewer cancelled methods. More methods compiled may be due to more deoptimisations. Less methods are cancelled, and we do now manage to drain the compilation queue before the application finishes!
We also now reach a steady state of optimisation 'temperature'.
It does introduce a memory spike from around 1 GB up to around 10 GB for a few seconds.
Can you share anything about your methodology for making these changes?
That's a great view, thanks for sharing.
Re methodology I mostly used jt profile
on native image builds, which produces a flamegraph.
I used @aardvark179's work in progress for a flamegraph also showing an histogram and with coloring to identity Ruby/C methods and compiled/interpreted methods.
Then I basically looked at the top self times in that histogram and figured what I could do to speed that method.
I also used jt metrics time
to get an idea of how much time is spent in GC and based on that tweaked the young generation size when running as a Native Image.
I found --engine.Mode=latency
based on Graal.js' npm
using it. The main change it does is disabling splitting and inlining, which for such a workload seems very beneficial to get more compiled faster.
If anyone wants to profile it more, here are the commands I used:
Between each run:
$ rm -rf gem_home
and make sure to set GEM_HOME to never install globally (otherwise rebuild)
Run with GC info for SVM, bigger young gen and optimizing for latency:
$ GEM_HOME=$PWD/gem_home jt -u native ruby --vm.Xmn1g --vm.XX:+PrintGCSummary --experimental-options --engine.Mode=latency -S bundle install -j1
CPUProfile FlameGraph:
GEM_HOME=$PWD/gem_home jt -u native profile -S bundle install -j1
Metrics FlameGraph (visualizes GC pauses and require time):
$ TRUFFLERUBY_METRICS_REPS=1 GEM_HOME=$PWD/gem_home jt -u native metrics time --flamegraph --vm.Xmn1g --experimental-options --metrics-time-parsing-file --metrics-time-require -S bundle install -j1
@chrisseaton did you see the halving of the bundle install
times after merging @eregon changes i.e. RSTRING_PTR()
optimizations and --engine.Mode=latency
or just by updating the bundle executable to run with the --engine.Mode=latency
?
The reason I ask is that on 19.3.0.2 simply by updating the /opt/graalvm-ce-java8-19.3.0.2/bin/bundle
to run as
exec "$(dirname $SELF_PATH)/ruby" "--experimental-options" "--engine.Mode=latency" "$SELF_PATH" "$@"
halves the bundle install
time for a Rails 5.2.3 bundle install(gem 'rails', '~> 5.2.1'
)
It would be great if we could expect further speed up with 20.0.0 once the RSTRING_PTR()
optimizations are included?
My result is from merging with master a couple of days ago, so the latency mode and string pointer changes.
@petenorth the optimization/bug fix for RSTRING_PTR()
will be in 20.0, however --engine.Mode=latency
and other optimizations will not because the feature freeze for 20.0 was already some time ago and those optimizations do not fix compatibility bugs. We're getting very close to providing nightly builds which should make it easy to try the latest changes (#1483).
@eregon OK, I can easily add them to 20.0 locally once released and then see how a bundle install
for rails compares:
version | modifications | timing |
---|---|---|
19.3.0.2 | 6 mins 13 secs | |
19.3.0.2 | latency mode | 3 mins 40 secs |
19.3.0.2 | latency mode and heap size | 2 mins 25 secs |
MRI 2.5.6 | 54 secs | |
20.0 | 4 mins 53 secs | |
20.0 | latency mode | 3 mins 6 secs |
20.0 | latency mode and heap size | 3 mins 14 secs |
20.1.0-dev | 2 mins 10 secs |
I'd previously only included latency mode to see the halving, including the heap size change results in a reduction by almost a factor 3. The biggest contribution to MRI's time is when it has to install gems with native extensions, more so than for TruffleRuby.
Generally on the nightly truffleruby head builds running via a GitHub workflow I see c. 1 min 50 secs compared to MRI taking 50 secs.
@petenorth the optimization/bug fix for
RSTRING_PTR()
will be in 20.0, however--engine.Mode=latency
and other optimizations will not because the feature freeze for 20.0 was already some time ago and those optimizations do not fix compatibility bugs. We're getting very close to providing nightly builds which should make it easy to try the latest changes (#1483).
Do you mean 20.1 for both as per the change log ?
@petenorth The RSTRING_PTR() optimization/bug fix was backported to the 20.0 release branch. The rest not as it's too big/risky to backport.
@eregon I noticed bundler/rubygems generates a lot of exceptions
See https://gist.github.com/deepj/9a6dccfadd621780242a698274f7396a
This was generated using RUBYOPT=-d ruby bug.rb
where bug.rb
comes from #1895
@deepj Good catch! Could you run the same on MRI and post the output to compare?
@eregon much less exceptions https://gist.github.com/deepj/cb29e92d8eba43129ed107c04adf3d83
@chrisseaton Could you try again your bundle install
with latest truffleruby master?
@deepj Regarding exceptions, I extracted it to #1906.
Heap spike has gone.
Currently running Bundler on TruffleRuby is very slow.
We need to profile, figure out the bottlenecks and optimize them so we get around the same speed as MRI when installing gems via Bundler.
gem install
on TruffleRuby is also not fast for installing gems, there are probably common bottlenecks between the two.