Closed: faraazahmad closed this issue 7 months ago.
Thank you for the report, we'll investigate. What changed compared to https://ruby.social/@faraaz/111217973733226631 where you mentioned 60 seconds for 23.0?
Over there I was running the benchmarks manually, parsing each file 5 times to see how big the time difference was, hence the 60 seconds.
Here are the results I got (discourse at 5d632fd30a50747ec9d1d1f0940428a15df781a2), with this Gemfile:

```ruby
source "https://rubygems.org"

gem 'parallel'
gem 'prism'
```
```ruby
require 'parallel'
require 'prism'
require 'benchmark'

puts RUBY_DESCRIPTION

GLOB = "#{Dir.home}/code/discourse/**/*.rb"

def bench
  Parallel.each(Dir.glob(GLOB), in_threads: 8) do |file|
    Prism.parse_file(file)
  end
end

3.times do
  puts Benchmark.measure { bench }
end
```
```
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
2.021850 0.158122 2.179972 ( 2.191742)
1.910226 0.123122 2.033348 ( 2.045011)
1.934800 0.136041 2.070841 ( 2.082889)

ruby 3.3.0preview1 (2023-05-12 master a1b01e7701) [x86_64-linux]
2.037098 0.149761 2.186859 ( 2.198666)
1.922396 0.119217 2.041613 ( 2.053616)
1.983681 0.105089 2.088770 ( 2.100635)

truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
59.349771 5.405589 64.755360 ( 5.632704)
51.509924 2.931996 54.441920 ( 4.769414)
51.910072 2.149475 54.059547 ( 4.772589)

truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-linux]
78.095965 5.389323 83.485288 ( 6.333837)
52.825928 3.358561 56.184489 ( 4.465966)
49.908065 3.105139 53.013204 ( 4.212045)
49.285708 3.119749 52.405457 ( 4.172708)
42.386188 2.482132 44.868320 ( 4.026233)
31.300244 1.631678 32.931922 ( 3.646213)
28.311627 1.933647 30.245274 ( 3.637510)
24.894860 1.711678 26.606538 ( 3.543606)
24.876030 1.451286 26.327316 ( 3.287879)
24.713629 1.640005 26.353634 ( 3.416895)
```
My CRuby results are close to yours. However, the TruffleRuby Native results are much faster (~3x), most likely because TruffleRuby Native uses Native Image G1 on Linux, which is not yet available for Native Image on macOS. TruffleRuby JVM is also much faster (~2x) than your results; I'm not sure why.

Do you have 8 physical cores? I would guess 4, which could explain that last difference. On my Linux machine I have 8 physical cores. In general it's best not to count hyperthreads, as those provide very little parallelism.
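For reference, a stdlib-only way to compare the two counts (the `/proc/cpuinfo` parsing is a Linux-specific heuristic, not an official API; elsewhere it falls back to the logical count):

```ruby
require 'etc'

# Logical CPUs (this counts hyperthreads):
logical = Etc.nprocessors

# On Linux, physical cores can be estimated from /proc/cpuinfo by counting
# unique (physical id, core id) pairs.
physical =
  if File.readable?("/proc/cpuinfo")
    File.read("/proc/cpuinfo")
        .split(/\n{2,}/)
        .map { |cpu| [cpu[/^physical id\s*:\s*(\d+)/, 1], cpu[/^core id\s*:\s*(\d+)/, 1]] }
        .reject { |pair| pair.any?(&:nil?) }
        .uniq
        .size
  else
    logical
  end
physical = logical if physical.zero? # some kernels omit the ids entirely

puts "#{logical} logical / #{physical} physical"
```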
Speaking of parallelism, I also ran the benchmark without `Parallel.each`:
```ruby
require 'parallel'
require 'prism'
require 'benchmark'

puts RUBY_DESCRIPTION

GLOB = "#{Dir.home}/code/discourse/**/*.rb"

def bench
  Dir.glob(GLOB) do |file|
    Prism.parse_file(file)
  end
end

3.times do
  puts Benchmark.measure { bench }
end
```
and that gives:
```
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
1.959116 0.133027 2.092143 ( 2.104274)
1.856273 0.124001 1.980274 ( 1.991644)
1.869340 0.115209 1.984549 ( 1.995774)

ruby 3.3.0preview1 (2023-05-12 master a1b01e7701) [x86_64-linux]
1.935369 0.131056 2.066425 ( 2.078142)
1.848565 0.103314 1.951879 ( 1.962795)
1.846412 0.107251 1.953663 ( 1.964553)

truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
45.736257 2.614669 48.350926 ( 12.628425)
15.363574 0.553676 15.917250 ( 10.551924)
12.899212 0.578461 13.477673 ( 10.534125)

truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-linux]
74.832125 3.108128 77.940253 ( 14.254892)
29.320961 1.172517 30.493478 ( 11.530793)
19.883169 0.831649 20.714818 ( 11.318680)
14.190805 0.639462 14.830267 ( 11.210458)
11.789571 0.529738 12.319309 ( 11.132813)
12.350481 0.629661 12.980142 ( 11.234333)
```
We see basically no difference on CRuby, as expected due to the GIL. We see a speedup on TruffleRuby (2.2x on Native, 3.2x on JVM), but it is not linear in the number of cores. I think that is because this benchmark is very allocation-heavy: while TruffleRuby/the JVM can allocate in parallel, the GC (well, except SerialGC on SVM on macOS) runs in parallel regardless of the number of Ruby threads. So with more Ruby threads the GC has the same amount of work but does not get faster (or not much; having multiple TLABs might help a bit). That is the likely bottleneck for scaling here.
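The GIL effect is easy to see with a small CPU-bound experiment; this is a sketch, and the actual ratios depend on the machine and the Ruby implementation:

```ruby
require 'benchmark'

# A CPU-bound task with no I/O, so CRuby's GIL prevents any thread-level speedup.
def cpu_work
  (1..500_000).reduce(:+)
end

sequential = Benchmark.realtime { 8.times { cpu_work } }
threaded   = Benchmark.realtime { 8.times.map { Thread.new { cpu_work } }.each(&:join) }

# On CRuby `threaded` stays close to `sequential`; on TruffleRuby/JRuby it can
# approach sequential divided by the core count.
puts format("sequential: %.3fs, threaded: %.3fs", sequential, threaded)
```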
That leaves the question of why this workload is quite a bit slower on TruffleRuby. I'll keep digging; it's probably something related to deserialization in Prism.
Here is a flamegraph from 23.1 Native, generated with `be ruby --cpusampler=flamegraph bench.rb 3` (and without `parallel`).

To get the interactive version you need to download the SVG file and open it in a browser; GitHub disables the JavaScript parts of it.
The most obvious thing there: 35% of the time is spent in `Prism::Source#compute_offsets`, which is something the Prism C extension on CRuby does not need to do, since it uses the offsets from the C parser instead. We should serialize those offsets in Prism so they don't need to be recomputed when deserializing.
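For context, computing line offsets amounts to a single pass over the source recording where each line starts; a minimal sketch, not Prism's actual implementation:

```ruby
# Byte offset of the start of each line; line/column positions can then be
# derived by a binary search over this array. Sketch only.
def compute_line_offsets(source)
  offsets = [0]
  source.b.each_char.with_index do |char, index|
    offsets << index + 1 if char == "\n"
  end
  offsets
end

compute_line_offsets("ab\ncd\n")  # => [0, 3, 6]
```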
Something else we see, when pressing `c` twice to show the warmup colors, is that `load_node` is never compiled. That's because of:
```
$ be ruby --engine.TraceCompilation bench.rb 3 |& grep 'opt failed'
[engine] opt failed id=12203 Prism::Serialize::Loader#load_node |Tier 1|Time 344(-18388702+18389045)ms|Reason: org.graalvm.compiler.truffle.compiler.GraphTooBigBailoutException: Graph too big to safely compile. Node count: 113349. Graph Size: 150001. Limit: 150000.|Timestamp 18389045465543|Src serialize.rb:202
[engine] opt failed id=12203 Prism::Serialize::Loader#load_node |Tier 1|Time 328(-18389046+18389374)ms|Reason: org.graalvm.compiler.truffle.compiler.GraphTooBigBailoutException: Graph too big to safely compile. Node count: 113233. Graph Size: 150007. Limit: 150000.|Timestamp 18389373861547|Src serialize.rb:202
```
So that's a big bottleneck, as the interpreter is much slower than JIT-compiled code, and it also allocates a lot more.

`load_node` looks like this: https://gist.github.com/eregon/a0ed168abcc6c1671d85d9199ff49b72#file-serialize-rb-L203-L493 (this is from prism master, FWIW).
And TruffleRuby does not currently optimize `case n when 1 then body1 when 2 then body2 when 3 then body3 end` to jump directly to `body3` without first checking `n == 1` and `n == 2`.
I'll try a different approach in Prism and see if that helps.
Using the simplified single-threaded benchmark (no `parallel` gem, so no need to run it with Bundler; it can be run like `ruby -I../prism/lib bench.rb 3`):
```ruby
require 'prism'
require 'benchmark'

puts RUBY_DESCRIPTION

GLOB = "#{Dir.home}/code/discourse/**/*.rb"

def bench
  Dir.glob(GLOB) do |file|
    Prism.parse_file(file)
  end
end

N = Integer(ARGV.first || 3)
N.times do
  puts Benchmark.measure { bench }
end
```
and using prism master, I get:
```
truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
46.107483 3.218483 49.325966 ( 12.917207)
18.562016 0.626841 19.188857 ( 10.795933)
12.500946 0.552576 13.053522 ( 10.574041)
```
(basically the same as above)
And by changing `load_node` to be a lambda per node type, I get:
```
truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
61.172579 2.250903 63.423482 ( 8.873968)
37.859864 2.043808 39.903672 ( 7.481593)
28.697565 0.795674 29.493239 ( 6.525924)
16.106719 0.630867 16.737586 ( 6.211813)
13.460437 0.432854 13.893291 ( 6.023979)
7.502166 0.442928 7.945094 ( 5.842586)
9.473145 0.528005 10.001150 ( 5.932664)
8.603267 0.413352 9.016619 ( 5.841062)
7.952528 0.433553 8.386081 ( 5.785633)
6.912824 0.363138 7.275962 ( 5.723259)
```
Almost a 2x speedup! And no more `opt failed`.
Here is the flamegraph, with `ruby -I../prism/lib --cpusampler=flamegraph bench.rb 3`:

Now everything hot in the histogram is compiled, and 60% of the time is spent in `Prism::Source#compute_offsets`.
> by changing `load_node` to be a lambda per node type

I'm curious as to how this helps. Does it help us get around this limitation?

> And TruffleRuby does not currently optimize `case n when 1 body1 when 2 body2 when 3 body3` to jump to `body3` without checking `n == 1` and `n == 2` before.
I've pushed my changes so far to https://github.com/ruby/prism/compare/main...eregon:yarp:tweaks.
> I'm curious as to how this helps. Does it help us get around this limitation?

Yes: since the lambdas are in an Array, dispatch is just an O(1) `Array#[]`. But also the lambdas are small enough that they can be JIT-compiled, whereas the 300-line `load_node` was too big to be JIT-compiled.
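To make the shape of that change concrete, here is a toy version with made-up node types and bodies (Prism's real loaders live in serialize.rb and read from a buffer):

```ruby
# Linear dispatch: each `when` is tested in order until one matches.
def load_node_case(type)
  case type
  when 1 then :program_node
  when 2 then :statements_node
  when 3 then :integer_node
  end
end

# Table dispatch: a single Array#[] lookup, and each lambda is a small,
# independently JIT-compilable unit instead of one 300-line method.
LOADERS = [
  nil, # node types start at 1
  -> { :program_node },
  -> { :statements_node },
  -> { :integer_node },
].freeze

def load_node_table(type)
  LOADERS[type].call
end
```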
About `Prism::Source#compute_offsets`: I'm not convinced that's the biggest bottleneck, because I also did a small benchmark doing only that part:
```ruby
require 'prism'
require 'benchmark'

puts RUBY_DESCRIPTION

GLOB = "#{Dir.home}/code/discourse/**/*.rb"

def bench
  Dir.glob(GLOB) do |file|
    code = File.binread(file)
    Prism::Source.new(code)
  end
end

N = Integer(ARGV.first || 3)
N.times do
  puts Benchmark.measure { bench }
end
```
But that is much faster and does not take 60% of the ~5.7 seconds/iteration above:
```
$ ruby -I../prism/lib compute_offsets.rb 10
truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
4.779324 0.474053 5.253377 ( 0.766970)
3.351564 0.134098 3.485662 ( 0.499833)
1.764714 0.208363 1.973077 ( 0.286704)
1.472017 0.086878 1.558895 ( 0.224722)
2.584139 0.075350 2.659489 ( 0.377355)
2.326343 0.271429 2.597772 ( 0.380175)
1.767078 0.085944 1.853022 ( 0.264943)
1.954663 0.088937 2.043600 ( 0.282534)
1.541035 0.083524 1.624559 ( 0.230307)
1.543260 0.074147 1.617407 ( 0.228676)
```
So I wonder if the profiler is misleading there.
But it could also be a more complicated interaction, where the Regexp used by `compute_offsets` is JIT-compiled late because there are many more things to compile in the full benchmark.

In any case it seems worthwhile to avoid recomputing the line offsets, so I'll work on a way to serialize and deserialize them.
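One possible shape for that, with a hypothetical wire format just to illustrate the idea (the real serializer would choose its own encoding):

```ruby
# Pack the precomputed offsets as a uint32 count followed by uint32 values,
# so the loader can read them back instead of rescanning the source.
def dump_offsets(offsets)
  [offsets.size, *offsets].pack("L*")
end

def load_offsets(blob)
  count = blob.unpack1("L")
  blob.unpack("L*")[1, count]
end

offsets = [0, 14, 29, 57]
load_offsets(dump_offsets(offsets))  # => [0, 14, 29, 57]
```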
Not sure if this helps, but if I just create a new `Source`, like you did in the last example, `Array#<<` barely takes any time. However, if I get to this method by calling `Prism#parse`, only then does it take this long. My amateur guess is that since it goes through the call chain `Prism#parse` -> `Prism#load` -> `Serialize#load` -> `Source#new`, there's a lot of string allocation happening; could it be an issue with insufficient TLAB space?
It definitely takes slightly longer when I use `Parallel.map` vs `Parallel.each`, I'm guessing because `Parallel.map` stores the results to return them to the main thread. Here are some benchmarks:

`Parallel.each`:
```
truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-darwin]
86.077456 10.028982 96.106438 ( 13.411521)
52.458730 9.163018 61.621748 ( 8.776318)
50.073878 9.241329 59.315207 ( 8.576140)
50.344392 9.272874 59.617266 ( 8.583368)
45.094790 9.200160 54.294950 ( 8.307919)
```
`Parallel.map`:
```
truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-darwin]
117.025375 11.182852 128.208227 ( 17.811838)
82.708486 10.822153 93.530639 ( 13.395469)
79.538424 12.084877 91.623301 ( 13.385311)
72.459464 11.287978 83.747442 ( 12.670169)
71.959355 10.746600 82.705955 ( 12.457304)
```
And if I simply use the `Thread` class and store the result of `Prism.parse` using `Thread.current[]=`, it takes even longer:
```ruby
require 'parallel'
require 'prism'
require 'benchmark'

puts RUBY_DESCRIPTION

GLOB = "#{Dir.home}/oss/discourse/**/*.rb"

def bench_prism
  Parallel.map(Dir.glob(GLOB), in_threads: 4) do |file|
    Prism.parse_file(file)
  end
end

def bench_prism_thread
  threads = []
  Dir.glob(GLOB).each do |file|
    threads << Thread.new { Thread.current[:tree] = Prism.parse_file(file) }
  end
  threads.each(&:join)
end

n = Integer(ARGV[0] || 3)
n.times do
  puts Benchmark.measure { bench_prism_thread }
end
```
Results:
```
truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-darwin]
149.042894 16.755828 165.798722 ( 23.293313)
81.487260 25.510818 106.998078 ( 18.358081)
80.045864 23.758906 103.804770 ( 15.674315)
73.355320 29.309650 102.664970 ( 17.059532)
70.558519 25.700033 96.258552 ( 15.057084)
```
The following is the same as the above, just without `Thread.current[]=`:
```
truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-darwin]
170.567457 17.339795 187.907252 ( 25.313308)
76.937372 19.420802 96.358174 ( 13.899857)
71.617192 17.945206 89.562398 ( 12.449902)
73.299595 21.708841 95.008436 ( 13.741568)
68.685054 19.843938 88.528992 ( 12.565670)
```
I seem to be getting similar times in all 3 versions when using `FixedThreadPool` from the `concurrent-ruby` gem:
```ruby
require 'parallel'
require 'prism'
require 'benchmark'
require 'concurrent'

puts RUBY_DESCRIPTION

GLOB = "#{Dir.home}/oss/discourse/**/*.rb"

def thread_pool
  pool = Concurrent::FixedThreadPool.new(4)
  Dir.glob(GLOB).each do |file|
    pool.post { Prism.parse_file(file) }
  end
  pool.shutdown
end

n = Integer(ARGV[0] || 3)
n.times do
  puts Benchmark.measure { thread_pool }
end
```
```
ruby 3.3.0preview2 (2023-09-14 master e50fcca9a7) [x86_64-darwin22]
0.053338 0.098111 0.151449 ( 0.151505)
2.020494 0.264774 2.285268 ( 2.286550)
1.847566 0.235160 2.082726 ( 2.083523)
2.036501 0.249373 2.285874 ( 2.287998)
1.811042 0.232954 2.043996 ( 2.044978)

truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-darwin]
8.664981 0.387264 9.052245 ( 1.492148)
8.049466 0.674221 8.723687 ( 1.126538)
9.955973 0.848196 10.804169 ( 1.389465)
7.546957 0.678019 8.224976 ( 1.063006)
11.698075 1.076263 12.774338 ( 1.645159)

truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM Native [x86_64-darwin]
2.262696 0.141141 2.403837 ( 0.570721)
4.783651 0.380178 5.163829 ( 1.037267)
6.059194 0.504666 6.563860 ( 1.324881)
7.596199 0.825799 8.421998 ( 1.777206)
8.759423 1.159531 9.918954 ( 1.962771)
```
Is this legit or am I exiting before all the execution can complete?
I think `shutdown` will not wait for the threads or tasks to be done: https://ruby-concurrency.github.io/concurrent-ruby/master/Concurrent/ThreadPoolExecutor.html#shutdown-instance_method

You'd probably need `wait_for_termination` for that.
From https://github.com/ruby/prism/issues/2402#issuecomment-1942061414:

> https://github.com/ruby/prism/pull/2381 helps a ton, it's around 2.7s per iteration before (and it is also faster with YJIT).
Such a big speedup was unexpected, so I investigated.
The problem is that before https://github.com/ruby/prism/pull/2381, both the filename String and the contents String were copied multiple times from a managed `byte[]` to a native `char*` and back.

This is because `pm_serialize_parse` is declared with a `:pointer` argument and we pass it a Ruby String. In that case TruffleRuby (at least currently) mutates the String in-place to use a native representation (`Truffle::CExt.string_to_ffi_pointer`).

But then, just after serialization is done, there would be a copy due to some `.b` call, which uses `dup`; internally, `String#initialize_copy` for a native string makes a copy of it (it needs to, as it is a MutableTruffleString). And then later the String would need to be managed again, for Regexp matching.
Changing the code in Prism's ffi.rb to avoid using that Ruby String for `pm_serialize_parse` in the `Prism.parse_file` case makes it look a lot better. On 24aae7aa51ce2877adc0718cbe6cedca532f01fe (before https://github.com/ruby/prism/pull/2381):
```
truffleruby 24.1.0-dev-93693825, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-linux]
$ ruby -Ilib -rprism -rbenchmark -e '30.times { p Benchmark.realtime { Dir.glob("lib/**/*.rb") { |f| Prism.parse_file(f) } } }'
5.08163639200211
3.2185959659982473
2.7246216619969346
2.493555128996377
2.5304376729982323
2.626141545995779
2.561074762001226
2.6626226649968885
2.451639045000775
2.442975819998537
2.37616028800403
2.342963579998468
2.492586234002374
2.4081202760062297
2.2901282259990694
2.247561003001465
2.216280453998479
2.2374196180026047
2.20327866399748
2.1944359460030682
2.227331458001572
2.290716622999753
2.23880834600277
2.266132271004608
2.235398062999593
2.2192147110035876
2.2535331350009074
2.1951458570038085
2.170285389998753
2.146207225996477
```
Still on 24aae7aa51ce2877adc0718cbe6cedca532f01fe, but with the change in ffi.rb:
```
2.47568983000383
0.568289217997517
0.30265256599523127
0.38206838199403137
0.46137602499948116
0.3988921729978756
0.31418989599478664
0.19933188000140944
0.20126643000548938
0.2556454750010744
0.3183238799974788
0.22124192000046605
0.20216184399760095
0.1891287110047415
0.18731482300063362
0.21746326200081967
0.14462887900299393
0.1365925350037287
0.17602010799600976
0.14211552999768173
0.13566503999754786
0.23790449000080116
0.14238863500213483
0.12545616099669132
0.1537287609971827
0.1526584540042677
0.1223834429983981
0.12089217000175267
0.15295368800434517
0.11997614600113593
```
CRuby 3.3:
```
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) +YJIT [x86_64-linux]
$ ruby --yjit -Ilib -rprism -rbenchmark -e '10.times { p Benchmark.realtime { Dir.glob("lib/**/*.rb") { |f| Prism.parse_file(f) } } }'
0.14406701100233477
0.0955199759991956
0.10004908599512419
0.10911789700185182
0.10410506899643224
0.09660588300175732
0.11423506699793506
0.09826613799668849
0.08650715499970829
0.10329519699735101
$ PRISM_FFI_BACKEND=1 ruby --yjit -Ilib -rprism -rbenchmark -e '10.times { p Benchmark.realtime { Dir.glob("lib/**/*.rb") { |f| Prism.parse_file(f) } } }'
0.24653413399937563
0.2043960579976556
0.16812661400035722
0.1922464890012634
0.1833490009958041
0.16330030000244733
0.19229596100194613
0.1730906020020484
0.17741290900448803
0.17981221600348363
```
> And in that case TruffleRuby (at least currently) mutates the String in-place to use a native representation (`Truffle::CExt.string_to_ffi_pointer`).
I checked, and this is necessary for the FFI `:pointer` type but not for the `:string` type: https://github.com/ffi/ffi/issues/1080

So one good fix there is to change TruffleRuby's FFI backend to not mutate the Ruby String for the `:string` type.
It also needs changes in Prism so that it uses `:string` instead of `:pointer` for `const char*`; that's trivial.
Argh, that idea sort of falls apart because of:

```c
PRISM_EXPORTED_FUNCTION void pm_serialize_parse(pm_buffer_t *buffer, const uint8_t *source, size_t size, const char *data);
```
So both `source` and `data` could be `:string`, but that's a problem because `:string` in FFI only accepts a String with no null bytes (`"\0"`), and `data` has null bytes.

We could work around that by manually declaring the types, but it's not so nice. It would be nice if FFI let Strings with null bytes through, or had another type allowing them: https://github.com/ffi/ffi/issues/1081
With this change in Prism:
```diff
diff --git a/lib/prism/ffi.rb b/lib/prism/ffi.rb
index 3418d787..639185f3 100644
--- a/lib/prism/ffi.rb
+++ b/lib/prism/ffi.rb
@@ -19,14 +19,23 @@ module Prism
     # Convert a native C type declaration into a symbol that FFI understands.
     # For example:
     #
-    #     const char * -> :pointer
+    #     const char * -> :string
+    #     pm_buffer_t * -> :pointer
     #     bool -> :bool
     #     size_t -> :size_t
     #     void -> :void
     #
-    def self.resolve_type(type)
+    def self.resolve_type(type, return_type)
       type = type.strip
-      type.end_with?("*") ? :pointer : type.delete_prefix("const ").to_sym
+      if !return_type && (type == "const char *" || type == "const uint8_t *")
+        # This type avoids extra copies when passing Ruby String arguments to native code.
+        # We don't use it for the return type because we need the pointer, e.g. for pm_string_source().
+        :string
+      elsif type.end_with?("*")
+        :pointer
+      else
+        type.delete_prefix("const ").to_sym
+      end
     end

     # Read through the given header file and find the declaration of each of the
@@ -56,10 +65,14 @@
       # Resolve the type of the argument by dropping the name of the argument
       # first if it is present.
-      arg_types.map! { |type| resolve_type(type.sub(/\w+$/, "")) }
+      arg_types.map! { |type| resolve_type(type.sub(/\w+$/, ""), false) }
+
+      if name == 'pm_serialize_parse'
+        arg_types = [:pointer, :string, :size_t, :pointer]
+      end

       # Attach the function using the FFI library.
-      attach_function name, arg_types, resolve_type(return_type)
+      attach_function name, arg_types, resolve_type(return_type, true)
     end

     # If we didn't find all of the functions, raise an error.
```
And with fixes in TruffleRuby to optimize `:string`, we get the expected performance:
```
$ ruby -Ilib -rprism -rbenchmark -e '30.times { p Benchmark.realtime { Dir.glob("lib/**/*.rb") { |f| Prism.parse_file(f) } } }'
...
0.11793946800025878
0.11649627699807752
0.11632051299966406
```
(vs ~2.5 before, this is on jvm-ee)
Using the single-threaded benchmark from https://github.com/oracle/truffleruby/issues/3293#issuecomment-1759790280 and https://github.com/ruby/prism/pull/2426:
```
$ ruby -I../prism/lib bench.rb
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
2.190895 0.117062 2.307957 ( 2.321824)
2.141946 0.133974 2.275920 ( 2.290421)
2.233112 0.120049 2.353161 ( 2.368065)

$ ruby -I../prism/lib bench.rb 10
truffleruby 24.1.0-dev-93693825, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
28.447562 1.234428 29.681990 ( 4.226516)
18.374132 0.209007 18.583139 ( 2.654558)
17.659575 0.203733 17.863308 ( 2.496900)
18.335423 0.628889 18.964312 ( 2.595021)
19.378060 0.839551 20.217611 ( 2.658917)
17.002796 0.214466 17.217262 ( 2.470285)
15.576381 0.179887 15.756268 ( 2.257773)
11.338724 0.178438 11.517162 ( 2.168090)
8.109505 0.173258 8.282763 ( 2.157561)
6.059230 0.138583 6.197813 ( 2.138763)
```
So it's about the same performance as CRuby 3.3.0 single-threaded.
Now with the parallel benchmark from https://github.com/oracle/truffleruby/issues/3293#issuecomment-1759636133:
```
$ ruby -I../prism/lib bench_parallel.rb 5
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
2.244563 0.164503 2.409066 ( 2.425282)
2.192646 0.130990 2.323636 ( 2.339039)
2.047115 0.147734 2.194849 ( 2.206869)
2.246744 0.125164 2.371908 ( 2.387293)
2.127249 0.138846 2.266095 ( 2.280787)

$ ruby -I../prism/lib bench_parallel.rb 10
truffleruby 24.1.0-dev-93693825, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
22.552076 1.075381 23.627457 ( 1.969642)
14.322148 0.595175 14.917323 ( 1.229991)
9.793991 0.343925 10.137916 ( 0.825214)
8.694108 0.291115 8.985223 ( 0.760018)
8.244202 0.221496 8.465698 ( 0.700302)
7.944729 0.268573 8.213302 ( 0.689711)
8.077947 0.286265 8.364212 ( 0.697297)
7.915530 0.271384 8.186914 ( 0.668313)
7.710315 0.256563 7.966878 ( 0.649437)
7.871159 0.258862 8.130021 ( 0.671472)

$ ruby -I../prism/lib bench_parallel.rb 30
truffleruby 24.1.0-dev-0e0df31f, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-linux]
...
5.939668 0.246157 6.185825 ( 0.497786)
6.813118 0.250171 7.063289 ( 0.569874)
6.008750 0.352802 6.361552 ( 0.514425)
5.993626 0.230494 6.224120 ( 0.515127)
6.387444 0.261654 6.649098 ( 0.531733)
5.805497 0.216181 6.021678 ( 0.490501)
5.956375 0.254065 6.210440 ( 0.508663)
5.919400 0.244198 6.163598 ( 0.502847)
```
So that's about 3.5x-5x as fast (with 8 threads)! It's not linear scaling, I suspect due to the reasons mentioned at the end of https://github.com/oracle/truffleruby/issues/3293#issuecomment-1759636133.
Out of curiosity I also tried `in_processes: 8` on CRuby, and that gives timings like 0.515503.
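A stdlib sketch of what `in_processes` does: fork worker processes (so no shared GIL) and send results back over a pipe. The chunks here are stand-in data, not the real file lists:

```ruby
chunks = [[1, 2, 3], [4, 5, 6]] # stand-ins for per-worker lists of files

reader, writer = IO.pipe
pids = chunks.map do |chunk|
  fork do
    reader.close
    # Each child does its share of the CPU-bound work in its own process.
    writer.puts(chunk.sum)
    writer.close
  end
end
writer.close
pids.each { |pid| Process.waitpid(pid) }

partial_sums = reader.read.split.map(&:to_i)
partial_sums.sum  # => 21
```

This assumes a fork-capable platform (Linux/macOS); the trade-off versus threads is copying inputs and results between processes.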
I've got a file where I'm trying to parse all the Ruby files in the discourse codebase in parallel. I am using the Prism and Parallel gems to do so. Here's my Gemfile.lock:
I ran this code using CRuby 3.3.0-preview1, truffleruby-23.0.0 (native), and truffleruby-23.1.0 (native). Here are the results:
I threw in the GraalVM version for good measure; it seemed to be JIT compiling pretty well but is still slower than CRuby.
TruffleRuby seems considerably slower than CRuby in these benchmarks. Is this a bug?