oracle / truffleruby

A high performance implementation of the Ruby programming language, built on GraalVM.
https://www.graalvm.org/ruby/
Other
3.02k stars 185 forks source link

Sidekiq is slow on TruffleRuby #1584

Open eregon opened 5 years ago

eregon commented 5 years ago

From @nateberkopec https://twitter.com/nateberkopec/status/1096880039491133441 Instructions to reproduce at https://gist.github.com/nateberkopec/a612498e9719af6b8775ecdcb8afaa18 (time to process 20 000 empty jobs with Sidekiq).

We should investigate why it's slow. hiredis does not work yet (#1546).

@nateberkopec Does it matter whether the benchmark is run with the Rails app, or just running Sidekiq after the script created the jobs is enough? That would simplify the setup.

nateberkopec commented 5 years ago

I don't think it matters whether or not this is run in a Rails app.

EDIT: I don't have the numbers ready to copypaste, but for me, Truffle native completed the benchmark in about a minute, and MRI was about 40 seconds.

bjfish commented 4 years ago

I was running this today and here's what I found: I also saw pauses between batches of jobs, although short ones, for native but not jvm. Using the --vm.XX:+PrintGC I could confirm these were pauses for GC.

I did run with RAILS_ENV=production as run in development will do many checks for file updates.

I saw these values: MRI 2.6.5 ~ 10 seconds

TruffleRuby native ~ 65 seconds and then with some configuration: TRUFFLERUBYOPT='--vm.Xmx16G --vm.Xmn4G' ~ 34 seconds TRUFFLERUBYOPT='--vm.Xmx16G --vm.Xmn4G --experimental-options --engine.Mode=latency' ~ 26 seconds

With TruffleRuby I saw some delay in startup ( ~10-15 seconds of requiring/loading) which is an area I have been working on improving recently.

gogainda commented 3 years ago

latest results are Truffleruby head (23 sec) vs MRI Ruby 2.7 (23 sec)

eregon commented 3 years ago

On par is progress (thanks for rerunning!), but I think TruffleRuby should be faster here, especially if it's allowed to process jobs in parallel. Seems worth investigating once the new CPUSampler flamegraph support is in.

eregon commented 2 years ago

New numbers from @mperham: https://twitter.com/eregontp/status/1479534410546483201 We should investigate what's slower. Those results seem to be with toxiproxy.

mperham commented 2 years ago

Thanks. I appreciate you investigating this.

eregon commented 2 years ago

I tried running the benchmark but quickly run into non-fatal errors which happen regularly. It also happens on CRuby although less often as CRuby runs the benchmark for less long. https://github.com/mperham/sidekiq/issues/5137

There is also another type of error when running on TruffleRuby:

2022-01-24T12:33:58.904Z pid=138840 tid=fzk ERROR: Error fetching job: Interrupted system call
2022-01-24T12:33:58.905Z pid=138840 tid=fts WARN: Errno::EINTR: Interrupted system call
2022-01-24T12:33:58.905Z pid=138840 tid=ftc WARN: exception.c:61:in `rb_syserr_fail'
exception.c:61:in `rb_syserr_fail'
exception.c:72:in `rb_sys_fail'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/hiredis-0.6.3/ext/hiredis_ext/connection.c:423:in `__get_reply'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/hiredis-0.6.3/ext/hiredis_ext/connection.c:458:in `connection_read'
/home/eregon/.rubies/truffleruby-dev/lib/truffle/truffle/cext_ruby.rb:41:in `read'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/hiredis-0.6.3/lib/hiredis/ext/connection.rb:19:in `read'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/redis-4.2.5/lib/redis/connection/hiredis.rb:55:in `read'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/redis-4.2.5/lib/redis/client.rb:280:in `block in read'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/redis-4.2.5/lib/redis/client.rb:268:in `io'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/redis-4.2.5/lib/redis/client.rb:279:in `read'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/redis-4.2.5/lib/redis/client.rb:131:in `block in call'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/redis-4.2.5/lib/redis/client.rb:248:in `block (2 levels) in process'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/redis-4.2.5/lib/redis/client.rb:389:in `ensure_connected'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/redis-4.2.5/lib/redis/client.rb:238:in `block in process'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/redis-4.2.5/lib/redis/client.rb:325:in `logging'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/redis-4.2.5/lib/redis/client.rb:237:in `process'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/redis-4.2.5/lib/redis/client.rb:131:in `call'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/redis-4.2.5/lib/redis/client.rb:226:in `block in call_with_timeout'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/redis-4.2.5/lib/redis/client.rb:300:in `with_socket_timeout'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/redis-4.2.5/lib/redis/client.rb:225:in `call_with_timeout'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/redis-4.2.5/lib/redis.rb:1224:in `block in _bpop'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/redis-4.2.5/lib/redis.rb:69:in `block in synchronize'
/home/eregon/.rubies/truffleruby-dev/lib/mri/monitor.rb:218:in `mon_synchronize'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/redis-4.2.5/lib/redis.rb:69:in `synchronize'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/redis-4.2.5/lib/redis.rb:1221:in `_bpop'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/redis-4.2.5/lib/redis.rb:1266:in `brpop'
/home/eregon/code/sidekiq/lib/sidekiq/fetch.rb:47:in `block in retrieve_work'
/home/eregon/code/sidekiq/lib/sidekiq.rb:100:in `block in redis'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/connection_pool-2.2.5/lib/connection_pool.rb:63:in `block (2 levels) in with'
<internal:core> core/thread.rb:94:in `handle_interrupt'
<internal:core> core/thread.rb:94:in `handle_interrupt'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/connection_pool-2.2.5/lib/connection_pool.rb:62:in `block in with'
<internal:core> core/thread.rb:94:in `handle_interrupt'
<internal:core> core/thread.rb:94:in `handle_interrupt'
/home/eregon/code/sidekiq/vendor/bundle/truffleruby/3.0.2.9/gems/connection_pool-2.2.5/lib/connection_pool.rb:59:in `with'
/home/eregon/code/sidekiq/lib/sidekiq.rb:97:in `redis'
/home/eregon/code/sidekiq/lib/sidekiq/fetch.rb:47:in `retrieve_work'
/home/eregon/code/sidekiq/lib/sidekiq/processor.rb:83:in `get_one'
/home/eregon/code/sidekiq/lib/sidekiq/processor.rb:95:in `fetch'
/home/eregon/code/sidekiq/lib/sidekiq/processor.rb:77:in `process_one'
/home/eregon/code/sidekiq/lib/sidekiq/processor.rb:68:in `run'
/home/eregon/code/sidekiq/lib/sidekiq/processor.rb:67:in `run'
/home/eregon/code/sidekiq/lib/sidekiq/util.rb:56:in `watchdog'
/home/eregon/code/sidekiq/lib/sidekiq/util.rb:65:in `block in safe_thread'

This might be a bug in hiredis to not handle EINTR correctly (EINTR should always be retried). UPDATE: actually it's an issue of rb_thread_fd_select() in TruffleRuby => GR-36556. That might not happen on CRuby due to Thread.handle_interrupt(Exception => :never), not sure. And this not only creates errors but also prevents profiling, because profiling needs to know the thread backtraces (like Thread#backtrace) and that needs to interrupt system calls with SIGVTALRM, very similar to what CRuby does for e.g. Ctrl+C.

gogainda commented 2 years ago

FYI, in my recent tests not using hiredis improves a bit overall performance. Not sure if using it brings any benefits to Redis on Truffleruby

gogainda commented 2 years ago

I see different numbers now: 45s vs 21s . Still much slower but difference is not x4 anymore, which is good UPDATE: I rerun it without hiredis and saw similar results 45s vs 14s, so in case of TruffleRuby it doesn't make much difference

eregon commented 2 years ago

Right, we should look at this again. I was focused on Ruby 3 keyword arguments and other things but that's done now, and the fix has been merged in sidekiq (https://github.com/mperham/sidekiq/commit/1efbaddcefb75a958872f73b5614ea91ddf35be6).

Actually on master sidekiqload now doesn't use hiredis, so I think we should start with that, hiredis likely only adds complexity. For instance there is an issue with interruption of rb_thread_fd_select() in TruffleRuby, only happens with hiredis (GR-36556), and that prevents profiling (obviously we should fix that, but good if it doesn't block this, and it doesn't seem so easy).

aardvark179 commented 2 years ago

I've been taking a look at this./ TruffleRUby is faster at the JSON generation required for creating jobs,. but it's slower at IO for a variety of reasons. The JSON generation could likely be made faster by optimising Float#to_s which is most of the time spent on the JSON generation, and I'm looking at using poll instead of select for IO waiting. to reduce the overheads in that part of the code.

aardvark179 commented 2 years ago

We also appear to be much slower at parsing the job json, so validating the job by dumping it to json and parsing it back is not working so well on TruffleRuby.