hanabi1224 / Programming-Language-Benchmarks

Yet another implementation of computer language benchmarks game
https://programming-language-benchmarks.vercel.app/
MIT License

Update to TruffleRuby 21.1.0 #111

Closed · eregon closed this 3 years ago

eregon commented 3 years ago

TruffleRuby 21.1.0 was released: https://github.com/oracle/truffleruby/releases/tag/vm-21.1.0

Could you update to the new version and run the benchmarks? I think there should be some good improvements there :)

eregon commented 3 years ago

It looks like it's already updated (https://programming-language-benchmarks.vercel.app/ruby). Sorry, I was looking at a cached page I guess.

Looks like fasta and prime-sieve didn't run for some reason.

hanabi1224 commented 3 years ago

@eregon Those two are manually disabled due to extremely bad performance, even with the latest 21.1.

eregon commented 3 years ago

That's weird, because for me locally for fasta:

$ ruby fasta-6.rb 2500000 >/dev/null
2.7.2: 3.071
21.0.0: 21.545
21.1.0: 4.951

So 21.1.0 is much better here, and quite close to CRuby 2.7.2.

If I don't do the >/dev/null and let it output to my terminal:

$ ruby fasta-6.rb 2500000
2.7.2: 4.642
21.0.0: 28.985
21.1.0: 6.085

Could you try locally and see if you observe similar performance?

How long was the run in GitHub Actions? Did it time out somehow?

hanabi1224 commented 3 years ago

@eregon Ah, you are right! Fasta is much faster with 21.1; I probably only tested the concurrent prime sieve one. It's now enabled, and the site will be updated automatically when the CI build is done.

BTW, the concurrent prime sieve script seems to hit a deadlock or something and takes forever to run; not sure where to report the bug, though.

eregon commented 3 years ago

Thanks.

You can report it here: https://github.com/oracle/truffleruby/issues

I'm not sure if it's intentional, but that benchmark creates a new coroutine per prime number, which is very expensive. So for n=2000 that's 1+2000 Fibers/coroutines created, and for n=5000 that's 5001 coroutines created (and alive at the same time).

TruffleRuby (and JRuby) currently use native threads for implementing Fibers, and that's not a great fit when many Fibers are created like here: https://github.com/oracle/truffleruby/blob/master/doc/user/compatibility.md#fibers-do-not-have-the-same-performance-characteristics-as-in-mri

It should be much better once Loom becomes stable and is adopted in TruffleRuby. There was a recent issue reporting that creating many Fibers is slow, which could potentially be related: https://github.com/oracle/truffleruby/issues/2325

So for now it's probably best to keep excluding prime sieve on TruffleRuby.

I'd recommend renaming the benchmark to include "coroutine" or "coro" in the name, because computing the sieve is basically no work compared to handling all those coroutines and the communication between them. For a typical sieve implementation you would find very different results and TruffleRuby would be the fastest by far: https://github.com/jtulach/sieve#ruby-speed
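For contrast, a conventional sieve is just tight array work with no coroutines or channels involved, which is exactly the kind of loop an optimizing runtime handles very well. A minimal sketch in Java (illustrative only, not the code from the linked repo):

```java
public class Sieve {
    // Flat boolean-array Sieve of Eratosthenes: no scheduling or
    // communication, just a tight loop over an array.
    static int countPrimes(int limit) {
        boolean[] composite = new boolean[limit + 1];
        int count = 0;
        for (int i = 2; i <= limit; i++) {
            if (composite[i]) continue;
            count++;
            // Mark multiples starting at i*i; use long to avoid overflow.
            for (long j = (long) i * i; j <= limit; j += i)
                composite[(int) j] = true;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countPrimes(1_000_000)); // 78498
    }
}
```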

hanabi1224 commented 3 years ago

I'm not sure if it's intentional

That is intentional: the problem is used on the Go home page to demonstrate goroutine performance. It's not there to demonstrate the best way of building a prime-finding program, but to stress the essentials of coroutines, fibers, virtual/green threads, etc., and to measure the performance of massive coroutine scheduling and communication.

BTW, the Java Loom implementation is also there, but it's not performing well so far compared to Kotlin coroutines (it's just in the EA phase anyway).
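For reference, the coroutine-per-prime pipeline that all of these implementations share looks roughly like this in Java with Loom virtual threads, using SynchronousQueue as the rendezvous channel (a minimal sketch under those assumptions, not the repo's actual benchmark code):

```java
import java.util.concurrent.SynchronousQueue;

public class CoroPrimeSieve {
    public static void main(String[] args) throws InterruptedException {
        int n = args.length > 0 ? Integer.parseInt(args[0]) : 100;

        SynchronousQueue<Integer> chan = new SynchronousQueue<>();
        SynchronousQueue<Integer> src = chan; // effectively final copy for the lambda
        Thread.startVirtualThread(() -> generate(src)); // emits 2, 3, 4, ...

        for (int k = 0; k < n; k++) {
            int prime = chan.take(); // first value to survive all filters is prime
            System.out.println(prime);
            SynchronousQueue<Integer> in = chan;
            SynchronousQueue<Integer> out = new SynchronousQueue<>();
            // One new coroutine (virtual thread) per prime found.
            Thread.startVirtualThread(() -> filter(in, out, prime));
            chan = out;
        }
        System.exit(0); // remaining stages block forever on put(); just exit
    }

    static void generate(SynchronousQueue<Integer> out) {
        try {
            for (int i = 2; ; i++) out.put(i);
        } catch (InterruptedException ignored) {}
    }

    static void filter(SynchronousQueue<Integer> in, SynchronousQueue<Integer> out, int prime) {
        try {
            while (true) {
                int v = in.take();
                if (v % prime != 0) out.put(v);
            }
        } catch (InterruptedException ignored) {}
    }
}
```

Every number has to rendezvous through every live stage, so the scheduling and handoff cost dominates the arithmetic entirely.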

I'd recommend renaming the benchmark to include "coroutine" or "coro" in the name

That's a good suggestion! It is indeed confusing and misleading to only have prime-sieve in the name when the sieve itself is beside the point. I plan to add more detailed descriptions of the problems to the site, especially for those not included in the original benchmarks game site.

aardvark179 commented 3 years ago

@hanabi1224 I've taken a quick look at the Java implementations, and Loom is currently on par with MRI 3.0 when using a single-thread fork join pool, and about twice as fast when using the common fork join pool on my laptop. I've taken a look with perf, and most of the time is being spent in the SynchronousQueue and ForkJoinPool; the actual running and yielding of continuations is pretty good, as is the freezing and thawing of stacks. On my laptop the Kotlin coroutine version is slightly faster, but this may be due to differences in the channel implementation more than the coroutine implementation.

All this is based on a current development build of Loom, which has some improvements compared to the last preview build, but this benchmark is extremely sensitive to the scheduler and virtual thread setup options. Other factors, like the size of the coroutine stack or the garbage generated, might have large effects on the performance of all these coroutine implementations, so I'd take any conclusions with a large pinch of salt.
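For example, the default virtual-thread scheduler can be tuned with system properties in more recent Loom builds (the property names have varied across EA builds, so treat these as illustrative; MyBenchmark is a placeholder class name):

```
$ java -Djdk.virtualThreadScheduler.parallelism=4 \
       -Djdk.virtualThreadScheduler.maxPoolSize=4 \
       MyBenchmark 1000
```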

hanabi1224 commented 3 years ago

@aardvark179 Thanks for the feedback!

most of the time is being spent in the SynchronousQueue

I believe so. I'm more than glad to switch from SynchronousQueue to any better solution if you can suggest one. As I understand it, both SynchronousQueue and the Kotlin channel currently being used are rendezvous queues/channels.

ref: Kotlin coroutine produce, channel, Java SynchronousQueue
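For context, the rendezvous semantics mean a put() parks the producer until a consumer arrives to take(); the queue itself never holds elements, so every handoff forces a synchronization between the two sides. A tiny self-contained demo (plain threads, just to show the semantics):

```java
import java.util.concurrent.SynchronousQueue;

public class RendezvousDemo {
    public static void main(String[] args) throws InterruptedException {
        SynchronousQueue<String> q = new SynchronousQueue<>();

        Thread producer = new Thread(() -> {
            try {
                q.put("hello"); // parks here until someone calls take()
            } catch (InterruptedException ignored) {}
        });
        producer.start();

        Thread.sleep(100);            // let the producer park
        System.out.println(q.size()); // 0: a SynchronousQueue never holds elements
        System.out.println(q.take()); // "hello": the rendezvous completes here
        producer.join();
    }
}
```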

extremely sensitive to the scheduler and virtual thread setup options

I haven't tried any non-default options; curious to see the best setup if you'd like to share your findings.

the last preview build

I can switch to that if you can point me to its public pre-built binaries.

aardvark179 commented 3 years ago

@hanabi1224 Doug Lea is working on a channel implementation for java.util.concurrent (though it probably won't be called channels because NIO channels already exist…), but I don't know exactly how the performance will compare to SynchronousQueue. I haven't really tried digging into the relative performance of the channels and queues, but yes, they should both be behaving as rendezvous queues, and I worry that if we tried optimising the queues too much for a benchmark like this we might break other cases.

I don't have anything definitive for the best way to configure the pool and the thread factory. The common pool seems good, but it seems to have slightly different performance characteristics to the default scheduler or a separately created fork join pool; I'll check with core libs about this. Some aspects might change, as there is some scope for changing and optimising what needs to live on the different objects that make up a thread.

I'm afraid we don't produce pre-built binaries except for EA builds, but I'm sure there will be another one before too long.

hanabi1224 commented 3 years ago

Doug Lea is working on a channel implementation

Awesome!

I'm sure there will be another one before too long

I will keep an eye on it and give it a try as soon as it comes out.

And I will also try to keep the code updated to utilize the best-fit Loom API as it evolves.