oracle / graal

GraalVM compiles Java applications into native executables that start instantly, scale fast, and use fewer compute resources 🚀
https://www.graalvm.org

Inexplicably slow (w.r.t. HotSpot) PRNG methods #5165

Open vigna opened 2 years ago

vigna commented 2 years ago

While benchmarking the speed of some PRNG methods (notably nextInt(int)) with JMH, I found that GraalVM's results were, as expected, significantly better than HotSpot's, in some cases spectacularly so. However, there is a class of methods for which GraalVM is 50% slower than HotSpot.

The JMH benchmarks are available at https://github.com/vigna/dsiutils/tree/master/prngperf

OpenJDK Runtime Environment GraalVM CE 22.3.0-dev (build 17.0.5+3-jvmci-22.3-b04), Linux (Fedora 35), Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz

In the table below I highlighted the 6 tests in which GraalVM (timings on the left) is 50% slower than HotSpot (timings on the right). In all other cases GraalVM is much faster.

The methods involved contain 64-bit arithmetic (in particular, modulus operations).

https://github.com/vigna/dsiutils/blob/924bbb27bc2494f0c4a032cbc6b5e83ff9868628/src/it/unimi/dsi/util/XoShiRo256PlusPlusRandomGenerator.java#L145

Note that the method above is nextLong(long), but nextInt(int) just delegates to that.
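
For context, bounded methods of this kind typically follow the rejection-sampling pattern sketched below, whose rejection test uses a 64-bit remainder (the modulus arithmetic mentioned above). This is a generic sketch written against a LongSupplier, not the exact dsiutils source; see the link above for that.

```java
import java.util.function.LongSupplier;

final class BoundedLongSketch {
    // Generic shape of an unbiased nextLong(bound): mask when the bound is a
    // power of two, otherwise rejection-sample; note the 64-bit % in the loop.
    static long nextLong(final LongSupplier rng, final long n) {
        if (n <= 0) throw new IllegalArgumentException("illegal bound " + n);
        long t = rng.getAsLong();
        final long nMinus1 = n - 1;
        if ((n & nMinus1) == 0) return t & nMinus1; // power of two: just mask
        // Reject candidates from the over-represented top bucket to stay unbiased.
        for (long u = t >>> 1; u + nMinus1 - (t = u % n) < 0; u = rng.getAsLong() >>> 1)
            ;
        return t;
    }
}
```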

The problematic generators use 4 longs of state. Identical methods for generators with 2 words of state do not have this problem:

https://github.com/vigna/dsiutils/blob/924bbb27bc2494f0c4a032cbc6b5e83ff9868628/src/it/unimi/dsi/util/XoRoShiRo128PlusPlusRandomGenerator.java#L138

Please let me know if there are any additional tests I can run to clarify the matter.

BenchmarkRandom.nextDouble                   avgt   25  17.096 ± 0.060  ns/op   Benchmark                                    Mode  Cnt   Score   Error  Units
BenchmarkRandom.nextInt100000                avgt   25   8.727 ± 0.052  ns/op   BenchmarkRandom.nextDouble                   avgt   25  16.993 ± 0.081  ns/op
BenchmarkRandom.nextInt2301                  avgt   25  19.316 ± 0.095  ns/op   BenchmarkRandom.nextInt100000                avgt   25   8.493 ± 0.032  ns/op
BenchmarkRandom.nextLong                     avgt   25  17.062 ± 0.080  ns/op   BenchmarkRandom.nextInt2301                  avgt   25  19.489 ± 0.109  ns/op
BenchmarkSplitMix64.nextDouble               avgt   25   1.935 ± 0.008  ns/op   BenchmarkRandom.nextLong                     avgt   25  17.002 ± 0.074  ns/op
BenchmarkSplitMix64.nextInt100000            avgt   25   2.329 ± 0.010  ns/op   BenchmarkSplitMix64.nextDouble               avgt   25   1.942 ± 0.009  ns/op
BenchmarkSplitMix64.nextInt2301              avgt   25   2.477 ± 0.010  ns/op   BenchmarkSplitMix64.nextInt100000            avgt   25   2.185 ± 0.007  ns/op
BenchmarkSplitMix64.nextLong                 avgt   25   0.979 ± 0.005  ns/op   BenchmarkSplitMix64.nextInt2301              avgt   25   2.409 ± 0.012  ns/op
BenchmarkSplittableRandom.nextDouble         avgt   25   1.937 ± 0.007  ns/op   BenchmarkSplitMix64.nextLong                 avgt   25   1.348 ± 0.006  ns/op
BenchmarkSplittableRandom.nextInt100000      avgt   25   1.904 ± 0.007  ns/op   BenchmarkSplittableRandom.nextDouble         avgt   25   1.936 ± 0.008  ns/op
BenchmarkSplittableRandom.nextInt2301        avgt   25  11.656 ± 0.052  ns/op   BenchmarkSplittableRandom.nextInt100000      avgt   25   2.138 ± 0.009  ns/op
BenchmarkSplittableRandom.nextLong           avgt   25   0.848 ± 0.005  ns/op   BenchmarkSplittableRandom.nextInt2301        avgt   25  13.321 ± 0.053  ns/op
BenchmarkThreadLocalRandom.nextDouble        avgt   25   2.721 ± 0.016  ns/op   BenchmarkSplittableRandom.nextLong           avgt   25   1.348 ± 0.006  ns/op
BenchmarkThreadLocalRandom.nextInt100000     avgt   25   2.164 ± 0.008  ns/op   BenchmarkThreadLocalRandom.nextDouble        avgt   25   3.073 ± 0.010  ns/op
BenchmarkThreadLocalRandom.nextInt2301       avgt   25  12.909 ± 0.144  ns/op   BenchmarkThreadLocalRandom.nextInt100000     avgt   25   2.612 ± 0.010  ns/op
BenchmarkThreadLocalRandom.nextLong          avgt   25   1.056 ± 0.006  ns/op   BenchmarkThreadLocalRandom.nextInt2301       avgt   25  13.446 ± 0.082  ns/op
BenchmarkXoRoShiRo128Plus.nextDouble         avgt   25   1.939 ± 0.010  ns/op   BenchmarkThreadLocalRandom.nextLong          avgt   25   1.458 ± 0.008  ns/op
BenchmarkXoRoShiRo128Plus.nextInt100000      avgt   25   2.128 ± 0.009  ns/op   BenchmarkXoRoShiRo128Plus.nextDouble         avgt   25   1.940 ± 0.008  ns/op
BenchmarkXoRoShiRo128Plus.nextInt2301        avgt   25   2.290 ± 0.010  ns/op   BenchmarkXoRoShiRo128Plus.nextInt100000      avgt   25   2.290 ± 0.010  ns/op
BenchmarkXoRoShiRo128Plus.nextLong           avgt   25   0.909 ± 0.003  ns/op   BenchmarkXoRoShiRo128Plus.nextInt2301        avgt   25   2.324 ± 0.010  ns/op
BenchmarkXoRoShiRo128PlusPlus.nextDouble     avgt   25   1.941 ± 0.009  ns/op   BenchmarkXoRoShiRo128Plus.nextLong           avgt   25   1.801 ± 0.009  ns/op
BenchmarkXoRoShiRo128PlusPlus.nextInt100000  avgt   25   2.280 ± 0.009  ns/op   BenchmarkXoRoShiRo128PlusPlus.nextDouble     avgt   25   1.962 ± 0.007  ns/op
BenchmarkXoRoShiRo128PlusPlus.nextInt2301    avgt   25   2.372 ± 0.012  ns/op   BenchmarkXoRoShiRo128PlusPlus.nextInt100000  avgt   25   2.357 ± 0.004  ns/op
BenchmarkXoRoShiRo128PlusPlus.nextLong       avgt   25   1.056 ± 0.004  ns/op   BenchmarkXoRoShiRo128PlusPlus.nextInt2301    avgt   25   2.442 ± 0.008  ns/op
BenchmarkXoRoShiRo128StarStar.nextDouble     avgt   25   1.936 ± 0.007  ns/op   BenchmarkXoRoShiRo128PlusPlus.nextLong       avgt   25   1.884 ± 0.007  ns/op
BenchmarkXoRoShiRo128StarStar.nextInt100000  avgt   25   2.563 ± 0.017  ns/op   BenchmarkXoRoShiRo128StarStar.nextDouble     avgt   25   2.063 ± 0.009  ns/op
BenchmarkXoRoShiRo128StarStar.nextInt2301    avgt   25   2.601 ± 0.007  ns/op   BenchmarkXoRoShiRo128StarStar.nextInt100000  avgt   25   2.557 ± 0.009  ns/op
BenchmarkXoRoShiRo128StarStar.nextLong       avgt   25   1.122 ± 0.005  ns/op   BenchmarkXoRoShiRo128StarStar.nextInt2301    avgt   25   2.597 ± 0.014  ns/op
BenchmarkXoShiRo256Plus.nextDouble           avgt   25   1.939 ± 0.009  ns/op   BenchmarkXoRoShiRo128StarStar.nextLong       avgt   25   1.891 ± 0.007  ns/op
BenchmarkXoShiRo256Plus.nextInt100000        avgt   25   3.428 ± 0.014  ns/op   BenchmarkXoShiRo256Plus.nextDouble           avgt   25   1.994 ± 0.009  ns/op*
BenchmarkXoShiRo256Plus.nextInt2301          avgt   25   3.642 ± 0.144  ns/op   BenchmarkXoShiRo256Plus.nextInt100000        avgt   25   2.473 ± 0.006  ns/op*
BenchmarkXoShiRo256Plus.nextLong             avgt   25   1.608 ± 0.009  ns/op   BenchmarkXoShiRo256Plus.nextInt2301          avgt   25   2.597 ± 0.013  ns/op
BenchmarkXoShiRo256PlusPlus.nextDouble       avgt   25   1.943 ± 0.010  ns/op   BenchmarkXoShiRo256Plus.nextLong             avgt   25   1.953 ± 0.008  ns/op
BenchmarkXoShiRo256PlusPlus.nextInt100000    avgt   25   3.503 ± 0.021  ns/op   BenchmarkXoShiRo256PlusPlus.nextDouble       avgt   25   2.079 ± 0.010  ns/op*
BenchmarkXoShiRo256PlusPlus.nextInt2301      avgt   25   3.874 ± 0.026  ns/op   BenchmarkXoShiRo256PlusPlus.nextInt100000    avgt   25   2.580 ± 0.010  ns/op*
BenchmarkXoShiRo256PlusPlus.nextLong         avgt   25   1.619 ± 0.010  ns/op   BenchmarkXoShiRo256PlusPlus.nextInt2301      avgt   25   2.662 ± 0.008  ns/op
BenchmarkXoShiRo256StarStar.nextDouble       avgt   25   1.941 ± 0.013  ns/op   BenchmarkXoShiRo256PlusPlus.nextLong         avgt   25   1.919 ± 0.010  ns/op
BenchmarkXoShiRo256StarStar.nextInt100000    avgt   25   3.642 ± 0.016  ns/op   BenchmarkXoShiRo256StarStar.nextDouble       avgt   25   2.287 ± 0.009  ns/op*
BenchmarkXoShiRo256StarStar.nextInt2301      avgt   25   3.759 ± 0.020  ns/op   BenchmarkXoShiRo256StarStar.nextInt100000    avgt   25   2.675 ± 0.015  ns/op*
BenchmarkXoShiRo256StarStar.nextLong         avgt   25   1.648 ± 0.008  ns/op   BenchmarkXoShiRo256StarStar.nextInt2301      avgt   25   2.641 ± 0.010  ns/op
BenchmarkXorShift1024StarPhi.nextDouble      avgt   25   1.942 ± 0.008  ns/op   BenchmarkXoShiRo256StarStar.nextLong         avgt   25   2.042 ± 0.008  ns/op
BenchmarkXorShift1024StarPhi.nextInt100000   avgt   25   2.466 ± 0.006  ns/op   BenchmarkXorShift1024StarPhi.nextDouble      avgt   25   2.261 ± 0.008  ns/op
BenchmarkXorShift1024StarPhi.nextInt2301     avgt   25   2.610 ± 0.010  ns/op   BenchmarkXorShift1024StarPhi.nextInt100000   avgt   25   3.286 ± 0.019  ns/op
BenchmarkXorShift1024StarPhi.nextLong        avgt   25   1.337 ± 0.007  ns/op   BenchmarkXorShift1024StarPhi.nextInt2301     avgt   25   3.433 ± 0.012  ns/op
dougxc commented 2 years ago

The columns in the table are unaligned. I think this shows the difference more clearly:

                                                         GraalVM         HotSpot
Benchmark                                    Mode  Cnt   Score   Error   Score   Error  Units
BenchmarkRandom.nextDouble                   avgt   25  17.096 ± 0.060  16.993 ± 0.081  ns/op    
BenchmarkRandom.nextInt100000                avgt   25   8.727 ± 0.052   8.493 ± 0.032  ns/op    
BenchmarkRandom.nextInt2301                  avgt   25  19.316 ± 0.095  19.489 ± 0.109  ns/op    
BenchmarkRandom.nextLong                     avgt   25  17.062 ± 0.080  17.002 ± 0.074  ns/op    
BenchmarkSplitMix64.nextDouble               avgt   25   1.935 ± 0.008   1.942 ± 0.009  ns/op    
BenchmarkSplitMix64.nextInt100000            avgt   25   2.329 ± 0.010   2.185 ± 0.007  ns/op    
BenchmarkSplitMix64.nextInt2301              avgt   25   2.477 ± 0.010   2.409 ± 0.012  ns/op    
BenchmarkSplitMix64.nextLong                 avgt   25   0.979 ± 0.005   1.348 ± 0.006  ns/op    
BenchmarkSplittableRandom.nextDouble         avgt   25   1.937 ± 0.007   1.936 ± 0.008  ns/op    
BenchmarkSplittableRandom.nextInt100000      avgt   25   1.904 ± 0.007   2.138 ± 0.009  ns/op    
BenchmarkSplittableRandom.nextInt2301        avgt   25  11.656 ± 0.052  13.321 ± 0.053  ns/op    
BenchmarkSplittableRandom.nextLong           avgt   25   0.848 ± 0.005   1.348 ± 0.006  ns/op    
BenchmarkThreadLocalRandom.nextDouble        avgt   25   2.721 ± 0.016   3.073 ± 0.010  ns/op    
BenchmarkThreadLocalRandom.nextInt100000     avgt   25   2.164 ± 0.008   2.612 ± 0.010  ns/op    
BenchmarkThreadLocalRandom.nextInt2301       avgt   25  12.909 ± 0.144  13.446 ± 0.082  ns/op    
BenchmarkThreadLocalRandom.nextLong          avgt   25   1.056 ± 0.006   1.458 ± 0.008  ns/op    
BenchmarkXoRoShiRo128Plus.nextDouble         avgt   25   1.939 ± 0.010   1.940 ± 0.008  ns/op    
BenchmarkXoRoShiRo128Plus.nextInt100000      avgt   25   2.128 ± 0.009   2.290 ± 0.010  ns/op    
BenchmarkXoRoShiRo128Plus.nextInt2301        avgt   25   2.290 ± 0.010   2.324 ± 0.010  ns/op    
BenchmarkXoRoShiRo128Plus.nextLong           avgt   25   0.909 ± 0.003   1.801 ± 0.009  ns/op    
BenchmarkXoRoShiRo128PlusPlus.nextDouble     avgt   25   1.941 ± 0.009   1.962 ± 0.007  ns/op    
BenchmarkXoRoShiRo128PlusPlus.nextInt100000  avgt   25   2.280 ± 0.009   2.357 ± 0.004  ns/op    
BenchmarkXoRoShiRo128PlusPlus.nextInt2301    avgt   25   2.372 ± 0.012   2.442 ± 0.008  ns/op    
BenchmarkXoRoShiRo128PlusPlus.nextLong       avgt   25   1.056 ± 0.004   1.884 ± 0.007  ns/op    
BenchmarkXoRoShiRo128StarStar.nextDouble     avgt   25   1.936 ± 0.007   2.063 ± 0.009  ns/op    
BenchmarkXoRoShiRo128StarStar.nextInt100000  avgt   25   2.563 ± 0.017   2.557 ± 0.009  ns/op    
BenchmarkXoRoShiRo128StarStar.nextInt2301    avgt   25   2.601 ± 0.007   2.597 ± 0.014  ns/op    
BenchmarkXoRoShiRo128StarStar.nextLong       avgt   25   1.122 ± 0.005   1.891 ± 0.007  ns/op    
BenchmarkXoShiRo256Plus.nextDouble           avgt   25   1.939 ± 0.009   1.994 ± 0.009  ns/op   
BenchmarkXoShiRo256Plus.nextInt100000        avgt   25   3.428 ± 0.014   2.473 ± 0.006  ns/op <--- GraalVM slower
BenchmarkXoShiRo256Plus.nextInt2301          avgt   25   3.642 ± 0.144   2.597 ± 0.013  ns/op <--- GraalVM slower
BenchmarkXoShiRo256Plus.nextLong             avgt   25   1.608 ± 0.009   1.953 ± 0.008  ns/op    
BenchmarkXoShiRo256PlusPlus.nextDouble       avgt   25   1.943 ± 0.010   2.079 ± 0.010  ns/op
BenchmarkXoShiRo256PlusPlus.nextInt100000    avgt   25   3.503 ± 0.021   2.580 ± 0.010  ns/op <--- GraalVM slower
BenchmarkXoShiRo256PlusPlus.nextInt2301      avgt   25   3.874 ± 0.026   2.662 ± 0.008  ns/op <--- GraalVM slower
BenchmarkXoShiRo256PlusPlus.nextLong         avgt   25   1.619 ± 0.010   1.919 ± 0.010  ns/op    
BenchmarkXoShiRo256StarStar.nextDouble       avgt   25   1.941 ± 0.013   2.287 ± 0.009  ns/op  
BenchmarkXoShiRo256StarStar.nextInt100000    avgt   25   3.642 ± 0.016   2.675 ± 0.015  ns/op <--- GraalVM slower
BenchmarkXoShiRo256StarStar.nextInt2301      avgt   25   3.759 ± 0.020   2.641 ± 0.010  ns/op <--- GraalVM slower
BenchmarkXoShiRo256StarStar.nextLong         avgt   25   1.648 ± 0.008   2.042 ± 0.008  ns/op    
BenchmarkXorShift1024StarPhi.nextDouble      avgt   25   1.942 ± 0.008   2.261 ± 0.008  ns/op    
BenchmarkXorShift1024StarPhi.nextInt100000   avgt   25   2.466 ± 0.006   3.286 ± 0.019  ns/op    
BenchmarkXorShift1024StarPhi.nextInt2301     avgt   25   2.610 ± 0.010   3.433 ± 0.012  ns/op    
dougxc commented 2 years ago

@mur47x111, can you please look into this? I suspect missing intrinsics.

vigna commented 2 years ago

> The columns in the table are unaligned. I think this shows the difference more clearly:

You're totally right. It was an embarrassing mistake.

vigna commented 1 year ago

There's another really weird thing happening here: essentially all nextDouble() methods have the same timing. This is nonsensical, as the base nextLong() method timings vary significantly.

mur47x111 commented 1 year ago

Sorry for not replying to you earlier. These benchmarks boil down to duplicated memory stores in the nextLong() implementations, e.g. XoShiRo256StarStarRandom.java#L102-L109. We are working on a more general read/write elimination optimization. Meanwhile, if you want to work around this regression, you can assign the first two s2/s3 computations to local variables.
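
A minimal sketch of what that workaround might look like, assuming the usual xoshiro256** update with four long state fields (an illustration, not the actual dsiutils code):

```java
// Hypothetical xoshiro256**-style generator; field names s0..s3 are assumed.
final class XoShiRo256StarStarSketch {
    private long s0, s1, s2, s3; // must be seeded, not all zero

    public long nextLong() {
        final long result = Long.rotateLeft(s1 * 5, 7) * 9;
        final long t = s1 << 17;
        // Keep the intermediate s2/s3 values in locals so each state field is
        // written back exactly once, avoiding the duplicated stores mentioned above.
        final long t2 = s2 ^ s0;
        final long t3 = s3 ^ s1;
        s1 ^= t2;
        s0 ^= t3;
        s2 = t2 ^ t;
        s3 = Long.rotateLeft(t3, 45);
        return result;
    }
}
```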

vigna commented 1 year ago

Thanks, I'll do that. Any idea why all the nextDouble() methods have the same timing, in spite of major differences between PRNGs?

mur47x111 commented 1 year ago

In some of the random implementations (I have not verified all of them), nextDouble() is implemented in terms of nextLong().

vigna commented 1 year ago

Yes, that's my point. The timings of nextLong() vary a lot from PRNG to PRNG, but as soon as I benchmark nextDouble() (which just calls nextLong(), shifts, and multiplies by 2^-53) all the timings are the same. It does not make any sense...
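
For reference, the conversion just described boils down to the standard 53-bit pattern, roughly as in the sketch below (the exact dsiutils code may differ slightly):

```java
import java.util.function.LongSupplier;

final class NextDoubleSketch {
    // Shift-and-scale conversion described above: keep the top 53 bits of a
    // random long and scale them by 2^-53 into [0, 1).
    static double nextDouble(final LongSupplier rng) {
        return (rng.getAsLong() >>> 11) * 0x1.0p-53;
    }
}
```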

mur47x111 commented 1 year ago

Ah, OK, I misinterpreted your previous comment. Then either our existing read/write elimination can handle that, or HotSpot C2's read/write elimination cannot. Try using -XX:CompileCommand=dontinline,package/to/XoShiRo256StarStarRandom.nextLong when evaluating nextDouble() and see whether HotSpot C2 reports significantly different scores.

vigna commented 1 year ago

But the timings are the same even for radically different PRNGs for which there is no read/write-elimination problem, such as SplitMix64 (whose state is just one long). Almost all nextDouble() timings are 1.9 ns.

So I should run the benchmark with the standard JVM and that option, right?

mur47x111 commented 1 year ago

I will take a look later at the emitted assembly.

> So I should run the benchmark with the standard JVM and that option, right?

Yes, and replace package/to with the actual package name or *.
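
One way to pass that flag through JMH programmatically might look like the sketch below; the benchmark selector is taken from the tables above, and package/to still has to be replaced as noted.

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class DontInlineNextLongRunner {
    public static void main(String[] args) throws RunnerException {
        Options opts = new OptionsBuilder()
                // Benchmark selector taken from the tables above; adjust as needed.
                .include("BenchmarkXoShiRo256StarStar.nextDouble")
                // Replace package/to with the actual package name or *, as noted above.
                .jvmArgsAppend("-XX:CompileCommand=dontinline,package/to/XoShiRo256StarStarRandom.nextLong")
                .build();
        new Runner(opts).run();
    }
}
```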

vigna commented 1 year ago

Did you have any chance to look at the assembly?

mur47x111 commented 1 year ago

I did not find anything wrong in the assembly. The nextLong workloads benchmark 2 to 6 memory accesses, and the nextDouble workloads benchmark the same number of memory accesses plus several floating-point instructions. My theory is that the floating-point instructions hide the latency of the memory accesses through pipelining. Try making the nextDouble workloads run only one operation per invocation.
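
A single-operation-per-invocation variant might look like the sketch below (hypothetical; the actual prngperf benchmarks may batch several calls per invocation, and SplittableRandom stands in here for the generators under test).

```java
import java.util.SplittableRandom;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class SingleOpNextDouble {
    private final SplittableRandom rng = new SplittableRandom(0);

    @Benchmark
    public double nextDouble() {
        // Exactly one PRNG operation per benchmark invocation; returning the
        // value prevents dead-code elimination.
        return rng.nextDouble();
    }
}
```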

vigna commented 1 year ago

> Sorry for not replying to you earlier. These benchmarks boil down to duplicated memory stores in the nextLong() implementations, e.g. XoShiRo256StarStarRandom.java#L102-L109. We are working on a more general read/write elimination optimization. Meanwhile, if you want to work around this regression, you can assign the first two s2/s3 computations to local variables.

BTW, should I try with the latest version and expect something different?

mur47x111 commented 1 year ago

The optimization is still WIP.