Open vigna opened 2 years ago
The columns in the table are unaligned. I think this shows the difference more clearly:
```
                                                          GraalVM            HotSpot
Benchmark                                   Mode  Cnt   Score   Error    Score   Error   Units
BenchmarkRandom.nextDouble                  avgt   25  17.096 ± 0.060   16.993 ± 0.081   ns/op
BenchmarkRandom.nextInt100000               avgt   25   8.727 ± 0.052    8.493 ± 0.032   ns/op
BenchmarkRandom.nextInt2301                 avgt   25  19.316 ± 0.095   19.489 ± 0.109   ns/op
BenchmarkRandom.nextLong                    avgt   25  17.062 ± 0.080   17.002 ± 0.074   ns/op
BenchmarkSplitMix64.nextDouble              avgt   25   1.935 ± 0.008    1.942 ± 0.009   ns/op
BenchmarkSplitMix64.nextInt100000           avgt   25   2.329 ± 0.010    2.185 ± 0.007   ns/op
BenchmarkSplitMix64.nextInt2301             avgt   25   2.477 ± 0.010    2.409 ± 0.012   ns/op
BenchmarkSplitMix64.nextLong                avgt   25   0.979 ± 0.005    1.348 ± 0.006   ns/op
BenchmarkSplittableRandom.nextDouble        avgt   25   1.937 ± 0.007    1.936 ± 0.008   ns/op
BenchmarkSplittableRandom.nextInt100000     avgt   25   1.904 ± 0.007    2.138 ± 0.009   ns/op
BenchmarkSplittableRandom.nextInt2301       avgt   25  11.656 ± 0.052   13.321 ± 0.053   ns/op
BenchmarkSplittableRandom.nextLong          avgt   25   0.848 ± 0.005    1.348 ± 0.006   ns/op
BenchmarkThreadLocalRandom.nextDouble       avgt   25   2.721 ± 0.016    3.073 ± 0.010   ns/op
BenchmarkThreadLocalRandom.nextInt100000    avgt   25   2.164 ± 0.008    2.612 ± 0.010   ns/op
BenchmarkThreadLocalRandom.nextInt2301      avgt   25  12.909 ± 0.144   13.446 ± 0.082   ns/op
BenchmarkThreadLocalRandom.nextLong         avgt   25   1.056 ± 0.006    1.458 ± 0.008   ns/op
BenchmarkXoRoShiRo128Plus.nextDouble        avgt   25   1.939 ± 0.010    1.940 ± 0.008   ns/op
BenchmarkXoRoShiRo128Plus.nextInt100000     avgt   25   2.128 ± 0.009    2.290 ± 0.010   ns/op
BenchmarkXoRoShiRo128Plus.nextInt2301       avgt   25   2.290 ± 0.010    2.324 ± 0.010   ns/op
BenchmarkXoRoShiRo128Plus.nextLong          avgt   25   0.909 ± 0.003    1.801 ± 0.009   ns/op
BenchmarkXoRoShiRo128PlusPlus.nextDouble    avgt   25   1.941 ± 0.009    1.962 ± 0.007   ns/op
BenchmarkXoRoShiRo128PlusPlus.nextInt100000 avgt   25   2.280 ± 0.009    2.357 ± 0.004   ns/op
BenchmarkXoRoShiRo128PlusPlus.nextInt2301   avgt   25   2.372 ± 0.012    2.442 ± 0.008   ns/op
BenchmarkXoRoShiRo128PlusPlus.nextLong      avgt   25   1.056 ± 0.004    1.884 ± 0.007   ns/op
BenchmarkXoRoShiRo128StarStar.nextDouble    avgt   25   1.936 ± 0.007    2.063 ± 0.009   ns/op
BenchmarkXoRoShiRo128StarStar.nextInt100000 avgt   25   2.563 ± 0.017    2.557 ± 0.009   ns/op
BenchmarkXoRoShiRo128StarStar.nextInt2301   avgt   25   2.601 ± 0.007    2.597 ± 0.014   ns/op
BenchmarkXoRoShiRo128StarStar.nextLong      avgt   25   1.122 ± 0.005    1.891 ± 0.007   ns/op
BenchmarkXoShiRo256Plus.nextDouble          avgt   25   1.939 ± 0.009    1.994 ± 0.009   ns/op
BenchmarkXoShiRo256Plus.nextInt100000       avgt   25   3.428 ± 0.014    2.473 ± 0.006   ns/op  <--- GraalVM slower
BenchmarkXoShiRo256Plus.nextInt2301         avgt   25   3.642 ± 0.144    2.597 ± 0.013   ns/op  <--- GraalVM slower
BenchmarkXoShiRo256Plus.nextLong            avgt   25   1.608 ± 0.009    1.953 ± 0.008   ns/op
BenchmarkXoShiRo256PlusPlus.nextDouble      avgt   25   1.943 ± 0.010    2.079 ± 0.010   ns/op
BenchmarkXoShiRo256PlusPlus.nextInt100000   avgt   25   3.503 ± 0.021    2.580 ± 0.010   ns/op  <--- GraalVM slower
BenchmarkXoShiRo256PlusPlus.nextInt2301     avgt   25   3.874 ± 0.026    2.662 ± 0.008   ns/op  <--- GraalVM slower
BenchmarkXoShiRo256PlusPlus.nextLong        avgt   25   1.619 ± 0.010    1.919 ± 0.010   ns/op
BenchmarkXoShiRo256StarStar.nextDouble      avgt   25   1.941 ± 0.013    2.287 ± 0.009   ns/op
BenchmarkXoShiRo256StarStar.nextInt100000   avgt   25   3.642 ± 0.016    2.675 ± 0.015   ns/op  <--- GraalVM slower
BenchmarkXoShiRo256StarStar.nextInt2301     avgt   25   3.759 ± 0.020    2.641 ± 0.010   ns/op  <--- GraalVM slower
BenchmarkXoShiRo256StarStar.nextLong        avgt   25   1.648 ± 0.008    2.042 ± 0.008   ns/op
BenchmarkXorShift1024StarPhi.nextDouble     avgt   25   1.942 ± 0.008    2.261 ± 0.008   ns/op
BenchmarkXorShift1024StarPhi.nextInt100000  avgt   25   2.466 ± 0.006    3.286 ± 0.019   ns/op
BenchmarkXorShift1024StarPhi.nextInt2301    avgt   25   2.610 ± 0.010    3.433 ± 0.012   ns/op
```
@mur47x111, can you please look into this? I suspect missing intrinsics.
The columns in the table are unaligned. I think this shows the difference more clearly:
You're totally right. It was an embarrassing mistake.
There's another really weird thing happening here: essentially all nextDouble() methods have the same timing. This is nonsensical, as the base nextLong() method timings vary significantly.
Sorry for not replying to you earlier. These benchmarks boil down to duplicated memory stores in the nextLong() implementations, e.g. XoShiRo256StarStarRandom.java#L102-L109. We are working on a more general read/write elimination optimization. Meanwhile, if you want to bypass this regression, you can assign the first two s2/s3 computations to local variables.
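For illustration, a minimal sketch of that rewrite, assuming the standard xoshiro256** update sequence (field names illustrative, not necessarily the exact dsiutils code):

```java
public long nextLong() {
	final long t = s1 << 17;
	final long result = Long.rotateLeft(s1 * 5, 7) * 9;
	// Compute the new s2/s3 values into locals first, so each state
	// field is stored exactly once instead of twice per call.
	final long s2 = this.s2 ^ s0;
	final long s3 = this.s3 ^ s1;
	s1 ^= s2;
	s0 ^= s3;
	this.s2 = s2 ^ t;
	this.s3 = Long.rotateLeft(s3, 45);
	return result;
}
```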
Thanks, I'll do that. Any idea why all the nextDouble() methods have the same timing, in spite of major differences between PRNGs?
In some of the random implementations (I have not verified all of them), nextDouble is implemented via nextLong.
Yes, that's my point. The timings of nextLong() are very different from PRNG to PRNG, but as soon as I benchmark nextDouble() (which just calls nextLong(), shifts, and multiplies by 2^-53) all timings are the same. It does not make any sense...
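For reference, the construction I mean is the usual 53-bit one; a sketch, not necessarily the exact benchmark code:

```java
// Keep the top 53 bits of a random long and scale by 2^-53,
// giving a uniform double in [0, 1).
public double nextDouble() {
	return (nextLong() >>> 11) * 0x1.0p-53;
}
```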
Ah, OK, I misinterpreted your previous comment. Then either our existing read/write elimination can handle that, or HotSpot C2's read/write elimination cannot handle it. Try using -XX:CompileCommand=dontinline,package/to/XoShiRo256StarStarRandom.nextLong when evaluating nextDouble and see if HotSpot C2 returns significantly different scores.
But the timings are the same even with radically different PRNGs for which there is no read/write elimination problem, such as SplitMix64 (whose state is just one long). Almost all nextDouble() timings are 1.9 ns.
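For context, the reference SplitMix64 step is a single state store followed by pure arithmetic, so read/write elimination should be irrelevant here:

```java
private long state;

// Reference SplitMix64: one additive state update (a single store)
// and three xor-shift/multiply mixing steps, no other memory traffic.
public long nextLong() {
	long z = state += 0x9E3779B97F4A7C15L;
	z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
	z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
	return z ^ (z >>> 31);
}
```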
So I should run the benchmark with the standard JVM and that option, right?
I will take a look later at the emitted assembly.
So I should run the benchmark with the standard JVM and that option, right?
Yes, and replace package/to with the actual package name or *.
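For instance, the flag could be appended through the JMH runner API; the include pattern below is hypothetical:

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class DontInlineRunner {
	public static void main(final String[] args) throws RunnerException {
		final Options opt = new OptionsBuilder()
				.include("BenchmarkXoShiRo256StarStar") // hypothetical benchmark pattern
				// Keep nextLong() out of line so its cost stays visible inside
				// nextDouble(); use a fully qualified class name instead of *
				// to restrict the command to one generator.
				.jvmArgsAppend("-XX:CompileCommand=dontinline,*.nextLong")
				.build();
		new Runner(opt).run();
	}
}
```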
Did you have any chance to look at the assembly?
I did not find anything wrong in the assembly. The nextLong workloads are benchmarking 2-6 memory accesses, and the nextDouble workloads are benchmarking the same number of memory accesses plus several fp instructions. My theory is that the fp instructions hide the latencies of the memory accesses through pipelining. Try making the nextDouble workloads run only one operation per invocation.
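A sketch of such a single-operation workload, assuming the dsiutils generator class (with its no-argument constructor) and a typical JMH setup:

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

import it.unimi.dsi.util.XoShiRo256StarStarRandomGenerator;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class SingleOpBenchmark {
	private final XoShiRo256StarStarRandomGenerator prng = new XoShiRo256StarStarRandomGenerator();

	// Exactly one call per invocation: no manual inner loop and no
	// @OperationsPerInvocation batching, so consecutive calls cannot
	// overlap in the pipeline inside the measured method.
	@Benchmark
	public double nextDoubleSingle() {
		return prng.nextDouble();
	}
}
```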
Sorry for not replying to you earlier. These benchmarks boil down to duplicated memory stores in the nextLong() implementations, e.g. XoShiRo256StarStarRandom.java#L102-L109. We are working on a more general read/write elimination optimization. Meanwhile, if you want to bypass this regression, you can assign the first two s2/s3 computations to local variables.
BTW, should I try with the latest version and expect something different?
the optimization is still WIP
While testing some PRNG methods for speed (notably nextInt(int)) using JMH, the results for GraalVM were, as expected, significantly better than HotSpot's, in some cases spectacularly so. However, there is a class of methods for which GraalVM is 50% slower than HotSpot.
The JMH benchmarks are available at https://github.com/vigna/dsiutils/tree/master/prngperf
OpenJDK Runtime Environment GraalVM CE 22.3.0-dev (build 17.0.5+3-jvmci-22.3-b04)
OS: Linux Fedora 35
CPU: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
In the table above I highlighted the six tests in which GraalVM (timings on the left) is 50% slower than HotSpot (timings on the right). In all other cases GraalVM is much faster.
The methods involved contain 64-bit arithmetic (in particular, modulus operations).
https://github.com/vigna/dsiutils/blob/924bbb27bc2494f0c4a032cbc6b5e83ff9868628/src/it/unimi/dsi/util/XoShiRo256PlusPlusRandomGenerator.java#L145
Note that the method above is nextLong(long), but nextInt(int) just delegates to that.
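For readers following along, the pattern in question is roughly the classic rejection method built around a 64-bit remainder; a sketch, not the verbatim dsiutils code:

```java
// Uniform long in [0, n): power-of-two bounds are masked; everything else
// goes through a rejection loop whose core is a 64-bit % operation.
public long nextLong(final long n) {
	if (n <= 0) throw new IllegalArgumentException("illegal bound " + n);
	long t = nextLong();
	final long nMinus1 = n - 1;
	if ((n & nMinus1) == 0) return t & nMinus1; // power of two: just mask
	// Reject candidates that would make the remainder biased.
	for (long u = t >>> 1; u + nMinus1 - (t = u % n) < 0; u = nextLong() >>> 1);
	return t;
}

// nextInt(int) just narrows the bounded long result.
public int nextInt(final int n) {
	return (int) nextLong(n);
}
```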
The problematic generators use 4 longs of state. Identical methods for generators with 2 words of state do not have this problem:
https://github.com/vigna/dsiutils/blob/924bbb27bc2494f0c4a032cbc6b5e83ff9868628/src/it/unimi/dsi/util/XoRoShiRo128PlusPlusRandomGenerator.java#L138
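For comparison, the standard xoroshiro128++ step writes each of its two state words exactly once, so there are no duplicated stores to eliminate (again a sketch, not the verbatim source):

```java
private long s0, s1;

// Reference xoroshiro128++: both state words are read once into locals
// and stored back exactly once.
public long nextLong() {
	final long t0 = s0;
	long t1 = s1;
	final long result = Long.rotateLeft(t0 + t1, 17) + t0;
	t1 ^= t0;
	s0 = Long.rotateLeft(t0, 49) ^ t1 ^ (t1 << 21);
	s1 = Long.rotateLeft(t1, 28);
	return result;
}
```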
Please let me know if there are any additional tests I can do to clarify the matter.