Open ChongHan opened 2 months ago
@ChongHan This seems to be JIT related indeed. Here is what I got running this benchmark on my M2 laptop:
-XX:TieredStopAtLevel=0 -Dagrona.disable.bounds.checks=true
===========================================================
Benchmark                                         Mode  Cnt     Score     Error  Units
ByteBufferBenchmark.writeAgronaByteBufferAligned  avgt    3  3952.083 ± 416.133  ms/op
ByteBufferBenchmark.writeDirectBufferAligned      avgt    3  9620.034 ± 168.978  ms/op

-XX:TieredStopAtLevel=1 -Dagrona.disable.bounds.checks=true
===========================================================
Benchmark                                         Mode  Cnt     Score     Error  Units
ByteBufferBenchmark.writeAgronaByteBufferAligned  avgt    3    19.853 ±   5.872  ms/op
ByteBufferBenchmark.writeDirectBufferAligned      avgt    3   103.269 ±   2.050  ms/op

-XX:TieredStopAtLevel=2 -Dagrona.disable.bounds.checks=true
===========================================================
Benchmark                                         Mode  Cnt     Score     Error  Units
ByteBufferBenchmark.writeAgronaByteBufferAligned  avgt    3    65.892 ±   4.035  ms/op
ByteBufferBenchmark.writeDirectBufferAligned      avgt    3   143.316 ±  35.288  ms/op

-XX:TieredStopAtLevel=3 -Dagrona.disable.bounds.checks=true
===========================================================
Benchmark                                         Mode  Cnt     Score     Error  Units
ByteBufferBenchmark.writeAgronaByteBufferAligned  avgt    3   129.152 ±  14.025  ms/op
ByteBufferBenchmark.writeDirectBufferAligned      avgt    3   499.864 ±  35.398  ms/op

-XX:TieredStopAtLevel=4 -Dagrona.disable.bounds.checks=true
===========================================================
Benchmark                                         Mode  Cnt     Score     Error  Units
ByteBufferBenchmark.writeAgronaByteBufferAligned  avgt    3    14.774 ±   0.359  ms/op
ByteBufferBenchmark.writeDirectBufferAligned      avgt    3     5.287 ±   0.392  ms/op

As you can see, at all tiers prior to tier 4 the Agrona buffer is faster than the JDK buffer. Only at tier 4 (C2) does the JDK buffer come out ahead (5.287 vs. 14.774 ms/op).
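To make the comparison easier to read, here is a small sketch that divides the JDK buffer score by the Agrona buffer score for each level. The numbers are copied from the tables above. The class and method names are made up for this illustration.

```java
// Scores in ms/op copied from the JMH tables above; lower is better.
// Class and method names are hypothetical, chosen only for this illustration.
final class TierRatios {
    // Array index = value passed to -XX:TieredStopAtLevel.
    static final double[] AGRONA = {3952.083, 19.853, 65.892, 129.152, 14.774};
    static final double[] DIRECT = {9620.034, 103.269, 143.316, 499.864, 5.287};

    /** A ratio above 1 means the Agrona buffer was faster at that level. */
    static double directOverAgrona(int level) {
        return DIRECT[level] / AGRONA[level];
    }

    public static void main(String[] args) {
        for (int level = 0; level < AGRONA.length; level++) {
            System.out.printf("level %d: direct/agrona = %.2f%n", level, directOverAgrona(level));
        }
    }
}
```

The ratios come out at roughly 2.4x, 5.2x, 2.2x and 3.9x in favour of the Agrona buffer for levels 0 to 3, and about 0.36x at level 4, where the JDK buffer is ahead. With only 3 measurement iterations per run, these are rough indications.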
I had a look at the generated assembly. If the write to the UnsafeBuffer that points to a regular (on-heap) ByteBuffer is done first, the code isn't very well optimized: the loop processes 1 item at a time (no loop unrolling).
If the unrelatedAgronaBuffer section is removed, the for-loop is unrolled by a factor of 16, which gives a lot more headroom thanks to the superscalar nature of modern processors. It also reduces the overhead of loop control.
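For readers unfamiliar with loop unrolling, the following is a hand-written sketch of the transformation. It is not the code C2 emits. It uses a factor of 4 instead of 16 to keep it short, and it writes into a plain long[] rather than an UnsafeBuffer. All names are hypothetical.

```java
// Illustration only: a manual version of the loop unrolling the JIT can apply.
// Names are hypothetical; the factor is 4 here, whereas the assembly showed 16.
final class UnrollSketch {
    /** Baseline: one store and one loop-control check per element. */
    static void fillSimple(long[] dst, long value) {
        for (int i = 0; i < dst.length; i++) {
            dst[i] = value + i;
        }
    }

    /** Unrolled by 4: four independent stores per loop-control check. */
    static void fillUnrolled4(long[] dst, long value) {
        int i = 0;
        final int limit = dst.length - 3; // last index where a block of 4 still fits
        for (; i < limit; i += 4) {
            dst[i] = value + i;
            dst[i + 1] = value + i + 1;
            dst[i + 2] = value + i + 2;
            dst[i + 3] = value + i + 3;
        }
        for (; i < dst.length; i++) { // remainder loop for the trailing 0 to 3 elements
            dst[i] = value + i;
        }
    }
}
```

The four stores in the unrolled body do not depend on each other, so a superscalar core can issue several of them per cycle, and the compare-and-branch runs once per four elements instead of once per element.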
My guess is that the JIT reduces its optimizations at the call site if there are 2 types of ByteBuffer to deal with (an on-heap and a direct ByteBuffer).
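The situation this guess describes can be sketched as one shared write loop that is first fed an on-heap buffer and then a direct one. The benchmark source is not reproduced in this thread, so this sketch uses plain JDK ByteBuffers and invented names. It shows only the shape of the scenario and is not expected to reproduce the measured slowdown.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: one loop body that sees two kinds of backing memory.
// Assumption: plain JDK ByteBuffers stand in for the UnsafeBuffers in the issue.
final class SharedWriteLoop {
    /** Single loop shared by all callers, so the JIT profiles every buffer kind passed in. */
    static void writeLongs(ByteBuffer buffer, long seed) {
        for (int offset = 0; offset + Long.BYTES <= buffer.capacity(); offset += Long.BYTES) {
            buffer.putLong(offset, seed + offset); // absolute put, does not move the position
        }
    }

    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.allocate(1024).order(ByteOrder.nativeOrder());
        ByteBuffer direct = ByteBuffer.allocateDirect(1024).order(ByteOrder.nativeOrder());

        writeLongs(heap, 1L);   // "setup" write: the loop first sees an on-heap buffer
        writeLongs(direct, 1L); // "measured" write: the same loop now sees a direct buffer

        System.out.println("heap isDirect=" + heap.isDirect() + ", direct isDirect=" + direct.isDirect());
    }
}
```

Whether a mixed profile like this actually changes what C2 emits can only be confirmed by comparing the generated assembly with and without the first call.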
Dear maintainer,
I'm encountering unexpected behavior while benchmarking Agrona's UnsafeBuffer. Writing to an unrelated UnsafeBuffer in the setup phase seems to cause a substantial (~2x) slowdown in the subsequent benchmark of writing to a separate, aligned UnsafeBuffer.
Benchmark
I've set up a JMH benchmark comparing a direct ByteBuffer to an aligned UnsafeBuffer:
Result
Observed Behavior
When the write to unrelatedAgronaBuffer is included in the setup, the benchmark for writeAgronaByteBufferAligned() is approximately 2x slower than when it is excluded. Writing to directBufferAligned does not seem to be affected.
Assembly
I've captured assembly together with JMH - https://gist.github.com/ChongHan/890ca81a88e6c275d022c7ec6351c0b8
Environment
Agrona: 1.20.0
JMH: 1.37
Java: Azul Zulu 21.0.4
CPU: AMD Ryzen 5 3600
OS: Ubuntu 22.04
Kernel: 5.15
Additional Information
I've also tested this benchmark using GraalVM CE 21, and the performance degradation caused by writing to the unrelated UnsafeBuffer appears to be less pronounced compared to the standard JDK. I'd like to know if this could be a known issue or limitation of the JIT compiler, specifically in relation to HotSpot. Any insights or known behavior related to this would be appreciated.