Writing to an Unrelated UnsafeBuffer Significantly Impacts Benchmark Performance

ChongHan commented 2 months ago

Dear maintainer,

I'm encountering unexpected behavior while benchmarking Agrona's UnsafeBuffer. Writing to an unrelated UnsafeBuffer in the setup phase seems to cause a substantial (~2x) slowdown in the subsequent benchmark of writing to a separate, aligned UnsafeBuffer.

Benchmark

I've set up a JMH benchmark comparing a direct ByteBuffer to an aligned UnsafeBuffer:

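import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.concurrent.TimeUnit;

import net.openhft.affinity.AffinityLock;
import org.agrona.BufferUtil;
import org.agrona.concurrent.UnsafeBuffer;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@Fork(1)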
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 1)
@Measurement(iterations = 3)
@State(Scope.Benchmark)
public class ByteBufferBenchmark {
    private static final int[] CORE_TO_USE = {5, 11};
    private static final ByteOrder ENDIAN = ByteOrder.nativeOrder();
    private static final int ALIGNMENT = 32;
    private static final int SIZE = 1024 * 1024 * 128; // 128MB

    private AffinityLock affinityLock;

    private ByteBuffer directBufferAligned;
    private UnsafeBuffer agronaByteBufferAligned;
    private UnsafeBuffer unrelatedAgronaBuffer;

    @Setup
    public void setup() {
        affinityLock = AffinityLock.acquireLock(CORE_TO_USE);

        directBufferAligned = BufferUtil.allocateDirectAligned(SIZE, ALIGNMENT).order(ENDIAN);
        agronaByteBufferAligned = new UnsafeBuffer(BufferUtil.allocateDirectAligned(SIZE, ALIGNMENT).order(ENDIAN));

        unrelatedAgronaBuffer = new UnsafeBuffer(ByteBuffer.allocate(SIZE).order(ENDIAN));
        for (int i = 0; i < SIZE; i += Integer.BYTES) {
            unrelatedAgronaBuffer.putInt(i, i); // why does this throw off JIT?
        }
    }

    @TearDown
    public void tearDown() {
        affinityLock.release();
    }

    @Benchmark
    public void writeDirectBufferAligned() {
        for (int i = 0; i < SIZE; i += Integer.BYTES) {
            directBufferAligned.putInt(i, i);
        }
    }

    @Benchmark
    public void writeAgronaByteBufferAligned() {
        for (int i = 0; i < SIZE; i += Integer.BYTES) {
            agronaByteBufferAligned.putInt(i, i);
        }
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(ByteBufferBenchmark.class.getSimpleName())
                .jvmArgsAppend("-Dagrona.disable.bounds.checks=true",
                        "-XX:+UnlockDiagnosticVMOptions",
                        "-XX:PrintAssemblyOptions=intel",
                        "-XX:+PrintAssembly")
                .addProfiler("perfasm", "events=cpu-clock;intelSyntax=true;top=3;hotThreshold=0.10")
                .build();
        new Runner(opt).run();
    }
}

Result

Benchmark                                             Mode  Cnt   Score   Error  Units
ByteBufferBenchmark.writeAgronaByteBufferAligned      avgt    3  22.722 ± 0.503  ms/op
ByteBufferBenchmark.writeDirectBufferAligned          avgt    3  12.253 ± 0.259  ms/op

Observed Behavior

When the write to unrelatedAgronaBuffer is included in the setup, the benchmark for writeAgronaByteBufferAligned() is approximately 2x slower than when it is excluded. Writing to directBufferAligned does not seem to be affected.

Assembly

I've captured assembly together with JMH - https://gist.github.com/ChongHan/890ca81a88e6c275d022c7ec6351c0b8

Environment

Agrona: 1.20.0
JMH: 1.37
Java: Azul Zulu 21.0.4
CPU: AMD Ryzen 5 3600
OS: Ubuntu 22.04
Kernel: 5.15

Additional Information

I've also tested this benchmark using GraalVM CE 21, and the performance degradation caused by writing to the unrelated UnsafeBuffer appears to be less pronounced compared to the standard JDK. I'd like to know if this could be a known issue or limitation of the JIT compiler, specifically in relation to HotSpot. Any insights or known behavior related to this would be appreciated.

vyazelenko commented 1 month ago

@ChongHan This does indeed seem to be JIT related. Here is what I got running this benchmark on my M2 laptop:

-XX:TieredStopAtLevel=0 -Dagrona.disable.bounds.checks=true
===========================================================
Benchmark                                         Mode  Cnt     Score     Error  Units
ByteBufferBenchmark.writeAgronaByteBufferAligned  avgt    3  3952.083 ± 416.133  ms/op
ByteBufferBenchmark.writeDirectBufferAligned      avgt    3  9620.034 ± 168.978  ms/op

-XX:TieredStopAtLevel=1 -Dagrona.disable.bounds.checks=true
===========================================================
Benchmark                                         Mode  Cnt    Score   Error  Units
ByteBufferBenchmark.writeAgronaByteBufferAligned  avgt    3   19.853 ± 5.872  ms/op
ByteBufferBenchmark.writeDirectBufferAligned      avgt    3  103.269 ± 2.050  ms/op

-XX:TieredStopAtLevel=2 -Dagrona.disable.bounds.checks=true
===========================================================
Benchmark                                         Mode  Cnt    Score    Error  Units
ByteBufferBenchmark.writeAgronaByteBufferAligned  avgt    3   65.892 ±  4.035  ms/op
ByteBufferBenchmark.writeDirectBufferAligned      avgt    3  143.316 ± 35.288  ms/op

-XX:TieredStopAtLevel=3 -Dagrona.disable.bounds.checks=true
===========================================================
Benchmark                                         Mode  Cnt    Score    Error  Units
ByteBufferBenchmark.writeAgronaByteBufferAligned  avgt    3  129.152 ± 14.025  ms/op
ByteBufferBenchmark.writeDirectBufferAligned      avgt    3  499.864 ± 35.398  ms/op

-XX:TieredStopAtLevel=4 -Dagrona.disable.bounds.checks=true
===========================================================
Benchmark                                         Mode  Cnt   Score   Error  Units
ByteBufferBenchmark.writeAgronaByteBufferAligned  avgt    3  14.774 ± 0.359  ms/op
ByteBufferBenchmark.writeDirectBufferAligned      avgt    3   5.287 ± 0.392  ms/op

As you can see, at every tier below tier 4 the Agrona buffer is faster than the JDK buffer. Only at tier 4 (C2) does the direct ByteBuffer come out ahead.
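
To make the comparison easier to read, here is a small sketch that computes the Agrona/direct ratio per tier from the scores posted above. The class and method names are made up for illustration, and the numbers are simply copied from the tables, not re-measured.

```java
public class TierRatioSketch {
    // Scores in ms/op copied from the tables above, indexed by TieredStopAtLevel 0..4.
    static final double[] AGRONA = {3952.083, 19.853, 65.892, 129.152, 14.774};
    static final double[] DIRECT = {9620.034, 103.269, 143.316, 499.864, 5.287};

    // A ratio below 1.0 means the Agrona buffer was faster at that tier.
    public static double ratio(int tier) {
        return AGRONA[tier] / DIRECT[tier];
    }

    public static void main(String[] args) {
        for (int tier = 0; tier < AGRONA.length; tier++) {
            System.out.printf("tier %d: Agrona/direct = %.2f%n", tier, ratio(tier));
        }
    }
}
```

A ratio below 1.0 means the Agrona buffer was faster at that tier, and tier 4 is the only one where it rises above 1.0.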

pveentjer commented 3 weeks ago

I had a look at the generated assembly. If the write through the UnsafeBuffer that wraps a regular (on-heap) ByteBuffer is done first, the code isn't very well optimized: the loop processes one item at a time (no loop unrolling).

If the unrelatedAgronaBuffer section is removed, the for-loop is unrolled by a factor of 16, which gives a lot more headroom due to the superscalar nature of modern processors. It also reduces the overhead of loop control.
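
For readers unfamiliar with unrolling, here is a hand-written sketch of the transformation. It uses a plain JDK ByteBuffer rather than UnsafeBuffer and a factor of 4 rather than 16 to keep it short. The class and method names are hypothetical, and the JIT performs this on the compiled code, not on the Java source.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class UnrollSketch {
    // One store per iteration: every store pays for a compare, an increment, and a branch.
    public static void fillRolled(ByteBuffer buf, int sizeBytes) {
        for (int i = 0; i < sizeBytes; i += Integer.BYTES) {
            buf.putInt(i, i);
        }
    }

    // Four stores per iteration: the loop-control cost is shared by four stores,
    // and the independent stores can overlap in the CPU pipeline.
    public static void fillUnrolledBy4(ByteBuffer buf, int sizeBytes) {
        final int stride = 4 * Integer.BYTES;
        final int limit = sizeBytes - (sizeBytes % stride);
        int i = 0;
        for (; i < limit; i += stride) {
            buf.putInt(i, i);
            buf.putInt(i + 4, i + 4);
            buf.putInt(i + 8, i + 8);
            buf.putInt(i + 12, i + 12);
        }
        // Tail loop for the elements that do not fill a whole stride.
        for (; i < sizeBytes; i += Integer.BYTES) {
            buf.putInt(i, i);
        }
    }

    public static void main(String[] args) {
        final int size = 1024 + 8; // not a multiple of 16, so the tail loop runs too
        ByteBuffer a = ByteBuffer.allocateDirect(size).order(ByteOrder.nativeOrder());
        ByteBuffer b = ByteBuffer.allocateDirect(size).order(ByteOrder.nativeOrder());
        fillRolled(a, size);
        fillUnrolledBy4(b, size);
        System.out.println("identical contents: " + a.equals(b));
    }
}
```

Both methods write the same bytes. The difference is only in how many stores share one round of loop control.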

My guess is that the JIT reduces its optimizations at the call-site if there are two types of ByteBuffer to deal with (an on-heap one and a direct one).
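
A minimal way to picture this guess, using only JDK classes: one shared write loop that is first exercised with a heap buffer and then with a direct buffer, so the same putInt call-site sees both kinds. This is an analogy, not Agrona's code. The class name is hypothetical, the timing is a crude illustration rather than a proper JMH measurement, and it does not prove that this is the cause of the slowdown.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class CallSiteProfileSketch {
    // Shared write loop. The putInt call-site below is reached with whatever
    // concrete ByteBuffer implementation the caller passes in.
    public static void fill(ByteBuffer buf) {
        for (int i = 0; i + Integer.BYTES <= buf.capacity(); i += Integer.BYTES) {
            buf.putInt(i, i);
        }
    }

    public static void main(String[] args) {
        final int size = 1 << 20;
        ByteBuffer heap = ByteBuffer.allocate(size).order(ByteOrder.nativeOrder());
        ByteBuffer direct = ByteBuffer.allocateDirect(size).order(ByteOrder.nativeOrder());

        // Analogous to the write in setup(): the loop first runs on a heap-backed buffer.
        fill(heap);

        // Analogous to the measured loop: the same code, now on a direct buffer.
        long start = System.nanoTime();
        for (int round = 0; round < 200; round++) {
            fill(direct);
        }
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        System.out.println("200 direct fills took " + elapsedMicros + " us");
    }
}
```

Commenting out the fill(heap) line and comparing the printed times, or better, the generated assembly, is one way to probe the hypothesis.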
