Array iteration performance issue

mouse0w0 commented 11 months ago

In a benchmark focused on array iteration, I observed a substantial performance improvement of nearly four times when using foreach or storing the array in a local variable. I noticed that in certain array iteration scenarios within fastutil, arrays are not stored in local variables (e.g., ArrayList.drv, ArrayMap.drv). I'm uncertain whether this should be considered as an issue. Below is my benchmark code:

package benchmark;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.util.Arrays;
import java.util.concurrent.TimeUnit;

@Fork(1)
@State(Scope.Thread)
@Warmup(iterations = 3)
@Measurement(iterations = 10)
@OutputTimeUnit(TimeUnit.SECONDS)
public class ArrayIterationBenchmark {
    private int[] array;

    @Setup
    public void setup() {
        array = new int[1 << 20];
        for (int i = 0; i < array.length; i++) {
            array[i] = i;
        }
    }

    @Benchmark
    public void foreach(Blackhole blackhole) {
        for (int i : array) {
            blackhole.consume(i);
        }
    }

    @Benchmark
    public void lambda(Blackhole blackhole) {
        Arrays.stream(array).forEach(blackhole::consume);
    }

    @Benchmark
    public void forward(Blackhole blackhole) {
        for (int i = 0; i < array.length; i++) {
            blackhole.consume(array[i]);
        }
    }

    @Benchmark
    public void forwardWithStoreArray(Blackhole blackhole) {
        final int[] a = array;
        for (int i = 0; i < a.length; i++) {
            blackhole.consume(a[i]);
        }
    }

    @Benchmark
    public void forwardWithStoreLength(Blackhole blackhole) {
        final int[] a = array;
        final int length = a.length;
        for (int i = 0; i < length; i++) {
            blackhole.consume(a[i]);
        }
    }

    @Benchmark
    public void reversed(Blackhole blackhole) {
        for (int i = array.length - 1; i >= 0; i--) {
            blackhole.consume(array[i]);
        }
    }

    @Benchmark
    public void reversedWithStoreArray(Blackhole blackhole) {
        final int[] a = array;
        for (int i = a.length - 1; i >= 0; i--) {
            blackhole.consume(a[i]);
        }
    }

    @Benchmark
    public void reversed2(Blackhole blackhole) {
        for (int i = array.length; i-- != 0; ) {
            blackhole.consume(array[i]);
        }
    }

    @Benchmark
    public void reversed2WithStoreArray(Blackhole blackhole) {
        final int[] a = array;
        for (int i = a.length; i-- != 0; ) {
            blackhole.consume(a[i]);
        }
    }
}

Benchmark result:

Benchmark                                         Mode  Cnt     Score     Error  Units
ArrayIterationBenchmark.foreach                  thrpt   10  6735.922 ± 192.578  ops/s
ArrayIterationBenchmark.forward                  thrpt   10  1434.708 ±   2.032  ops/s
ArrayIterationBenchmark.forwardWithStoreArray    thrpt   10  6827.414 ±  23.978  ops/s
ArrayIterationBenchmark.forwardWithStoreLength   thrpt   10  6835.175 ±   9.712  ops/s
ArrayIterationBenchmark.lambda                   thrpt   10  6843.325 ± 198.511  ops/s
ArrayIterationBenchmark.reversed                 thrpt   10  2447.371 ± 157.413  ops/s
ArrayIterationBenchmark.reversed2                thrpt   10  2446.422 ± 182.167  ops/s
ArrayIterationBenchmark.reversed2WithStoreArray  thrpt   10  6041.881 ±  15.997  ops/s
ArrayIterationBenchmark.reversedWithStoreArray   thrpt   10  5986.691 ±  10.935  ops/s

I am seeking guidance on whether the decision not to store arrays in local variables is intentional or if it might be worth considering as an optimization opportunity. Your insights into this matter would be greatly appreciated.

vigna commented 11 months ago

Oh well. I try to cache locally whenever possible. If you see places where I didn't, it is because I forgot it. A PR would be welcome!

mouse0w0 commented 11 months ago

Additionally, I have noticed that iterating over the array in the forward direction is faster and more readable compared to iterating in reverse order. I believe this could be considered as an optimization, and I am willing to submit a pull request to modify the reverse iteration in fastutil if necessary. Looking forward to your guidance.

vigna commented 11 months ago

fastutil has been written > 20y ago, and at that time a series of compilation optimizations made reserve loop faster. I think things change and it's possible current JVM have better compilation. Do you have any benchmarks for, say, Arrays.fill()? And, what is java.util.Arrays.fill() doing today?

mouse0w0 commented 11 months ago

In the benchmark of filling array, reverse filling is twice as fast as forward filling in Java 8, with no significant differences observed in Java 17.

Benchmark code:

@Fork(1)
@State(Scope.Thread)
@Warmup(iterations = 3)
@Measurement(iterations = 10)
@OutputTimeUnit(TimeUnit.SECONDS)
public class ArrayFillBenchmark {
    private int[] a;

    @Setup
    public void setup() {
        a = new int[1000000];
    }

    @Benchmark
    public void fill() {
        Arrays.fill(a, 1);
    }

    @Benchmark
    public void forward() {
        int[] a = this.a;
        for (int i = 0, len = a.length; i < len; i++) {
            a[i] = 1;
        }
    }

    @Benchmark
    public void reverse() {
        int[] a = this.a;
        for (int i = a.length - 1; i >= 0; i--) {
            a[i] = 1;
        }
    }

    @Benchmark
    public void reverse2() {
        int[] a = this.a;
        for (int i = a.length; i-- != 0;) {
            a[i] = 1;
        }
    }
}

Benchmark result:

JDK 1.8.0_311
Benchmark                      Mode  Cnt      Score     Error  Units
ArrayFillBenchmark.fill      thrpt   10   7719.026 ±  57.793  ops/s
ArrayFillBenchmark.forward   thrpt   10   7446.591 ± 355.939  ops/s
ArrayFillBenchmark.reverse   thrpt   10  11386.325 ± 416.348  ops/s
ArrayFillBenchmark.reverse2  thrpt   10  11604.816 ± 151.888  ops/s

JDK 17.0.7
Benchmark                      Mode  Cnt      Score     Error  Units
ArrayFillBenchmark.fill      thrpt   10  11792.673 ± 191.212  ops/s
ArrayFillBenchmark.forward   thrpt   10  11510.398 ± 449.012  ops/s
ArrayFillBenchmark.reverse   thrpt   10  11493.141 ± 243.214  ops/s
ArrayFillBenchmark.reverse2  thrpt   10  11593.859 ± 150.707  ops/s

incaseoftrouble commented 11 months ago

fastutil has been written > 20y ago, and at that time a series of compilation optimizations made reserve loop faster. I think things change and it's possible current JVM have better compilation. Do you have any benchmarks for, say, Arrays.fill()? And, what is java.util.Arrays.fill() doing today?

Arrays.fill, Arrays.copyOf, System.arraycopy etc. are AFAIK (a tiny bit) faster than their "naive" counterparts. Concretely System.arraycopy is a (more or less) direct native call I think and Arrays.copyOf delegates to that. Plus these methods can use vectorized instructions.

mouse0w0 commented 9 months ago

Based on the conclusion in pull request #316, running benchmark on JDK 17.0.10. Benchmark result:

# JMH version: 1.37
# VM version: JDK 17.0.10, Java HotSpot(TM) 64-Bit Server VM, 17.0.10+11-LTS-240
# VM invoker: C:\Program Files\jdk-17\bin\java.exe
# VM options: <none>
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 2 iterations, 10 s each
# Measurement: 2 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time

Benchmark                                         Mode  Cnt     Score   Error  Units
ArrayIterationBenchmark.foreach                  thrpt    2  7329.832          ops/s
ArrayIterationBenchmark.forward                  thrpt    2  7329.824          ops/s
ArrayIterationBenchmark.forwardWithStoreArray    thrpt    2  7322.617          ops/s
ArrayIterationBenchmark.forwardWithStoreLength   thrpt    2  7263.240          ops/s
ArrayIterationBenchmark.lambda                   thrpt    2  7233.075          ops/s
ArrayIterationBenchmark.reversed                 thrpt    2  6325.382          ops/s
ArrayIterationBenchmark.reversed2                thrpt    2  6531.927          ops/s
ArrayIterationBenchmark.reversed2WithStoreArray  thrpt    2  6507.797          ops/s
ArrayIterationBenchmark.reversedWithStoreArray   thrpt    2  6535.968          ops/s

According to the benchmark results, iterating arrays in forward yields a 10% higher throughput compared to iterating arrays in reverse.

mouse0w0 commented 9 months ago

Whether caching fields or not, it will not affect performance.

vigna / fastutil

Array iteration performance issue #310