vigna / fastutil

fastutil extends the Java™ Collections Framework by providing type-specific maps, sets, lists and queues.
Apache License 2.0
1.81k stars 199 forks source link

Optimize ArrayList #316

Closed mouse0w0 closed 4 months ago

mouse0w0 commented 9 months ago

In the previous pull request (#311), I optimized the array field access of some classes. Recently, with my deeper understanding of fastutil, I have discovered more optimizations and technical debts that need attention. However, resolving these issues is challenging and time-consuming, requiring contributions from the community and multiple pull requests. This pull request is the recent work I have done on ArrayList.

incaseoftrouble commented 9 months ago

Did you profile these changes? (With a proper JMH setup)

I'm doubtful that these changes have any noticeable performance improvements (the JVM is incredibly good at figuring out these things). In particular, copying fields to local variables in my experience just adds clutter and doesn't yield anything. (Why I come to this conclusion: I also thought that it helps, after all, the compiler cannot prove the absence of concurrent modification. But even in really costly iterative methods this made practically no difference. I think this is partly due to such a local variable needing to be copied and written to memory, since, indeed, the compiler cannot prove that not copying the value will have no influence.)

Similarly, marking local variables as final has (hardly) any influence. To my knowledge, the only possibility is that the compiler can perform some static analysis (which the JIT would very likely figure out, too) - final is discarded from the byte-code.

mouse0w0 commented 9 months ago

I conducts a benchmark for List forEach. The benchmark code is as follows:

@State(Scope.Thread)
@Warmup(iterations = 3)
@Measurement(iterations = 10)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class IterationBenchmark {
    @Param({"10", "1000", "100000"})
    private int n;

    private int[] a;

    @Setup
    public void setup() {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
        }
        this.a = a;
    }

    @Benchmark
    public void forEach(Blackhole blackhole) {
        for (int i = 0; i < n; i++) {
            blackhole.consume(a[i]);
        }
    }

    @Benchmark
    public void forEachCacheArray(Blackhole blackhole) {
        int[] a = this.a;
        for (int i = 0; i < n; i++) {
            blackhole.consume(a[i]);
        }
    }

    @Benchmark
    public void forEachCacheSize(Blackhole blackhole) {
        int[] a = this.a;
        for (int i = 0, max = n; i < max; i++) {
            blackhole.consume(a[i]);
        }
    }
}

The benchmark results are as follows:

Benchmark                                (n)  Mode  Cnt      Score     Error  Units
IterationBenchmark.forEach                10  avgt   10      7.003 ±   0.006  ns/op
IterationBenchmark.forEach              1000  avgt   10    656.301 ±   0.153  ns/op
IterationBenchmark.forEach            100000  avgt   10  65878.207 ± 424.070  ns/op
IterationBenchmark.forEachCacheArray      10  avgt   10      4.208 ±   0.004  ns/op
IterationBenchmark.forEachCacheArray    1000  avgt   10    330.056 ±   0.024  ns/op
IterationBenchmark.forEachCacheArray  100000  avgt   10  32982.049 ±   6.810  ns/op
IterationBenchmark.forEachCacheSize       10  avgt   10      2.805 ±   0.004  ns/op
IterationBenchmark.forEachCacheSize     1000  avgt   10    105.135 ±   0.013  ns/op
IterationBenchmark.forEachCacheSize   100000  avgt   10  13622.145 ±   9.760  ns/op

Based on the benchmark results, caching the array object and size of the list reduces iteration time by over 70% compared to not caching. Regarding the benchmark results, my opinions are as follows:

  1. When the code attempts to get a field of the this object, it needs to execute two opcodes, aload 0 and getfield. However, when getting a local variable, only one opcode is required, such as iload 1. Therefore, if the code needs to access a field value more than twice, the number of opcodes for accessing fields is greater than accessing local variables.

  2. Local variables are stored in the stack rather than the heap, implying that they can be released from memory when the method returns.

  3. Computers tend to store local variables in the CPU cache rather than memory, facilitating faster access.

  4. The getfield opcode is relatively time-consuming, which to some extent affects performance.

Furthermore, using the final modifier to declare local variables is to conform to the existing code style of fastutil. Looking forward to your reply.

incaseoftrouble commented 9 months ago

Based on the benchmark results, caching the array object and size of the list reduces iteration time by over 70% compared to not caching.

... on this benchmark that doesn't have dynamic dispatch etc. involved and is about forEach (I'm totally in favor of specializing forEach!)

I doubt that there will be any difference (there might even be a penalty) for get and set (i.e. methods where assigning the local variable takes a constant fraction of the overall number of operations)

Also: Which version of Java are you using? I cannot replicate these findings. For me, all three are the same timing:

IterationBenchmark.forEach            100000  avgt    4  12461.602 ± 438.815  ns/op
IterationBenchmark.forEachCacheArray  100000  avgt    4  12404.329 ± 242.271  ns/op
IterationBenchmark.forEachCacheSize   100000  avgt    4  12649.409 ± 985.465  ns/op

opcodes

All these assumptions are assumptions. The bytecode is not at all what is actually produced by the JIT, which in turn is not at all what the actual operations executed on the CPU are (see pipelining, speculative execution / branch prediction, etc.). Such constant-factor optimizations need (as realistic as possible) profiling as justification.

(regarding point 2: The field however already exists on the heap, so there is no need to write and then free that memory at all!)

using the final modifier to declare local variables is to conform to the existing code style of fastutil.

Oh, right! Then making all effectively final variables final definitely is good :)

incaseoftrouble commented 9 months ago

PS: I don't want to sound overly negative! I've just seen way too often that people try to outsmart a compiler, which more often than not actually is counterproductive.

szarnekow commented 9 months ago

It's interesting. I ran the very same benchmark out of curiosity on Hotspot 17.0.2 both on Mac M1 and on x86 and see the same improvements as @mouse0w0 when caching the reference to the array and its length. Even for very small data sets.

incaseoftrouble commented 9 months ago

It's interesting. I ran the very same benchmark out of curiosity on Hotspot 17.0.2 both on Mac M1 and on x86 and see the same improvements as @mouse0w0 when caching the reference to the array and its length. Even for very small data sets.

Oh, that is surprising! I am on the same major version, to be precise # VM version: JDK 17.0.9, OpenJDK 64-Bit Server VM, 17.0.9+9-Ubuntu-122.04, Blackhole mode: compiler and for me there really are no differences. I tried several variations of this benchmark.

Can you try with .0.9? I doubt that this changes with a minor version, but still. I don't get why there should be such drastic differences.

incaseoftrouble commented 9 months ago

@szarnekow @mouse0w0

Could you try this benchmark variant:


@State(Scope.Benchmark)
@Warmup(time = 2, iterations = 2)
@Measurement(time = 2, iterations = 4)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class IterationBenchmark {
    @Param({"100000"})
    private int n;

    private int[] a;

    @Setup
    public void setup() {
        int[] a = new int[n];
        Arrays.setAll(a, IntUnaryOperator.identity());
        this.a = a;
    }

    @Benchmark
    public void forEach(Blackhole blackhole) {
        for (int i = 0; i < n; i++) {
            a[i] += i;
        }
        blackhole.consume(a);
    }

    @Benchmark
    public void forEachCacheArray(Blackhole blackhole) {
        int[] a = this.a;
        for (int i = 0; i < n; i++) {
            a[i] += i;
        }
        blackhole.consume(a);
    }

    @Benchmark
    public void forEachCacheSize(Blackhole blackhole) {
        int[] a = this.a;
        for (int i = 0, max = n; i < max; i++) {
            a[i] += i;
        }
        blackhole.consume(a);
    }
}

(I just want to eliminate the low probability that it has something to do with the used blackhole implementation)

mouse0w0 commented 9 months ago

My benchmark result is based on # VM version: JDK 17.0.7, Java HotSpot(TM) 64-Bit Server VM, 17.0.7+8-LTS-224.

incaseoftrouble commented 9 months ago

Ok this is very surprising. Let me try if I can reproduce with HotSpot and, if so, if it is hotspot being slow or openjdk being fast.

EDIT: Nope. JDK 17.0.10, Java HotSpot(TM) 64-Bit Server VM, 17.0.10+11-LTS-240 seems to be slightly faster overall (~2%) but the "naive" forEach actually is fastest ...

mouse0w0 commented 9 months ago

Run your benchmark variant in JDK 17.0.7, the benchmark results are as follows:

Benchmark                                (n)  Mode  Cnt      Score     Error  Units
IterationBenchmark.forEach            100000  avgt    4  24642.541 ± 212.961  ns/op
IterationBenchmark.forEachCacheArray  100000  avgt    4  24641.441 ± 106.699  ns/op
IterationBenchmark.forEachCacheSize   100000  avgt    4  24642.589 ±  38.049  ns/op
mouse0w0 commented 9 months ago

I think the reason for this result is that the benchmark variant only consumes the array object itself, rather than consuming the data within the array object.

incaseoftrouble commented 9 months ago

I think the reason for this result is that the benchmark variant only consumes the array object itself, rather than consuming the data within the array object.

But the compiler cannot know that .consume(a) does not access the content of the array, thus it cannot skip the for loop.

I re-ran the original benchmark with HotSpot and get the same results:

Benchmark                                (n)  Mode  Cnt      Score   Error  Units
IterationBenchmark.forEach            100000  avgt    2  12524.095          ns/op
IterationBenchmark.forEachCacheArray  100000  avgt    2  12577.578          ns/op
IterationBenchmark.forEachCacheSize   100000  avgt    2  12394.182          ns/op

For completeness: JMH version: 1.36. But I am sort of at a loss right now. I have no idea how our results could be that different.

mouse0w0 commented 9 months ago

But my benchmark aligns with the function of forEach, obtaining and consuming one element at a time. We need to clarify what the JVM is doing.

mouse0w0 commented 9 months ago

My JMH version is 1.37.

incaseoftrouble commented 9 months ago

But my benchmark aligns with the function of forEach, obtaining and consuming one element at a time. We need to clarify what the JVM is doing.

Absolutely. This was just an attempt at pinpointing where the difference comes from.

Let's try this:

I started from an empty folder and ran what the JMH guide suggests:

mvn archetype:generate \
  -DinteractiveMode=false \
  -DarchetypeGroupId=org.openjdk.jmh \
  -DarchetypeArtifactId=jmh-java-benchmark-archetype \
  -DgroupId=org.sample \
  -DartifactId=test \
  -Dversion=1.0

Then, I put the following code into src/main/java/bench/IterationBenchmark.java:

package bench;

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class IterationBenchmark {
    @Param({"100000"})
    private int n;

    private int[] a;

    @Setup
    public void setup() {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
        }
        this.a = a;
    }

    @Benchmark
    public void forEach(Blackhole blackhole) {
        for (int i = 0; i < n; i++) {
            blackhole.consume(a[i]);
        }
    }

    @Benchmark
    public void forEachCacheArray(Blackhole blackhole) {
        int[] a = this.a;
        for (int i = 0; i < n; i++) {
            blackhole.consume(a[i]);
        }
    }

    @Benchmark
    public void forEachCacheSize(Blackhole blackhole) {
        int[] a = this.a;
        for (int i = 0, max = n; i < max; i++) {
            blackhole.consume(a[i]);
        }
    }
}

followed by mvn clean verify and java -jar target/benchmarks.jar.

Output:

# JMH version: 1.37
# VM version: JDK 17.0.9, OpenJDK 64-Bit Server VM, 17.0.9+9-Ubuntu-122.04
# VM invoker: /usr/lib/jvm/java-17-openjdk-amd64/bin/java
# VM options: <none>
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 3 iterations, 10 s each
# Measurement: 2 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op

and

Benchmark                                (n)  Mode  Cnt      Score   Error  Units
IterationBenchmark.forEach            100000  avgt    2  12523.915          ns/op
IterationBenchmark.forEachCacheArray  100000  avgt    2  12503.226          ns/op
IterationBenchmark.forEachCacheSize   100000  avgt    2  12481.488          ns/op

(* I ran this with iterations = 2 because I didn't want to wait :) but the numbers are so consistent that iterations=1 already should be enough)

szarnekow commented 9 months ago

Original benchmark; M1 native;

JDK 19.0.2, OpenJDK 64-Bit Server VM, 19.0.2+7

Benchmark                                   (n)  Mode  Cnt  Score   Error  Units
IterationBenchmark.forEach                   10  avgt   10  7,168 ± 0,177  ns/op
IterationBenchmark.forEachCacheArrayLength   10  avgt   10  2,375 ± 0,129  ns/op
IterationBenchmark.forEachCacheField         10  avgt   10  3,860 ± 0,141  ns/op

JDK 21.0.1, OpenJDK 64-Bit Server VM, 21.0.1+12-LTS

Benchmark                                   (n)  Mode  Cnt  Score   Error  Units
IterationBenchmark.forEach                   10  avgt   10  2,547 ± 0,059  ns/op
IterationBenchmark.forEachCacheArrayLength   10  avgt   10  2,541 ± 0,036  ns/op
IterationBenchmark.forEachCacheField         10  avgt   10  2,556 ± 0,065  ns/op

Numbers on 17.0.9 or newer look similar to 21 for me

mouse0w0 commented 9 months ago

Upgrade JVM to 17.0.10 and get the same results as you:

# JMH version: 1.37
# VM version: JDK 17.0.10, Java HotSpot(TM) 64-Bit Server VM, 17.0.10+11-LTS-240
# VM invoker: C:\Program Files\jdk-17\bin\java.exe
# VM options: <none>
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 3 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op

and

Benchmark                                (n)  Mode  Cnt      Score     Error  Units
IterationBenchmark.forEach            100000  avgt    5  13633.378 ± 892.284  ns/op
IterationBenchmark.forEachCacheArray  100000  avgt    5  13517.690 ±  47.061  ns/op
IterationBenchmark.forEachCacheSize   100000  avgt    5  13550.632 ± 198.555  ns/op
incaseoftrouble commented 9 months ago

Huh! I didn't expect this to come out of a minor version change. Well, that at least explains what happens in that particular case ... Thanks for trying with an upgraded version!

mouse0w0 commented 9 months ago

Benchmark in JVM versions 17.0.8 and 17.0.9 respectively, the results indicate that the change occurred in 17.0.9.

incaseoftrouble commented 9 months ago

Aha! The 17.0.9 changelog includes hotspot/compiler C2 Blackholes should allow load optimizations

Since your performance was similar when the blackhole was moved outside the loop, I am guessing this is the culprit. This would mean the performance difference was caused by blackhole and would not be observable in real applications (which aligns with what I have seen in e.g. JBDD)

vigna commented 9 months ago

Wow. That's a great and constructive discussion!

My 2€¢: I welcomed the PRs from @mouse0w0 because this is how fastutil code is written. Beyond 2 accesses to the same array I try to cache. I know it's up and down with actual improvements etc., but semantically it really makes sense and aften works (remember we're supporting Java 8).

@mouse0w0 noted that in some places I wasn't being coherent and fixed those places. So I'm totally happy to see this discussion but at least for the time being (and seeing different results on different architectures) for 3 accesses or more the rule is to cache. If you spot a relevant method where this does not happen I'm happy to accept PRs.

incaseoftrouble commented 9 months ago

for 3 accesses or more the rule is to cache

I'm fine with that. Its just important to note that a) this primarily is a style decision, not performance and b) it should be tested that such a change doesn't actually negatively impact performance (which can happen).

The only thing I am "unhappy" about is the for (int i = 0, max = size; ...). I find these not nice to read. If the size field should be copied to local, I would prefer int size = this.size; for (int i = 0; i < size; i++) {...}. But this is up for discussion.

vigna commented 9 months ago

Well, the rule tries to avoid the bad case. It is physically impossible to try with all type combinations and all instances whether there will be a negative impact on performance. Thus the 3+ rule.

What about for(i = 0, _size = size, i < _size; i++)?

incaseoftrouble commented 9 months ago

Well, the rule tries to avoid the bad case. It is physically impossible to try with all type combinations and all instances whether there will be a negative impact on performance. Thus the 3+ rule.

Sure, I just meant that we cant just assume that local caching makes things faster. We tested it, saw that it doesn't make a difference, great

for(i = 0, _size = size, i < _size; i++)?

I equally dislike that - for me, a for loop should just have a single initialization. But its your project and your style decision :)

mouse0w0 commented 9 months ago

for (int i = 0, max = size; ...)

This code style has been applied to some of fastutil's templates, and if we don't use it, we'll need to change some of the templates to keep it consistent.

int size = this.size; for (int i = 0; i < size; i++) {...}

Personally, I prefer this code style as well, int size = this.size is already heavily used in this pull request. And to simplify the code in the future, we can simply remove it without any problems.

vigna commented 9 months ago

for (int i = 0, max = size; ...)

This code style has been applied to some of fastutil's templates, and if we don't use it, we'll need to change some of the templates to keep it consistent.

int size = this.size; for (int i = 0; i < size; i++) {...}

Personally, I prefer this code style as well, int size = this.size is already heavily used in this pull request. And to simplify the code in the future, we can simply remove it without any problems.

Ok, so let's go for int size = this.size;. One just has to be sure not to change the semantics of the the method, as this. is not compulsory in Java. I'd thus suggest final int size = this.size;. In this way, if in the rest of the method where was some mutation for this.size we'll get an error.

vigna commented 4 months ago

This PR makes tests fail (didn't you check?). Apparently there is a change of behavior of the list used to check the behavior of big lists. I don't have time at the moment to chase this, as it might be involved.

I'm releasing 8.5.14 as it's overdue. If you can find the problem I'll add this to the next release.

mouse0w0 commented 4 months ago

My fault. I am not yet familiar with using ant junit, and the lack of prominent failures made me ignore it. I will fix it.

vigna commented 4 months ago

You can run the test ObjectBigArrayBigListTest from your IDE, or ant junit but that will run everything.

Note that in principle something that you fixed might have uncovered a bug in big lists. I am not keen to think that because those tests ran happily for +20 years.

Unfortunately the failed test is a stress test that call itself recursively, which makes not easy to track the point in which divergence between the big list and the standard list happens.

mouse0w0 commented 4 months ago

The issue is located in AbstractObjectList.clear and Arrays.ensureFromTo. When calling clear on an empty SubList, it calls ObjectArrayList.removeElements, with both from and to being 0. According to the Javadoc, the to parameter in removeElements is exclusive, but the to parameter in Arrays.ensureFromTo is inclusive. Upon inspection, the to parameters of all methods that call Arrays.ensureFromTo is exclusive. Therefore, Arrays.ensureFromTo should be fixed.

mouse0w0 commented 4 months ago

According to the implementation of Java's Objects.checkFromToIndex, the range from 0 (inclusive) to 0 (exclusive) is valid, which means it doesn't include any indices. Arrays.ensureFromTo does not need to be changed, but its javadoc is incorrect.

vigna commented 4 months ago

I'm totally confused. What did you fix codewise (forget the docs for the moment)?

mouse0w0 commented 4 months ago

The problem lies in my removeElements method, where I mistakenly filled the object array. This has been fixed in #327. However, the javadoc description for the to parameter in Arrays.ensureFromTo is incorrect, which misled me into thinking the issue was in its implementation. The Arrays.ensureFromTo original javadoc is as follows: https://github.com/vigna/fastutil/blob/5eaffbbd849ec8b034046e4a8b1a0cc410457778/src/it/unimi/dsi/fastutil/Arrays.java#L52-L53 https://github.com/vigna/fastutil/blob/5eaffbbd849ec8b034046e4a8b1a0cc410457778/src/it/unimi/dsi/fastutil/Arrays.java#L63-L65 It should be: * @param to an end index (inclusive exclusive).

vigna commented 3 months ago

Ahh OK, I see. The text is correct. The parameter description is wrong.

vigna commented 3 months ago

WTF, I have the new release but Sonatype gives me Unauthorized (401). Nothing has changed in my configuration since 5 months ago. If someone has suggestions, let me know, I tried on the community forum but I just met other people with the same problem...

mouse0w0 commented 3 months ago

Sonatype now enforces token authentication instead of username and password. Refer to the user manual to generate your token.

vigna commented 3 months ago

Thanks for the tip! They could provide a more sensible error message, frankly (or I'm just getting addicted to the Rust ecosystem).