Closed mouse0w0 closed 4 months ago
Did you profile these changes? (With a proper JMH setup)
I'm doubtful that these changes have any noticeable performance improvements (the JVM is incredibly good at figuring out these things). In particular, copying fields to local variables in my experience just adds clutter and doesn't yield anything. (Why I come to this conclusion: I also thought that it helps, after all, the compiler cannot prove the absence of concurrent modification. But even in really costly iterative methods this made practically no difference. I think this is partly due to such a local variable needing to be copied and written to memory, since, indeed, the compiler cannot prove that not copying the value will have no influence.)
Similarly, marking local variables as final has (hardly) any influence. To my knowledge, the only possibility is that the compiler can perform some static analysis (which the JIT would very likely figure out, too) - final
is discarded from the byte-code.
I conducts a benchmark for List forEach. The benchmark code is as follows:
@State(Scope.Thread)
@Warmup(iterations = 3)
@Measurement(iterations = 10)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class IterationBenchmark {
@Param({"10", "1000", "100000"})
private int n;
private int[] a;
@Setup
public void setup() {
int[] a = new int[n];
for (int i = 0; i < n; i++) {
a[i] = i;
}
this.a = a;
}
@Benchmark
public void forEach(Blackhole blackhole) {
for (int i = 0; i < n; i++) {
blackhole.consume(a[i]);
}
}
@Benchmark
public void forEachCacheArray(Blackhole blackhole) {
int[] a = this.a;
for (int i = 0; i < n; i++) {
blackhole.consume(a[i]);
}
}
@Benchmark
public void forEachCacheSize(Blackhole blackhole) {
int[] a = this.a;
for (int i = 0, max = n; i < max; i++) {
blackhole.consume(a[i]);
}
}
}
The benchmark results are as follows:
Benchmark (n) Mode Cnt Score Error Units
IterationBenchmark.forEach 10 avgt 10 7.003 ± 0.006 ns/op
IterationBenchmark.forEach 1000 avgt 10 656.301 ± 0.153 ns/op
IterationBenchmark.forEach 100000 avgt 10 65878.207 ± 424.070 ns/op
IterationBenchmark.forEachCacheArray 10 avgt 10 4.208 ± 0.004 ns/op
IterationBenchmark.forEachCacheArray 1000 avgt 10 330.056 ± 0.024 ns/op
IterationBenchmark.forEachCacheArray 100000 avgt 10 32982.049 ± 6.810 ns/op
IterationBenchmark.forEachCacheSize 10 avgt 10 2.805 ± 0.004 ns/op
IterationBenchmark.forEachCacheSize 1000 avgt 10 105.135 ± 0.013 ns/op
IterationBenchmark.forEachCacheSize 100000 avgt 10 13622.145 ± 9.760 ns/op
Based on the benchmark results, caching the array object and size of the list reduces iteration time by over 70% compared to not caching. Regarding the benchmark results, my opinions are as follows:
When the code attempts to get a field of the this
object, it needs to execute two opcodes, aload 0
and getfield
. However, when getting a local variable, only one opcode is required, such as iload 1
. Therefore, if the code needs to access a field value more than twice, the number of opcodes for accessing fields is greater than accessing local variables.
Local variables are stored in the stack rather than the heap, implying that they can be released from memory when the method returns.
Computers tend to store local variables in the CPU cache rather than memory, facilitating faster access.
The getfield
opcode is relatively time-consuming, which to some extent affects performance.
Furthermore, using the final modifier to declare local variables is to conform to the existing code style of fastutil. Looking forward to your reply.
Based on the benchmark results, caching the array object and size of the list reduces iteration time by over 70% compared to not caching.
... on this benchmark that doesn't have dynamic dispatch etc. involved and is about forEach (I'm totally in favor of specializing forEach
!)
I doubt that there will be any difference (there might even be a penalty) for get
and set
(i.e. methods where assigning the local variable takes a constant fraction of the overall number of operations)
Also: Which version of Java are you using? I cannot replicate these findings. For me, all three are the same timing:
IterationBenchmark.forEach 100000 avgt 4 12461.602 ± 438.815 ns/op
IterationBenchmark.forEachCacheArray 100000 avgt 4 12404.329 ± 242.271 ns/op
IterationBenchmark.forEachCacheSize 100000 avgt 4 12649.409 ± 985.465 ns/op
opcodes
All these assumptions are assumptions. The bytecode is not at all what is actually produced by the JIT, which in turn is not at all what the actual operations executed on the CPU are (see pipelining, speculative execution / branch prediction, etc.). Such constant-factor optimizations need (as realistic as possible) profiling as justification.
(regarding point 2: The field however already exists on the heap, so there is no need to write and then free that memory at all!)
using the final modifier to declare local variables is to conform to the existing code style of fastutil.
Oh, right! Then making all effectively final variables final definitely is good :)
PS: I don't want to sound overly negative! I've just seen way too often that people try to outsmart a compiler, which more often than not actually is counterproductive.
It's interesting. I ran the very same benchmark out of curiosity on Hotspot 17.0.2 both on Mac M1 and on x86 and see the same improvements as @mouse0w0 when caching the reference to the array and its length. Even for very small data sets.
It's interesting. I ran the very same benchmark out of curiosity on Hotspot 17.0.2 both on Mac M1 and on x86 and see the same improvements as @mouse0w0 when caching the reference to the array and its length. Even for very small data sets.
Oh, that is surprising! I am on the same major version, to be precise # VM version: JDK 17.0.9, OpenJDK 64-Bit Server VM, 17.0.9+9-Ubuntu-122.04
, Blackhole mode: compiler
and for me there really are no differences. I tried several variations of this benchmark.
Can you try with .0.9? I doubt that this changes with a minor version, but still. I don't get why there should be such drastic differences.
@szarnekow @mouse0w0
Could you try this benchmark variant:
@State(Scope.Benchmark)
@Warmup(time = 2, iterations = 2)
@Measurement(time = 2, iterations = 4)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class IterationBenchmark {
@Param({"100000"})
private int n;
private int[] a;
@Setup
public void setup() {
int[] a = new int[n];
Arrays.setAll(a, IntUnaryOperator.identity());
this.a = a;
}
@Benchmark
public void forEach(Blackhole blackhole) {
for (int i = 0; i < n; i++) {
a[i] += i;
}
blackhole.consume(a);
}
@Benchmark
public void forEachCacheArray(Blackhole blackhole) {
int[] a = this.a;
for (int i = 0; i < n; i++) {
a[i] += i;
}
blackhole.consume(a);
}
@Benchmark
public void forEachCacheSize(Blackhole blackhole) {
int[] a = this.a;
for (int i = 0, max = n; i < max; i++) {
a[i] += i;
}
blackhole.consume(a);
}
}
(I just want to eliminate the low probability that it has something to do with the used blackhole implementation)
My benchmark result is based on # VM version: JDK 17.0.7, Java HotSpot(TM) 64-Bit Server VM, 17.0.7+8-LTS-224
.
Ok this is very surprising. Let me try if I can reproduce with HotSpot
and, if so, if it is hotspot being slow or openjdk being fast.
EDIT: Nope. JDK 17.0.10, Java HotSpot(TM) 64-Bit Server VM, 17.0.10+11-LTS-240
seems to be slightly faster overall (~2%) but the "naive" forEach
actually is fastest ...
Run your benchmark variant in JDK 17.0.7
, the benchmark results are as follows:
Benchmark (n) Mode Cnt Score Error Units
IterationBenchmark.forEach 100000 avgt 4 24642.541 ± 212.961 ns/op
IterationBenchmark.forEachCacheArray 100000 avgt 4 24641.441 ± 106.699 ns/op
IterationBenchmark.forEachCacheSize 100000 avgt 4 24642.589 ± 38.049 ns/op
I think the reason for this result is that the benchmark variant only consumes the array object itself, rather than consuming the data within the array object.
I think the reason for this result is that the benchmark variant only consumes the array object itself, rather than consuming the data within the array object.
But the compiler cannot know that .consume(a)
does not access the content of the array, thus it cannot skip the for loop.
I re-ran the original benchmark with HotSpot and get the same results:
Benchmark (n) Mode Cnt Score Error Units
IterationBenchmark.forEach 100000 avgt 2 12524.095 ns/op
IterationBenchmark.forEachCacheArray 100000 avgt 2 12577.578 ns/op
IterationBenchmark.forEachCacheSize 100000 avgt 2 12394.182 ns/op
For completeness: JMH version: 1.36
. But I am sort of at a loss right now. I have no idea how our results could be that different.
But my benchmark aligns with the function of forEach, obtaining and consuming one element at a time. We need to clarify what the JVM is doing.
My JMH version is 1.37
.
But my benchmark aligns with the function of forEach, obtaining and consuming one element at a time. We need to clarify what the JVM is doing.
Absolutely. This was just an attempt at pinpointing where the difference comes from.
Let's try this:
I started from an empty folder and ran what the JMH guide suggests:
mvn archetype:generate \
-DinteractiveMode=false \
-DarchetypeGroupId=org.openjdk.jmh \
-DarchetypeArtifactId=jmh-java-benchmark-archetype \
-DgroupId=org.sample \
-DartifactId=test \
-Dversion=1.0
Then, I put the following code into src/main/java/bench/IterationBenchmark.java
:
package bench;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;
@State(Scope.Thread)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class IterationBenchmark {
@Param({"100000"})
private int n;
private int[] a;
@Setup
public void setup() {
int[] a = new int[n];
for (int i = 0; i < n; i++) {
a[i] = i;
}
this.a = a;
}
@Benchmark
public void forEach(Blackhole blackhole) {
for (int i = 0; i < n; i++) {
blackhole.consume(a[i]);
}
}
@Benchmark
public void forEachCacheArray(Blackhole blackhole) {
int[] a = this.a;
for (int i = 0; i < n; i++) {
blackhole.consume(a[i]);
}
}
@Benchmark
public void forEachCacheSize(Blackhole blackhole) {
int[] a = this.a;
for (int i = 0, max = n; i < max; i++) {
blackhole.consume(a[i]);
}
}
}
followed by mvn clean verify
and java -jar target/benchmarks.jar
.
Output:
# JMH version: 1.37
# VM version: JDK 17.0.9, OpenJDK 64-Bit Server VM, 17.0.9+9-Ubuntu-122.04
# VM invoker: /usr/lib/jvm/java-17-openjdk-amd64/bin/java
# VM options: <none>
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 3 iterations, 10 s each
# Measurement: 2 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
and
Benchmark (n) Mode Cnt Score Error Units
IterationBenchmark.forEach 100000 avgt 2 12523.915 ns/op
IterationBenchmark.forEachCacheArray 100000 avgt 2 12503.226 ns/op
IterationBenchmark.forEachCacheSize 100000 avgt 2 12481.488 ns/op
(* I ran this with iterations = 2
because I didn't want to wait :) but the numbers are so consistent that iterations=1 already should be enough)
Original benchmark; M1 native;
JDK 19.0.2, OpenJDK 64-Bit Server VM, 19.0.2+7
Benchmark (n) Mode Cnt Score Error Units
IterationBenchmark.forEach 10 avgt 10 7,168 ± 0,177 ns/op
IterationBenchmark.forEachCacheArrayLength 10 avgt 10 2,375 ± 0,129 ns/op
IterationBenchmark.forEachCacheField 10 avgt 10 3,860 ± 0,141 ns/op
JDK 21.0.1, OpenJDK 64-Bit Server VM, 21.0.1+12-LTS
Benchmark (n) Mode Cnt Score Error Units
IterationBenchmark.forEach 10 avgt 10 2,547 ± 0,059 ns/op
IterationBenchmark.forEachCacheArrayLength 10 avgt 10 2,541 ± 0,036 ns/op
IterationBenchmark.forEachCacheField 10 avgt 10 2,556 ± 0,065 ns/op
Numbers on 17.0.9 or newer look similar to 21 for me
Upgrade JVM to 17.0.10
and get the same results as you:
# JMH version: 1.37
# VM version: JDK 17.0.10, Java HotSpot(TM) 64-Bit Server VM, 17.0.10+11-LTS-240
# VM invoker: C:\Program Files\jdk-17\bin\java.exe
# VM options: <none>
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 3 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
and
Benchmark (n) Mode Cnt Score Error Units
IterationBenchmark.forEach 100000 avgt 5 13633.378 ± 892.284 ns/op
IterationBenchmark.forEachCacheArray 100000 avgt 5 13517.690 ± 47.061 ns/op
IterationBenchmark.forEachCacheSize 100000 avgt 5 13550.632 ± 198.555 ns/op
Huh! I didn't expect this to come out of a minor version change. Well, that at least explains what happens in that particular case ... Thanks for trying with an upgraded version!
Benchmark in JVM versions 17.0.8
and 17.0.9
respectively, the results indicate that the change occurred in 17.0.9
.
Aha! The 17.0.9 changelog includes hotspot/compiler C2 Blackholes should allow load optimizations
Since your performance was similar when the blackhole was moved outside the loop, I am guessing this is the culprit. This would mean the performance difference was caused by blackhole and would not be observable in real applications (which aligns with what I have seen in e.g. JBDD)
Wow. That's a great and constructive discussion!
My 2€¢: I welcomed the PRs from @mouse0w0 because this is how fastutil code is written. Beyond 2 accesses to the same array I try to cache. I know it's up and down with actual improvements etc., but semantically it really makes sense and aften works (remember we're supporting Java 8).
@mouse0w0 noted that in some places I wasn't being coherent and fixed those places. So I'm totally happy to see this discussion but at least for the time being (and seeing different results on different architectures) for 3 accesses or more the rule is to cache. If you spot a relevant method where this does not happen I'm happy to accept PRs.
for 3 accesses or more the rule is to cache
I'm fine with that. Its just important to note that a) this primarily is a style decision, not performance and b) it should be tested that such a change doesn't actually negatively impact performance (which can happen).
The only thing I am "unhappy" about is the for (int i = 0, max = size; ...)
. I find these not nice to read. If the size field should be copied to local, I would prefer int size = this.size; for (int i = 0; i < size; i++) {...}
. But this is up for discussion.
Well, the rule tries to avoid the bad case. It is physically impossible to try with all type combinations and all instances whether there will be a negative impact on performance. Thus the 3+ rule.
What about for(i = 0, _size = size, i < _size; i++)?
Well, the rule tries to avoid the bad case. It is physically impossible to try with all type combinations and all instances whether there will be a negative impact on performance. Thus the 3+ rule.
Sure, I just meant that we cant just assume that local caching makes things faster. We tested it, saw that it doesn't make a difference, great
for(i = 0, _size = size, i < _size; i++)?
I equally dislike that - for me, a for loop should just have a single initialization. But its your project and your style decision :)
for (int i = 0, max = size; ...)
This code style has been applied to some of fastutil's templates, and if we don't use it, we'll need to change some of the templates to keep it consistent.
int size = this.size; for (int i = 0; i < size; i++) {...}
Personally, I prefer this code style as well, int size = this.size
is already heavily used in this pull request. And to simplify the code in the future, we can simply remove it without any problems.
for (int i = 0, max = size; ...)
This code style has been applied to some of fastutil's templates, and if we don't use it, we'll need to change some of the templates to keep it consistent.
int size = this.size; for (int i = 0; i < size; i++) {...}
Personally, I prefer this code style as well,
int size = this.size
is already heavily used in this pull request. And to simplify the code in the future, we can simply remove it without any problems.
Ok, so let's go for int size = this.size;
. One just has to be sure not to change the semantics of the the method, as this.
is not compulsory in Java. I'd thus suggest final int size = this.size;
. In this way, if in the rest of the method where was some mutation for this.size
we'll get an error.
This PR makes tests fail (didn't you check?). Apparently there is a change of behavior of the list used to check the behavior of big lists. I don't have time at the moment to chase this, as it might be involved.
I'm releasing 8.5.14 as it's overdue. If you can find the problem I'll add this to the next release.
My fault. I am not yet familiar with using ant junit, and the lack of prominent failures made me ignore it. I will fix it.
You can run the test ObjectBigArrayBigListTest
from your IDE, or ant junit
but that will run everything.
Note that in principle something that you fixed might have uncovered a bug in big lists. I am not keen to think that because those tests ran happily for +20 years.
Unfortunately the failed test is a stress test that call itself recursively, which makes not easy to track the point in which divergence between the big list and the standard list happens.
The issue is located in AbstractObjectList.clear
and Arrays.ensureFromTo
. When calling clear
on an empty SubList, it calls ObjectArrayList.removeElements
, with both from
and to
being 0. According to the Javadoc, the to
parameter in removeElements
is exclusive, but the to
parameter in Arrays.ensureFromTo
is inclusive. Upon inspection, the to
parameters of all methods that call Arrays.ensureFromTo
is exclusive. Therefore, Arrays.ensureFromTo
should be fixed.
According to the implementation of Java's Objects.checkFromToIndex
, the range from 0 (inclusive) to 0 (exclusive) is valid, which means it doesn't include any indices. Arrays.ensureFromTo
does not need to be changed, but its javadoc is incorrect.
I'm totally confused. What did you fix codewise (forget the docs for the moment)?
The problem lies in my removeElements
method, where I mistakenly filled the object array. This has been fixed in #327. However, the javadoc description for the to
parameter in Arrays.ensureFromTo
is incorrect, which misled me into thinking the issue was in its implementation. The Arrays.ensureFromTo
original javadoc is as follows:
https://github.com/vigna/fastutil/blob/5eaffbbd849ec8b034046e4a8b1a0cc410457778/src/it/unimi/dsi/fastutil/Arrays.java#L52-L53
https://github.com/vigna/fastutil/blob/5eaffbbd849ec8b034046e4a8b1a0cc410457778/src/it/unimi/dsi/fastutil/Arrays.java#L63-L65
It should be:
* @param to an end index (
inclusive exclusive).
Ahh OK, I see. The text is correct. The parameter description is wrong.
WTF, I have the new release but Sonatype gives me Unauthorized (401). Nothing has changed in my configuration since 5 months ago. If someone has suggestions, let me know, I tried on the community forum but I just met other people with the same problem...
Sonatype now enforces token authentication instead of username and password. Refer to the user manual to generate your token.
Thanks for the tip! They could provide a more sensible error message, frankly (or I'm just getting addicted to the Rust ecosystem).
In the previous pull request (#311), I optimized the array field access of some classes. Recently, with my deeper understanding of fastutil, I have discovered more optimizations and technical debts that need attention. However, resolving these issues is challenging and time-consuming, requiring contributions from the community and multiple pull requests. This pull request is the recent work I have done on ArrayList.