Closed by thomaswue 7 months ago
It doesn't seem to make a difference on the eval machine:
```
Using java version 21.0.2-graal in this shell.
Validating calculate_average_thomaswue.sh -- measurements_1B.txt
Picking up existing native image 'target/CalculateAverage_thomaswue_image', delete the file to select JVM mode.
+ rm -f measurements.txt
+ ln -s measurements_1B.txt measurements.txt
Benchmark 1: timeout -v 300 ./calculate_average_thomaswue.sh 2>&1
  Time (mean ± σ):     2.189 s ±  0.043 s    [User: 0.002 s, System: 0.004 s]
  Range (min … max):   2.075 s …  2.225 s    10 runs

Summary
thomaswue: trimmed mean 2.19823098707, raw times 2.07534146632,2.19937587432,2.16664367332,2.2091708203200002,2.22455168132,2.22013504532,2.19476398532,2.20274883332,2.20429925032,2.18871041432
```
Leaderboard
| # | Result (m:s.ms) | Implementation | JDK | Submitter | Notes |
|---|-----------------|--------------------|-----|---------------|-----------|
| | 00:02.198 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java)| 21.0.2-graal | [Thomas Wuerthinger](https://github.com/thomaswue) | GraalVM native binary, uses Unsafe |
Btw., this might be interesting to you: I am seeing this exception on two other entries:

```
Fatal error: Failed to leave the current IsolateThread context and to detach the current thread. (code 12)
```

For instance here. Any idea what might be the cause and how to avoid it?
It seems it comes from here:
```java
public static final class LeaveEpilogue implements CEntryPointOptions.Epilogue {
    private static final CGlobalData<CCharPointer> errorMessage = CGlobalDataFactory.createCString(
            "Failed to leave the current IsolateThread context.");

    @Uninterruptible(reason = "epilogue")
    static void leave() {
        int code = CEntryPointActions.leave();
        if (code != CEntryPointErrors.NO_ERROR) {
            CEntryPointActions.failFatally(code, errorMessage.get());
        }
    }
}
```
I'll try to find the cause.
Related problem: https://github.com/quarkusio/quarkus/issues/6606#issuecomment-586340877
Maybe @peter-hofer can help =)
Yes, this looks like a potential native image bug. I will check with the team.
> and to detach the current thread.
It's actually here (GitHub text search is not that good ...):
```java
public static final class LeaveDetachThreadEpilogue implements CEntryPointOptions.Epilogue {
    private static final CGlobalData<CCharPointer> errorMessage = CGlobalDataFactory.createCString(
            "Failed to leave the current IsolateThread context and to detach the current thread.");

    @Uninterruptible(reason = "epilogue")
    public static void leave() {
        int code = CEntryPointActions.leaveDetachThread();
        if (code != CEntryPointErrors.NO_ERROR) {
            CEntryPointActions.failFatally(code, errorMessage.get());
        }
    }
}
```
There's an open issue from Mar 24, 2021 for that same error message and code: https://github.com/oracle/graal/issues/3305
> Btw., this might be interesting to you: I am seeing this exception on two other entries:
> `Fatal error: Failed to leave the current IsolateThread context and to detach the current thread. (code 12)`
> For instance here. Any idea what might be the cause and how to avoid it?
There are at least three entries where this error happened:
https://github.com/gunnarmorling/1brc/pull/479
@gunnarmorling Could you please evaluate this new version? It consistently beats the top entry now on my machine, so with a bit of luck, it could get below 2s ;-).
It is below 2 sec indeed, and quite a bit in fact! Could you describe perhaps on a high level what is making this huge difference at this point?
```
Benchmark 1: timeout -v 300 ./calculate_average_thomaswue.sh 2>&1
  Time (mean ± σ):     1.890 s ±  0.011 s    [User: 0.002 s, System: 0.004 s]
  Range (min … max):   1.859 s …  1.898 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
thomaswue: trimmed mean 1.89342727442, raw times 1.85925432142,1.8941132214199998,1.8928165554199998,1.89327190942,1.89163260842,1.89773446642,1.89233219542,1.89314723642,1.89674720142,1.89335726742
```
Leaderboard
| # | Result (m:s.ms) | Implementation | JDK | Submitter | Notes |
|---|-----------------|--------------------|-----|---------------|-----------|
| | 00:01.893 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java)| 21.0.2-graal | [Thomas Wuerthinger](https://github.com/thomaswue) | GraalVM native binary, uses Unsafe |
Excellent! Below 2s. Mission accomplished ;-).
So one main change is to process segments in 2MB blocks to make sure all threads finish at about the same time (otherwise some threads might sit idle while one thread finishes its part). This is inspired by @artsiomkorzun's solution. Looking at the perf flamegraph of his solution showed the benefit in wall clock time.
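A minimal sketch of that chunking scheme (names and sizes are illustrative, not the actual entry's code): worker threads claim 2MB chunks from a shared atomic cursor, so no thread is left holding one huge trailing segment while the others sit idle.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: threads claim fixed 2 MB chunks from a shared cursor
// instead of getting one large pre-assigned segment each, so all threads
// stay busy until the end of the file.
public class ChunkedWork {
    static final long CHUNK_SIZE = 2 * 1024 * 1024; // 2 MB

    // Returns the total number of bytes "processed"; the real entry would
    // scan measurement lines inside each [start, end) range.
    static long process(long fileSize, int numThreads) {
        AtomicLong cursor = new AtomicLong(0);
        AtomicLong processed = new AtomicLong(0);
        Thread[] threads = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            threads[i] = new Thread(() -> {
                while (true) {
                    long start = cursor.getAndAdd(CHUNK_SIZE);
                    if (start >= fileSize) {
                        return; // no chunks left
                    }
                    long end = Math.min(start + CHUNK_SIZE, fileSize);
                    processed.addAndGet(end - start); // stand-in for scanning
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return processed.get();
    }

    public static void main(String[] args) {
        long size = 13L * 1024 * 1024 + 123; // deliberately not a multiple of 2 MB
        System.out.println(process(size, 8) == size); // prints "true"
    }
}
```

In a real entry the chunk boundaries would additionally be adjusted to the nearest newline so no measurement line is split between two workers.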
The other major change is to process 3 scanners in parallel in the same thread. This increases instructions per cycle, because there is more straight-line code that the Intel processor can optimize better. The machine code generation is also more efficient, because quite a few 64-bit long constants are involved (in finding delimiters), and those are better shared when the same operation is done 3 at a time.
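The delimiter search that fast 1brc entries typically use is a SWAR trick that tests 8 bytes at once; the 64-bit constants below are the kind that can be shared when three scanners run in the same loop. This is an illustrative sketch, not code from the entry:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative SWAR delimiter search: find ';' within 8 bytes at once.
// The classic zero-byte trick flags the first byte that equals the pattern.
public class Swar {
    static final long SEMI_PATTERN = 0x3B3B3B3B3B3B3B3BL; // ';' in every byte

    static long semicolonMatchBits(long word) {
        long diff = word ^ SEMI_PATTERN; // bytes equal to ';' become 0x00
        // High bit of each byte is set where that byte was zero (lowest match
        // position is always correct, which is all we need).
        return (diff - 0x0101010101010101L) & ~diff & 0x8080808080808080L;
    }

    // Index (0..7) of the first ';' in the little-endian word, or -1.
    static int firstSemicolonIndex(long word) {
        long bits = semicolonMatchBits(word);
        return bits == 0 ? -1 : Long.numberOfTrailingZeros(bits) >>> 3;
    }

    public static void main(String[] args) {
        byte[] bytes = "abc;12.3".getBytes();
        long word = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong();
        System.out.println(firstSemicolonIndex(word)); // prints 3
    }
}
```

Running three such scans over three independent file positions in one loop body gives the CPU more independent work per iteration, while the compiler keeps `SEMI_PATTERN` and friends in registers once for all three lanes.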
Awesome, really cool! Thx for the detailed info.
Beautiful solution Thomas! :star_struck: :star_struck:
I think the main speedup came from the smaller segments, and from handling them inside the hot loop, because you still get the huge benefit of using the same `Result[] results = new Result[1 << 17];`.
I'm not sure how much performance was gained from unrolling (splitting) the scanners. But you surely know! =)
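For reference, a `Result[1 << 17]` table of this kind works as an open-addressing hash table; a simplified sketch (hypothetical names, linear probing on collision) might look like:

```java
// Simplified sketch of an open-addressing station table: a fixed array of
// 1 << 17 slots indexed by (hash & mask), probing linearly on collision.
// Names and fields are illustrative, not the actual entry's code.
public class StationTable {
    static final int SIZE = 1 << 17;
    static final int MASK = SIZE - 1;

    static final class Result {
        final String name;
        long count; // real entries also track min/max/sum
        Result(String name) { this.name = name; }
    }

    final Result[] results = new Result[SIZE];

    Result lookup(String name) {
        int index = name.hashCode() & MASK;
        while (true) {
            Result slot = results[index];
            if (slot == null) {            // empty slot: insert new entry
                slot = new Result(name);
                results[index] = slot;
                return slot;
            }
            if (slot.name.equals(name)) {  // hit: existing entry
                return slot;
            }
            index = (index + 1) & MASK;    // collision: probe next slot
        }
    }

    public static void main(String[] args) {
        StationTable table = new StationTable();
        table.lookup("Hamburg").count++;
        table.lookup("Hamburg").count++;
        table.lookup("Jakarta").count++;
        System.out.println(table.lookup("Hamburg").count); // prints 2
    }
}
```

With at most ~10,000 distinct stations in a table of 131,072 slots, probe chains stay very short, which is why reusing the same large array across segment sizes is cheap.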
It is a bit difficult to isolate (and might differ on the benchmark machine), but of the total 15% improvement over the previous entry, the two changes seem to contribute almost equally. Doing 3 in parallel definitely helps by more than 5% on my machine.
Check List:
- [x] Tests pass (`./test.sh <username>` shows no differences between expected and actual outputs)
- [x] All formatting changes by the build are committed
- [x] Your launch script is named `calculate_average_<username>.sh` (make sure to match casing of your GH user name) and is executable
- [x] Output matches that of `calculate_average_baseline.sh`
- [x] For new entries, or after substantial changes: When implementing custom hash structures, please point to where you deal with hash collisions (line number)
Execution time: 0.45s
Execution time of reference implementation: 120.37s
This is some final (?) fine-tuning. It should give a few more percent. The next step would be to convert the hash table array into an "inlined data structure" with all-unsafe access, but I'm not sure I want to go there ;-).
Update: Got this down to 0.41s now on my machine with processing segments at 2MB each and also processing with 3 scanners in parallel in the same thread.
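The "inlined data structure with all unsafe access" idea mentioned above could look roughly like the following: per-station statistics stored as adjacent longs in one off-heap block, addressed with `Unsafe`, avoiding object headers and pointer chasing. Layout, field order, and names here are purely hypothetical.

```java
import sun.misc.Unsafe;
import java.lang.reflect.Field;

// Hypothetical flat layout: each station slot is 4 adjacent longs
// (e.g. min, max, sum, count) in one off-heap allocation.
public class FlatTable {
    static final Unsafe UNSAFE = getUnsafe();
    static final int SLOTS = 1 << 17;
    static final long SLOT_BYTES = 4 * Long.BYTES;
    static final long BASE = UNSAFE.allocateMemory(SLOTS * SLOT_BYTES);
    static {
        // allocateMemory does not zero; clear the whole block up front
        UNSAFE.setMemory(BASE, SLOTS * SLOT_BYTES, (byte) 0);
    }

    static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }

    // The "count" field is the 4th long of the slot in this made-up layout.
    static void addCount(int slot, long delta) {
        long addr = BASE + slot * SLOT_BYTES + 3 * Long.BYTES;
        UNSAFE.putLong(addr, UNSAFE.getLong(addr) + delta);
    }

    static long count(int slot) {
        return UNSAFE.getLong(BASE + slot * SLOT_BYTES + 3 * Long.BYTES);
    }

    public static void main(String[] args) {
        addCount(42, 2);
        System.out.println(count(42)); // prints 2
    }
}
```

Whether this pays off depends on how much of the remaining time is spent chasing `Result` object references; it also gives up all bounds checking, which is presumably why the author hesitates to go there.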