gunnarmorling / 1brc

1️⃣🐝🏎️ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java
https://www.morling.dev/blog/one-billion-row-challenge/
Apache License 2.0

Some fine tuning for thomaswue #606

Closed thomaswue closed 7 months ago

thomaswue commented 7 months ago

This is some final (?) fine tuning. Should give a few more %. Next step would be to convert the hash table array into an "inlined data structure" with all unsafe access, but not sure if I want to go there ;-).

Update: Got this down to 0.41s now on my machine with processing segments at 2MB each and also processing with 3 scanners in parallel in the same thread.

gunnarmorling commented 7 months ago

It doesn't seem to make a difference on the eval machine:

Using java version 21.0.2-graal in this shell.
Validating calculate_average_thomaswue.sh -- measurements_1B.txt
Picking up existing native image 'target/CalculateAverage_thomaswue_image', delete the file to select JVM mode.

+ rm -f measurements.txt
+ ln -s measurements_1B.txt measurements.txt
Benchmark 1: timeout -v 300 ./calculate_average_thomaswue.sh 2>&1
  Time (mean ± σ):      2.189 s ±  0.043 s    [User: 0.002 s, System: 0.004 s]
  Range (min … max):    2.075 s …  2.225 s    10 runs

Summary
  thomaswue: trimmed mean 2.19823098707, raw times 2.07534146632,2.19937587432,2.16664367332,2.2091708203200002,2.22455168132,2.22013504532,2.19476398532,2.20274883332,2.20429925032,2.18871041432

Leaderboard

| # | Result (m:s.ms) | Implementation     | JDK | Submitter     | Notes     |
|---|-----------------|--------------------|-----|---------------|-----------|
|   | 00:02.198 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java)| 21.0.2-graal | [Thomas Wuerthinger](https://github.com/thomaswue) | GraalVM native binary, uses Unsafe |
gunnarmorling commented 7 months ago

Btw. this might be interesting to you: I am seeing this exception on two other entries:

Fatal error: Failed to leave the current IsolateThread context and to detach the current thread. (code 12)

For instance here. Any idea what might be the cause and how to avoid it?

tivrfoa commented 7 months ago

> Btw. this might be interesting to you: I am seeing this exception on two other entries:
>
> Fatal error: Failed to leave the current IsolateThread context and to detach the current thread. (code 12)
>
> For instance here. Any idea what might be the cause and how to avoid it?

It seems it comes from here:

https://github.com/oracle/graal/blob/9d05067f874b7c00f75b7f62cfea4a64eca93e33/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/c/function/CEntryPointSetup.java#L82

    public static final class LeaveEpilogue implements CEntryPointOptions.Epilogue {
        private static final CGlobalData<CCharPointer> errorMessage = CGlobalDataFactory.createCString(
                        "Failed to leave the current IsolateThread context.");

        @Uninterruptible(reason = "epilogue")
        static void leave() {
            int code = CEntryPointActions.leave();
            if (code != CEntryPointErrors.NO_ERROR) {
                CEntryPointActions.failFatally(code, errorMessage.get());
            }
        }
    }

I'll try to find the cause.

Related problem: https://github.com/quarkusio/quarkus/issues/6606#issuecomment-586340877

tivrfoa commented 7 months ago

> . (code 12)

Maybe @peter-hofer can help =)

thomaswue commented 7 months ago

Yes, this looks like a potential native image bug. I will check with the team.

tivrfoa commented 7 months ago

> and to detach the current thread.

It's actually here (GitHub text search is not that good ...):

https://github.com/oracle/graal/blob/9d05067f874b7c00f75b7f62cfea4a64eca93e33/substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/c/function/CEntryPointSetup.java#L95

    public static final class LeaveDetachThreadEpilogue implements CEntryPointOptions.Epilogue {
        private static final CGlobalData<CCharPointer> errorMessage = CGlobalDataFactory.createCString(
                        "Failed to leave the current IsolateThread context and to detach the current thread.");

        @Uninterruptible(reason = "epilogue")
        public static void leave() {
            int code = CEntryPointActions.leaveDetachThread();
            if (code != 0) {
                CEntryPointActions.failFatally(code, errorMessage.get());
            }
        }
    }
tivrfoa commented 7 months ago

> Yes, this looks like a potential native image bug. I will check with the team.

There's an open issue from Mar 24, 2021 for that same error message and code: https://github.com/oracle/graal/issues/3305

tivrfoa commented 7 months ago

> Btw. this might be interesting to you: I am seeing this exception on two other entries:
>
> Fatal error: Failed to leave the current IsolateThread context and to detach the current thread. (code 12)
>
> For instance here. Any idea what might be the cause and how to avoid it?

There are at least three entries where this error happened:

https://github.com/gunnarmorling/1brc/pull/479

https://github.com/gunnarmorling/1brc/pull/510

https://github.com/gunnarmorling/1brc/pull/575

thomaswue commented 7 months ago

@gunnarmorling Could you please evaluate this new version? It consistently beats the top entry now on my machine, so with a bit of luck, it could get below 2s ;-).

gunnarmorling commented 7 months ago

It is indeed below 2 sec, and by quite a bit in fact! Could you perhaps describe on a high level what is making this huge difference at this point?

Benchmark 1: timeout -v 300 ./calculate_average_thomaswue.sh 2>&1
  Time (mean ± σ):      1.890 s ±  0.011 s    [User: 0.002 s, System: 0.004 s]
  Range (min … max):    1.859 s …  1.898 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  thomaswue: trimmed mean 1.89342727442, raw times 1.85925432142,1.8941132214199998,1.8928165554199998,1.89327190942,1.89163260842,1.89773446642,1.89233219542,1.89314723642,1.89674720142,1.89335726742

Leaderboard

| # | Result (m:s.ms) | Implementation     | JDK | Submitter     | Notes     |
|---|-----------------|--------------------|-----|---------------|-----------|
|   | 00:01.893 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java)| 21.0.2-graal | [Thomas Wuerthinger](https://github.com/thomaswue) | GraalVM native binary, uses Unsafe |
thomaswue commented 7 months ago

Excellent! Below 2s. Mission accomplished ;-).

So one main change is to adopt processing segments in 2MB blocks to make sure all threads finish at the same time (otherwise other threads might sit idle while one thread finishes its part). This is inspired by @artsiomkorzun's solution. Looking at the perf flamegraph of his solution showed the benefit in wall clock time.
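The 2 MB scheme can be sketched roughly like this (a hypothetical simplification; the actual entry reads the memory-mapped file via Unsafe). Each worker thread atomically claims the next small segment until the file is exhausted, so no thread sits idle while another finishes a large pre-assigned chunk:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of work-stealing over 2 MB segments: workers repeatedly claim the
// next segment from a shared cursor instead of getting one big chunk up front.
public class SegmentClaimSketch {
    static final long SEGMENT_SIZE = 2 * 1024 * 1024; // 2 MB blocks

    public static void main(String[] args) throws InterruptedException {
        long fileSize = 10 * SEGMENT_SIZE + 12345; // pretend file length
        AtomicLong cursor = new AtomicLong(0);
        AtomicLong bytesProcessed = new AtomicLong(0);

        Runnable worker = () -> {
            while (true) {
                long start = cursor.getAndAdd(SEGMENT_SIZE);
                if (start >= fileSize) {
                    break; // no segments left to claim
                }
                long end = Math.min(start + SEGMENT_SIZE, fileSize);
                // the real code would scan the rows in [start, end) here
                bytesProcessed.addAndGet(end - start);
            }
        };

        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(worker);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        // every byte is claimed exactly once across all workers
        System.out.println(bytesProcessed.get() == fileSize);
    }
}
```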

The other major change is to process 3 scanners in parallel in the same thread. This increases instructions per cycle, because there is more straight-line code that the Intel processor can then optimize better. The machine code generation is also more efficient, because there are quite some 64-bit long constants involved (in finding delimiters) and those are better shared if the same operation is done 3 at a time.
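The 64-bit-constant delimiter search mentioned here is the classic SWAR zero-byte trick. A minimal sketch (hypothetical code, not the actual entry) of locating the `;` delimiter in an 8-byte word, applied to three independent words the way three interleaved scanners would while sharing the same constants:

```java
// SWAR search for ';' in a little-endian 8-byte word. The three long
// constants are loaded once and reused by all three "scanner" lanes.
public class SwarDelimiterSketch {
    static final long SEMICOLONS = 0x3B3B3B3B3B3B3B3BL; // ';' in every byte
    static final long ONES = 0x0101010101010101L;
    static final long HIGH_BITS = 0x8080808080808080L;

    // Returns a word with the high bit set in each byte equal to ';'.
    static long semicolonMatch(long word) {
        long diff = word ^ SEMICOLONS;            // zero byte where ';' was
        return (diff - ONES) & ~diff & HIGH_BITS; // classic zero-byte test
    }

    // Byte index (0-7, little-endian) of the first ';', or 8 if none.
    static int firstSemicolon(long word) {
        return Long.numberOfTrailingZeros(semicolonMatch(word)) >>> 3;
    }

    static long bytesToLong(byte[] b) {
        long w = 0;
        for (int i = 7; i >= 0; i--) {
            w = (w << 8) | (b[i] & 0xFF);
        }
        return w;
    }

    public static void main(String[] args) {
        // three scanners, each looking at a different 8-byte chunk
        long a = bytesToLong("Abc;12.3".getBytes());
        long b = bytesToLong("Hamburg;".getBytes());
        long c = bytesToLong("12345678".getBytes());
        System.out.println(firstSemicolon(a)); // 3
        System.out.println(firstSemicolon(b)); // 7
        System.out.println(firstSemicolon(c)); // 8 (no delimiter)
    }
}
```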

gunnarmorling commented 7 months ago

Awesome, really cool! Thx for the detailed info.

tivrfoa commented 7 months ago

> Excellent! Below 2s. Mission accomplished ;-).
>
> So one main change is to adopt processing segments in 2MB blocks to make sure all threads finish at the same time (otherwise other threads might sit idle while one thread finishes its part). This is inspired by @artsiomkorzun's solution. Looking at the perf flamegraph of his solution showed the benefit in wall clock time.
>
> The other major change is to process 3 scanners in parallel in the same thread. This increases instructions per cycle, because there is more straight-line code that the Intel processor can then optimize better. The machine code generation is also more efficient, because there are quite some 64-bit long constants involved (in finding delimiters) and those are better shared if the same operation is done 3 at a time.

Beautiful solution Thomas! :star_struck: :star_struck:

I think the main speedup came from the smaller segments, and from doing that inside the hot loop, because you still get the huge benefit of reusing the same `Result[] results = new Result[1 << 17];`

I'm not sure how much performance was gained from unrolling (splitting) the scanners. But you for sure know!!! =)
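A rough sketch of what such a fixed-size table with linear probing looks like (a hypothetical simplification; the actual entry hashes raw name bytes and accesses memory via Unsafe). Because every segment a thread processes feeds the same pre-allocated array, shrinking the segments adds no per-segment setup cost:

```java
// Fixed-size open-addressing table: 1 << 17 slots, linear probing on collision.
public class HashTableSketch {
    static final int TABLE_SIZE = 1 << 17; // 131072 slots, power of two
    static final int MASK = TABLE_SIZE - 1;

    static final class Result {
        final String name;
        int count;
        Result(String name) { this.name = name; }
    }

    final Result[] results = new Result[TABLE_SIZE];

    // Finds (or creates) the entry for a station name.
    Result lookup(String name) {
        int index = name.hashCode() & MASK;
        while (true) {
            Result r = results[index];
            if (r == null) {
                r = new Result(name);
                results[index] = r;
                return r;
            }
            if (r.name.equals(name)) {
                return r;
            }
            index = (index + 1) & MASK; // linear probing
        }
    }

    public static void main(String[] args) {
        HashTableSketch table = new HashTableSketch();
        table.lookup("Hamburg").count++;
        table.lookup("Hamburg").count++;
        table.lookup("Oslo").count++;
        System.out.println(table.lookup("Hamburg").count); // 2
        System.out.println(table.lookup("Oslo").count);    // 1
    }
}
```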

thomaswue commented 7 months ago

It is a bit difficult to isolate (and might be different on the benchmark machine), but of the total 15% improvement over the previous entry, it seems to be roughly an equal contribution from each of those changes. Doing 3 in parallel definitely helps by more than 5% on my machine.