Simplify station dedupe

gunnarmorling / 1brc

1️⃣🐝🏎️ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java

https://www.morling.dev/blog/one-billion-row-challenge/

Apache License 2.0

Simplify station dedupe #589

Closed ianopolousfast closed 7 months ago

ianopolousfast commented 7 months ago

Simplify station dedupe to reduce branches. Use the subprocess trick to avoid the memory unmap cost.
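For readers unfamiliar with the term, "station dedupe" here means mapping each station name seen in the input to a single table slot. The following is a minimal sketch of one low-branch way to do that, assuming an open-addressing table with linear probing; the class `StationTable`, its capacity, and its hash function are hypothetical and are not taken from this PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch, not the code from this PR.
final class StationTable {
    // Power of two so the hash can be masked instead of taken modulo.
    // Assumed to stay well above the number of distinct stations;
    // a completely full table would make dedupe() loop forever.
    private static final int CAPACITY = 1 << 14;
    private static final int MASK = CAPACITY - 1;

    private final byte[][] names = new byte[CAPACITY][];
    private int size;

    /** Returns the slot for this name, inserting a copy of it on first sight. */
    int dedupe(byte[] buf, int offset, int length) {
        int slot = hash(buf, offset, length) & MASK;
        while (true) {
            byte[] existing = names[slot];
            if (existing == null) {
                // Empty slot: first time we see this station.
                names[slot] = Arrays.copyOfRange(buf, offset, offset + length);
                size++;
                return slot;
            }
            if (Arrays.equals(existing, 0, existing.length, buf, offset, offset + length)) {
                return slot; // Same bytes: already known.
            }
            // Hash collision with a different name: probe the next slot.
            slot = (slot + 1) & MASK;
        }
    }

    // Simple polynomial hash with a final mix; an assumption for illustration.
    private static int hash(byte[] buf, int offset, int length) {
        int h = 0;
        for (int i = offset; i < offset + length; i++) {
            h = 31 * h + buf[i];
        }
        return h ^ (h >>> 15);
    }

    int size() {
        return size;
    }

    String name(int slot) {
        return new String(names[slot], StandardCharsets.UTF_8);
    }
}
```

The returned slot can index parallel arrays of min, max, sum, and count, so the hot loop has only the empty-slot check and the byte comparison as data-dependent branches.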

Check List:

[X] Tests pass (./test.sh <username> shows no differences between expected and actual outputs)
[X] All formatting changes by the build are committed
[X] Your launch script is named calculate_average_<username>.sh (make sure to match casing of your GH user name) and is executable
[X] Output matches that of calculate_average_baseline.sh
[X] For new entries, or after substantial changes: When implementing custom hash structures, please point to where you deal with hash collisions (line number)
Execution time: 13.8s
Execution time of reference implementation: 288s

gunnarmorling commented 7 months ago

Not sure why, but it's a bit slower (ran it multiple times, same result):

Benchmark 1: timeout -v 300 ./calculate_average_ianopolousfast.sh 2>&1
  Time (mean ± σ):      5.563 s ±  0.141 s    [User: 39.906 s, System: 0.755 s]
  Range (min … max):    5.351 s …  5.692 s    5 runs

Summary
  ianopolousfast: trimmed mean 5.59115306116, raw times 5.69155619016,5.35070126916,5.598124665159999,5.50029568816,5.67503883016

Leaderboard

| # | Result (m:s.ms) | Implementation     | JDK | Submitter     | Notes     |
|---|-----------------|--------------------|-----|---------------|-----------|
|   | 00:05.591 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_ianopolousfast.java)| 21.0.1-open | [Dr Ian Preston](https://github.com/ianopolousfast) |  |

ianopolousfast commented 7 months ago

Hmm, maybe the subprocess trick is actually a net negative without Graal AOT, as you pay the JVM startup cost twice. I've removed it here now. Could you try again, @gunnarmorling? Thank you.
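As a rough illustration of the subprocess trick and why it costs two JVM startups, here is a minimal sketch. It assumes the parent relaunches its own command line with a marker argument and exits as soon as the worker closes its output. The class name `SubprocessTrick`, the `--worker` flag, and the placeholder result line are hypothetical, and `ProcessHandle.Info.arguments()` may be empty on some platforms.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the general pattern, not the code removed from this PR.
final class SubprocessTrick {
    static final String WORKER_FLAG = "--worker"; // assumed marker argument

    /** Builds the command that relaunches the same JVM invocation as a worker. */
    static List<String> workerCommand(String javaBinary, String[] originalArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add(javaBinary);
        cmd.addAll(List.of(originalArgs)); // JVM flags, main class, program args
        cmd.add(WORKER_FLAG);
        return cmd;
    }

    public static void main(String[] args) throws IOException {
        boolean isWorker = args.length > 0 && WORKER_FLAG.equals(args[args.length - 1]);
        if (!isWorker) {
            // Parent: start a second JVM (this is the second startup cost).
            ProcessHandle.Info info = ProcessHandle.current().info();
            List<String> cmd = workerCommand(
                    info.command().orElseThrow(),
                    info.arguments().orElse(new String[0]));
            Process worker = new ProcessBuilder(cmd)
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            // Copy the worker's output until it closes its stdout, then return
            // without waiting for the worker process itself to exit.
            worker.getInputStream().transferTo(System.out);
            return;
        }
        // Worker: the real aggregation over the memory-mapped file would go here.
        System.out.println("{placeholder result}");
        // Closing stdout lets the parent finish; any unmapping and JVM teardown
        // in the worker then happens after the measured run has ended.
        System.out.close();
    }
}
```

With a native image the second process starts almost instantly, so hiding the unmap cost can pay off; on a regular JVM the extra startup may outweigh the saving, which is consistent with the slowdown reported above.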

gunnarmorling commented 7 months ago

00:05.387 now, i.e. within the margin of error of what I can measure on that env. Gonna update the leaderboard with that value.