Initial submission for jonathan_aotearoa.

gunnarmorling / 1brc

1️⃣🐝🏎️ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java

https://www.morling.dev/blog/one-billion-row-challenge/

Apache License 2.0

6.08k stars 1.83k forks

Initial submission for jonathan_aotearoa. #586

Closed: jonathan-aotearoa closed this 7 months ago

jonathan-aotearoa commented 7 months ago

Check List:

[x] Tests pass (./test.sh <username> shows no differences between expected and actual outputs)
[x] All formatting changes by the build are committed
[x] Your launch script is named calculate_average_<username>.sh (make sure to match casing of your GH user name) and is executable
[x] Output matches that of calculate_average_baseline.sh
[x] For new entries, or after substantial changes: When implementing custom hash structures, please point to where you deal with hash collisions (line number)
Execution time:
Execution time of reference implementation:

jonathan-aotearoa commented 7 months ago

Hash collisions are dealt with on line 305 in the Repository findIndex method.

jonathan-aotearoa commented 7 months ago

Execution times on my machine:

./calculate_average_baseline.sh 125.47s user 6.96s system 100% cpu 2:11.29 total
./calculate_average_jonathanaotearoa.sh 17.45s user 2.03s system 766% cpu 2.540 total

jonathan-aotearoa commented 7 months ago

@gunnarmorling - The issue with my failed build appears to be due to my GitHub username having a hyphen in it. I've updated my prepare and calculate shell scripts so they're named in accordance with my GitHub username. I've obviously left the class name unaltered. Hopefully this fixes the build issue.

gunnarmorling commented 7 months ago

One of the tests is failing (see CI for details).

jonathan-aotearoa commented 7 months ago

> One of the tests is failing (see CI for details).

It looks like it was passing locally for me because the measurements files always ended with a new line character, and that no longer appears to be the case.
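
For illustration, here is a minimal sketch of line splitting that tolerates a file with no trailing newline. The class and method names (`LineSplitter`, `lines`) are hypothetical and not taken from the submission, which works on memory-mapped data rather than a `byte[]`.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the submission's code.
final class LineSplitter {

    /** Splits UTF-8 data on '\n', keeping a final line that has no terminator. */
    static List<String> lines(byte[] data) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                out.add(new String(data, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        // If the file does not end with '\n', the last record is still pending here.
        if (start < data.length) {
            out.add(new String(data, start, data.length - start, StandardCharsets.UTF_8));
        }
        return out;
    }
}
```

With this shape, `"A;1.0\nB;2.0"` and `"A;1.0\nB;2.0\n"` yield the same two records, which is the property a test file without a trailing newline exercises.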

jonathan-aotearoa commented 7 months ago

@gunnarmorling - I couldn't reproduce the build error locally. I've made some changes in the hope of fixing whatever the root cause is/was.

jonathan-aotearoa commented 7 months ago

@gunnarmorling - I've found the issue. Please wait for an additional commit before re-running the tests. Thanks!

jonathan-aotearoa commented 7 months ago

@gunnarmorling - It should be good to go now.

jonathan-aotearoa commented 7 months ago

@gunnarmorling - I've updated the collision check to include checking the value of both names in memory. See line 484.
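
As a rough sketch of what comparing both names on a hash match looks like, the following open-addressing table uses linear probing and confirms a match byte by byte. It is illustrative only: `StationTable` is a hypothetical name, only `findIndex` echoes the method mentioned in this thread, and the sketch assumes a power-of-two capacity and a table that never fills up.

```java
import java.util.Arrays;

// Hypothetical illustration, not the submission's Repository class.
final class StationTable {
    private final byte[][] keys;
    private final int mask;

    /** Assumption: capacity is a power of two and exceeds the number of distinct names. */
    StationTable(int capacityPowerOfTwo) {
        keys = new byte[capacityPowerOfTwo][];
        mask = capacityPowerOfTwo - 1;
    }

    /** Returns the slot for the name, claiming an empty slot if the name is new. */
    int findIndex(byte[] name, int hash) {
        int i = hash & mask;
        while (true) {
            byte[] existing = keys[i];
            if (existing == null) {
                keys[i] = name.clone(); // first time this name is seen
                return i;
            }
            // Landing on an occupied slot is not proof of identity:
            // compare the actual name bytes before reusing it.
            if (Arrays.equals(existing, name)) {
                return i;
            }
            i = (i + 1) & mask; // different name in this slot: probe the next one
        }
    }
}
```

Skipping the byte comparison would merge the statistics of two stations that land on the same slot, which shows up as wrong min/mean/max values for a handful of keys.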

gunnarmorling commented 7 months ago

Yepp, this looks better now. Results are still off for the 10K key set test though (see create_measurements3.sh):

Validating calculate_average_jonathan-aotearoa.sh -- measurements_10K_1B.txt
Using native image 'target/CalculateAverage_jonathan-aotearoa_image'. Delete this file to select JVM mode.
88c88
< Alco;-15.0;14.6;46.2
---
> Alco;-14.5;17.3;47.1
94c94
< Alot;-10.2;18.1;48.6
---
> Alot;-12.1;17.1;48.5

...

FAILURE: ./test.sh jonathan-aotearoa measurements_10K_1B.txt failed

gunnarmorling commented 7 months ago

Still seeing a (much smaller) diff for the 10K key set after the latest update:

Validating calculate_average_jonathan-aotearoa.sh -- measurements_10K_1B.txt
Using native image 'target/CalculateAverage_jonathan-aotearoa_image'. Delete this file to select JVM mode.
234c234
< Bālu;-18.3;15.0;50.9
---
> Bālu;-8.4;19.8;52.0
982c982
< Pālg;-23.5;15.0;52.0
---
> Pālg;-23.5;10.2;41.0

FAILURE: ./test.sh jonathan-aotearoa measurements_10K_1B.txt failed
Summary
  jonathan-aotearoa: command failed or output did not match

jonathan-aotearoa commented 7 months ago

@gunnarmorling - Apologies, I had a bug in my rounding code. Should be OK now. I've tested it against 10k versions of measurements2.txt and measurements3.txt
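
The thread does not say what the rounding bug was. As a point of reference only, here is a small sketch of one way to round a mean to one decimal place, assuming readings are accumulated as integer tenths of a degree. The names `MeanRounding`, `meanTenths`, and `mean` are hypothetical.

```java
// Hypothetical illustration, not the submission's rounding code.
final class MeanRounding {

    /** Mean expressed in tenths of a degree; Math.round sends ties toward positive infinity. */
    static long meanTenths(long sumTenths, long count) {
        return Math.round((double) sumTenths / count);
    }

    /** Mean as a double with one fractional digit. */
    static double mean(long sumTenths, long count) {
        return meanTenths(sumTenths, count) / 10.0;
    }
}
```

Negative ties are the case worth checking by hand: a sum of -5 tenths over 2 readings is -2.5 tenths, which `Math.round` turns into -2 (that is, -0.2), not -3.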

jonathan-aotearoa commented 7 months ago

Output from my TestRunner

Testing 'measurements-10000-unique-keys.txt'... Passed
Testing 'measurements-boundaries.txt'... Passed
Testing 'measurements-rounding.txt'... Passed
Testing 'measurements-dot.txt'... Passed
Testing 'measurements-3-10k.txt'... Passed
Testing 'measurements-short.txt'... Passed
Testing 'measurements-10.txt'... Passed
Testing 'measurements-2-10k.txt'... Passed
Testing 'measurements-complex-utf8.txt'... Passed
Testing 'measurements-2.txt'... Passed
Testing 'measurements-shortest.txt'... Passed
Testing 'measurements-3.txt'... Passed
Testing 'measurements-20.txt'... Passed
Testing 'measurements-1.txt'... Passed

gunnarmorling commented 7 months ago

Still failing:

Validating calculate_average_jonathan-aotearoa.sh -- measurements_10K_1B.txt
Using native image 'target/CalculateAverage_jonathan-aotearoa_image'. Delete this file to select JVM mode.
234c234
< Bālu;-18.3;15.0;50.9
---
> Bālu;-8.4;19.8;52.0
982c982
< Pālg;-23.5;15.0;52.0
---
> Pālg;-23.5;10.2;41.0

FAILURE: ./test.sh jonathan-aotearoa measurements_10K_1B.txt failed
Summary
  jonathan-aotearoa: command failed or output did not match

Note how max/min are off, so it can't just be rounding.

jonathan-aotearoa commented 7 months ago

This is really strange. I'm generating data using ./create_measurements3.sh 10000 and comparing my output with the output from CalculateAverage_baseline. I've tried this several times just now and it's always identical.

jonathan-aotearoa commented 7 months ago

Is there another command I can run to generate data similar to the build?

gunnarmorling commented 7 months ago

I reckon it's caused by running the tests with 32 cores on the eval machine which may somehow trip up the partitioning logic.
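
To make the partitioning concern concrete, here is a minimal sketch of splitting input into per-worker chunks whose boundaries always fall just after a newline, so that no record is cut in half regardless of the core count. This is an assumption-laden illustration (`ChunkSplitter` and `boundaries` are hypothetical names, and it works on a `byte[]` rather than a memory-mapped file); it is not the submission's partitioning code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not the submission's partitioning logic.
final class ChunkSplitter {

    /**
     * Returns offsets b[0]=0 < b[1] < ... < b[k]=data.length (k <= workers).
     * Every inner offset points at the first byte after a '\n'.
     */
    static int[] boundaries(byte[] data, int workers) {
        List<Integer> bounds = new ArrayList<>();
        bounds.add(0);
        int target = Math.max(1, data.length / workers); // rough chunk size
        int pos = 0;
        while (bounds.size() < workers) {
            int candidate = pos + target;
            // Slide forward until the previous byte is a newline.
            while (candidate < data.length && data[candidate - 1] != '\n') {
                candidate++;
            }
            if (candidate >= data.length) {
                break; // the remaining bytes belong to the last chunk
            }
            bounds.add(candidate);
            pos = candidate;
        }
        bounds.add(data.length);
        return bounds.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

A useful local check is to run the same input with worker counts from 1 up to well beyond the local core count (for example 32) and confirm that the aggregated output does not change.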


jonathan-aotearoa commented 7 months ago

OK, I see the problem. I still have an issue with my collision checking. I just generated a 200k line file and I got 4 mismatches.

jonathan-aotearoa commented 7 months ago

./create_measurements3.sh 200000 is now passing. I've also tested ./create_measurements3.sh 1000000. Thanks for your patience with this, and thank you for taking the time to set up and administer the challenge. It's been a great learning experience.

gunnarmorling commented 7 months ago

Yepp, all good now. 00:05.077 on the standard key set and the 10K one passes too (in 00:07.499).