gunnarmorling / 1brc

1️⃣🐝🏎️ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java
https://www.morling.dev/blog/one-billion-row-challenge/
Apache License 2.0
5.69k stars 1.72k forks source link

Entry for abhinav-upadhyay #696

Closed abhinav-upadhyay closed 5 months ago

abhinav-upadhyay commented 5 months ago

Check List:

abhinav-upadhyay commented 5 months ago

So my GitHub username is abhinav-upadhyay. But a file name cannot contain -. The CI job is failing to find the script to run my program. Is there any alternative to get around it?

abhinav-upadhyay commented 5 months ago

So my GitHub username is abhinav-upadhyay. But a file name cannot contain -. The CI job is failing to find the script to run my program. Is there any alternative to get around it?

Realized, I can have file names with -. Just need to use quotes when referring to them on the shell.

gunnarmorling commented 5 months ago

Fails with the 10K key set test (see create_measurements_3.sh):

+ timeout -v 300 ./test.sh abhinav-upadhyay measurements_10K_1B.txt
Validating calculate_average_abhinav-upadhyay.sh -- measurements_10K_1B.txt
410c410
< Fort Porta;-21.8;9.3;42.0
---
> Fort Porta;-21.8;8.6;38.0
617c617
< Ki;-17.4;15.5;47.4
---
> Ki;-17.4;17.6;47.4
1020c1020
< Roch;-22.0;14.3;46.6
---
> Roch;-11.6;18.9;48.0
...
gunnarmorling commented 5 months ago

Still incorrect results for the 10K keyset. Note that we're after the cut-off time, so you'll have two more changes you can make to this PR (see note at the top of the README). If it's not working or valid then, I'll have to close it unfortunately. Updates should be pushed quickly, so I can evaluate all the pending entries. Thx!

gunnarmorling commented 5 months ago

Hey @abhinav-upadhyay, I still believe this does not handle hash collisions correctly. In case of a collision, the actual names of the existing and the incoming entry must be compared. I can't find where this is happening? You can push one more time to this PR before it'll either be merged (if valid) or closed (if invalid). Thx!

abhinav-upadhyay commented 5 months ago

Hey @abhinav-upadhyay, I still believe this does not handle hash collisions correctly. In case of a collision, the actual names of the existing and the incoming entry must be compared. I can't find where this is happening? You can push one more time to this PR before it'll either be merged (if valid) or closed (if invalid). Thx!

I am building the hash using all the bytes of the location name (lines 180-198 in readFile method). This hash is stored into a 64 bit long value. In the test cases and even in case of the 10k keys file (which contained some huge names), it is not resulting in any cases where two names have the same hash. Because of this, I have not added any checks for comparing the names themselves .

gunnarmorling commented 5 months ago

Understood, but absence of collisions in specific key sets doesn't mean that there cannot be collisions at all. As per the rules, you cannot make any assumptions on specific names:

Q: Can I make assumptions on the names of the weather stations showing up in the data set? A: No, while only a fixed set of station names is used by the data set generator, any solution should work with arbitrary UTF-8 station names (for the sake of simplicity, names are guaranteed to contain no ; or \n characters).

So, in order for the entry to be considered valid, it must fall back to comparing names in case of collisions.

gunnarmorling commented 5 months ago

I am going to close this one as I haven't heard back within two days and we are after the cut-off time. Thanks for your efforts nevertheless!