gunnarmorling / 1brc

1️⃣🐝🏎️ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java
https://www.morling.dev/blog/one-billion-row-challenge/
Apache License 2.0
5.69k stars 1.73k forks source link

CA_vaidhy: 1.9 seconds vs 3 minute 20 seconds. (4.7 seconds on 10K) #699

Closed anitasv closed 5 months ago

anitasv commented 5 months ago

Check List:

Collision Handling:

Line 73-86 is the first of the many in the code base. The way it works is by imagining a string a list of longs (8 bytes), since it is not always a multiple of 8, an extra long with padded zeroes are kept called "suffix". So we check 8 bytes until the multiple, and check the extra suffix. This ensures that we don't have to check last few bytes. In case string is lesser than 8 bytes then only suffix check is enough because length 7 will indicate null will be part of suffix, which can't be part of UTF-8. And when length <=8 hash is already suffix because we use xor to create hash. So no need to check suffix as well.

10K keys dataset.

It also matches the output, had to increase hash size to twice we had for better performance.

gunnarmorling commented 5 months ago

Could you rebase and squash this into a single commit off of current main? Thx!

anitasv commented 5 months ago

Opened another PR: https://github.com/gunnarmorling/1brc/pull/708