gunnarmorling / 1brc

1️⃣🐝🏎️ The One Billion Row Challenge -- A fun exploration of how quickly 1B rows from a text file can be aggregated with Java
https://www.morling.dev/blog/one-billion-row-challenge/
Apache License 2.0

Should the 1 billion row file be deterministic? #35

Open datdenkikniet opened 6 months ago

datdenkikniet commented 6 months ago

Currently it seems that the 1 billion row file is generated randomly. Making the generation deterministic (pseudorandom with a fixed seed) would make sharing the file a little easier, since regenerating it would always produce the same content, and would make sure that everyone is running exactly the same test.

Just using a Random with a predefined seed to pick out stations, and seeding a Random with the hash code of the city name to obtain measurements should do the trick.
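A minimal sketch of that idea, for illustration only. It is not the repository's actual generator: the station list, the seed value, the mean and standard deviation, and the class and method names are all assumptions. It relies on the fact that java.util.Random produces the same sequence for the same seed and that String.hashCode() is specified by the language, so the output should be identical on every machine.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of a deterministic measurements generator.
final class DeterministicGenerator {

    // Illustrative station names; the real generator has its own list.
    static final String[] STATIONS = {"Hamburg", "Bulawayo", "Palembang", "St. John's", "Cracow"};

    // Arbitrary fixed seed (an assumption); any constant works as long as everyone uses the same one.
    static final long STATION_SEED = 42L;

    static List<String> generate(int rows) {
        // One seeded Random decides which station each row belongs to.
        Random stationPicker = new Random(STATION_SEED);
        // One Random per station, seeded with the hash code of the station name.
        Map<String, Random> perStation = new HashMap<>();
        List<String> lines = new ArrayList<>(rows);
        for (int i = 0; i < rows; i++) {
            String station = STATIONS[stationPicker.nextInt(STATIONS.length)];
            Random r = perStation.computeIfAbsent(station, s -> new Random(s.hashCode()));
            // Assumed distribution: mean 10.0, standard deviation 10.0, rounded to one decimal.
            double value = Math.round((10.0 + r.nextGaussian() * 10.0) * 10.0) / 10.0;
            lines.add(station + ";" + String.format(Locale.ROOT, "%.1f", value));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Prints the same rows on every run.
        generate(5).forEach(System.out::println);
    }
}
```

Two calls to generate with the same row count return equal lists, so contestants would only need to agree on the seed and the row count instead of transferring the file.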

gunnarmorling commented 6 months ago

PR welcome for this change to the generator. Note that I am already using the same measurements.txt file for evaluating all entries, i.e. fairness is ensured.

datdenkikniet commented 6 months ago

I've now opened PR #149, which adds this functionality. It also lays a little groundwork that should make #125 a bit easier to use generically, by encapsulating WeatherStation in its own class.

mtopolnik commented 6 months ago

The evaluation shouldn't use a public test file, because that allows the contenders to tightly optimize for the exact keyset in that file, for example by tweaking the hash function to minimize collisions, adding special cases for some keys, or sizing everything exactly right for the keyset.
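To illustrate the kind of over-fitting meant here, the following sketch brute-forces a hash multiplier that places every key of a known keyset in its own slot of a small table. It is a hypothetical example, not taken from any submitted entry: the key list, the class name, and the multiplicative hashing scheme are assumptions chosen for brevity.

```java
// Hypothetical sketch: tuning a hash function to one specific, publicly known keyset.
final class KeysetTunedHash {

    // Illustrative keys standing in for the station names of a public test file.
    static final String[] KNOWN_KEYS = {"Hamburg", "Bulawayo", "Palembang", "St. John's", "Cracow"};

    // Multiplicative hash: keep the top `bits` bits of hashCode * multiplier.
    static int slot(String key, int multiplier, int bits) {
        return (key.hashCode() * multiplier) >>> (32 - bits);
    }

    // True if every key lands in a different slot of a table with 2^bits entries.
    static boolean isCollisionFree(String[] keys, int multiplier, int bits) {
        boolean[] used = new boolean[1 << bits];
        for (String key : keys) {
            int s = slot(key, multiplier, bits);
            if (used[s]) {
                return false;
            }
            used[s] = true;
        }
        return true;
    }

    // Tries odd multipliers below maxTries; returns the first collision-free one, or -1 if none is found.
    static int findCollisionFreeMultiplier(String[] keys, int bits, int maxTries) {
        for (int m = 1; m < maxTries; m += 2) {
            if (isCollisionFree(keys, m, bits)) {
                return m;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(findCollisionFreeMultiplier(KNOWN_KEYS, 3, 100_000));
    }
}
```

A multiplier found this way is only guaranteed to be collision-free for the keys it was tuned on; with a different keyset the same table can collide again, which is why a private or rotating test file removes the advantage.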

mtopolnik commented 6 months ago

If there's concern that some solution may just get unlucky with a given keyset, the winner can be determined by repeating the test with 2-3 different test files. I very much doubt that this would be a factor, given the large keyset size (10,000); more noise can be expected from all the environmental factors on the test machine.

datdenkikniet commented 6 months ago

You are absolutely correct, and I agree! I do not think that the current test file should be shared or changed. I am asking for determinism so that it becomes a lot easier to run and compare against the 1 billion row files that other contestants are using, without requiring transmission of the entire data file.