Open datdenkikniet opened 6 months ago
PR welcome for this change to the generator. Note that I am already using the same measurements.txt file for evaluating all entries, i.e. fairness is ensured.
I've now opened a PR that adds this functionality in #149. It also puts in a little bit of groundwork to hopefully make #125 a bit easier to use generically by hiding/putting `WeatherStation` in its own class.
The evaluation shouldn't use a public test file because that allows the contenders to tightly optimize for the exact keyset in that file. For example, tweaking the hash function to minimize collisions, having special cases for some keys, sizing everything exactly right for the keyset, etc.
If there's concern that some solution may just get unlucky with a given keyset, the winner can be determined by repeating the test with 2-3 different test files. I very much doubt that this would be a factor, given the large keyset size (10,000); more noise can be expected from all the environmental factors on the test machine.
You are absolutely correct, and I agree! I do not think that the current test file should be shared or changed. I am asking for determinism so that it becomes a lot easier to compare and run against the same 1 billion row files that other contestants are using, without having to transmit the entire data file.
Currently it seems that the 1 billion row file is generated with an unseeded random source, so every run produces a different file. Making the generation deterministic (pseudorandom with a fixed seed) would make sharing the 1 billion row file a little easier, since it would always be the same, and would make sure that everyone is running exactly the same test.
Just using a `Random` with a predefined seed to pick out stations, and seeding a `Random` with the hash code of the city name to obtain measurements, should do the trick.
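For illustration, the seeding scheme suggested above could look roughly like this. This is only a minimal sketch and not the actual generator code: the class name `DeterministicGenerator`, the short station list, the seed value `42`, and the uniform value range are all assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch; not the generator that ships with the challenge.
class DeterministicGenerator {

    // Assumed sample of station names; the real list is much larger.
    static final String[] STATIONS = {"Hamburg", "Bulawayo", "Palembang", "Cracow", "Istanbul"};

    // Arbitrary fixed seed (an assumption); any constant works as long as everyone uses the same one.
    static final long STATION_SEED = 42L;

    static List<String> generate(int rows) {
        // One Random with a predefined seed decides which station each row belongs to.
        Random stationPicker = new Random(STATION_SEED);

        // One Random per station, seeded with the hash code of the station name,
        // produces that station's measurements.
        Random[] measurementSources = new Random[STATIONS.length];
        for (int i = 0; i < STATIONS.length; i++) {
            measurementSources[i] = new Random(STATIONS[i].hashCode());
        }

        List<String> lines = new ArrayList<>(rows);
        for (int row = 0; row < rows; row++) {
            int idx = stationPicker.nextInt(STATIONS.length);
            // Assumed distribution: uniform in [-99.9, 99.9), rounded to one decimal place.
            double raw = measurementSources[idx].nextDouble() * 199.8 - 99.9;
            double value = Math.round(raw * 10.0) / 10.0;
            lines.add(STATIONS[idx] + ";" + value);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Prints the same ten lines on every run.
        generate(10).forEach(System.out::println);
    }
}
```

Since the algorithms of both `java.util.Random` and `String.hashCode()` are fixed by the Java API specification, the same seeds should give the same sequence on any conforming JVM, so contestants would only need to agree on the generator version and the row count instead of passing the file around.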