Closed wfvining closed 7 years ago
Simply using C's random()
family of functions does not perform as well as generating Multiply-shift hash functions (Hashing 1 billion objects to 1 Million bins gives a standard deviation of 35 over the number of objects in each bin for this approach as opposed to 30 for the multiply-shift method). Added the code to DAL in the above commit. Currently in testing.
The implementation above is likely not thread safe. There is likely a race on the PRNG for multiple threads writing in FUSE.
The new hashing scheme as described above using rand_r
for thread safety has been merged into master.
When selecting the location for an object in multi-component storage we generate the four-tuple (pod, capacity unit, scatter, start-block) based on a hash of the object ID. Currently this is done simplistically:
This leads to a bad distribution when the number of pods is equal to the number of capacity units (or the number of possible values for any of the four elements is the same as any other).
To address this we need to use four independent hash functions, one for each element of the tuple. I propose generating four "random" hash functions from a universal family of functions. We can seed a pseudo-random number generator with the hash of the object id (as it is computed now) and use that generator to select four random functions from the "Multiply-shift" family http://www.diku.dk/summer-school-2014/course-material/mikkel-thorup/hash.pdf_copy.
Alternatively we may be able to simply seed the random number generator with the hash of the object id and generate four random number to use in the four-tuple. I am running some experiments to see which produces better results.