mar-file-system / marfs

MarFS provides a scalable near-POSIX file system by using one or more POSIX file systems as a scalable metadata component and one or more data stores (object, file, etc) as a scalable data component.
Other
96 stars 27 forks source link

The current hashing scheme for multi-component storage produces bad distributions when the number of pods and capacity units is the same #188

Closed wfvining closed 7 years ago

wfvining commented 7 years ago

When selecting the location for an object in multi-component storage we generate the four-tuple (pod, capacity unit, scatter, start-block) based on a hash of the object ID. Currently this is done simplistically:

pod = H(objid) mod num_pods
capacity unit = H(objid) mod num_cap
...

This leads to a bad distribution when the number of pods is equal to the number of capacity units (or the number of possible values for any of the four elements is the same as any other).

To address this we need to use four independent hash functions, one for each element of the tuple. I propose generating four "random" hash functions from a universal family of functions. We can seed a pseudo-random number generator with the hash of the object id (as it is computed now) and use that generator to select four random functions from the "Multiply-shift" family http://www.diku.dk/summer-school-2014/course-material/mikkel-thorup/hash.pdf_copy.

Alternatively we may be able to simply seed the random number generator with the hash of the object id and generate four random number to use in the four-tuple. I am running some experiments to see which produces better results.

wfvining commented 7 years ago

Simply using C's random() family of functions does not perform as well as generating Multiply-shift hash functions (Hashing 1 billion objects to 1 Million bins gives a standard deviation of 35 over the number of objects in each bin for this approach as opposed to 30 for the multiply-shift method). Added the code to DAL in the above commit. Currently in testing.

wfvining commented 7 years ago

The implementation above is likely not thread safe. There is likely a race on the PRNG for multiple threads writing in FUSE.

wfvining commented 7 years ago

The new hashing scheme as described above using rand_r for thread safety has been merged into master.