william-silversmith closed this 2 years ago
Have you tried faster hash tables, like the ones in Google's Abseil or Facebook's Folly?
I considered implementing a specialized hash table as an alternative (open addressing, using bitshifts/masks instead of modulo) but this worked so well I doubt there will be a significant difference. I need to profile it though, which takes a little bit of annoying setup.
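For reference, here is a minimal sketch of what such a specialized table could look like. It is not code from this PR: the class name `SmallLabelSet`, the capacity of 16, and the multiplicative hash constant are all assumptions chosen for illustration. It uses linear probing (open addressing), takes the top bits of a multiplied key with a shift to pick the home slot, and wraps the probe index with a mask rather than a modulo.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size set for a handful of 64-bit labels.
// Capacity is a power of two so that wraparound is a single AND.
class SmallLabelSet {
public:
    static constexpr std::size_t kCapacity = 16;         // assumed; must be a power of two
    static constexpr std::size_t kMask = kCapacity - 1;  // 0b1111
    static constexpr unsigned kShift = 64 - 4;           // keep the top log2(kCapacity) bits

    // Returns true if the key was newly inserted, false if it was already
    // present (or if the table is full, which cannot happen when at most
    // 8 distinct keys are inserted).
    bool insert(std::uint64_t key) {
        std::size_t i = home_slot(key);
        for (std::size_t probes = 0; probes < kCapacity; ++probes) {
            if (!used_[i]) {
                used_[i] = true;
                keys_[i] = key;
                ++size_;
                return true;
            }
            if (keys_[i] == key) return false;
            i = (i + 1) & kMask;  // linear probing, mask instead of modulo
        }
        return false;
    }

    bool contains(std::uint64_t key) const {
        std::size_t i = home_slot(key);
        for (std::size_t probes = 0; probes < kCapacity; ++probes) {
            if (!used_[i]) return false;
            if (keys_[i] == key) return true;
            i = (i + 1) & kMask;
        }
        return false;
    }

    std::size_t size() const { return size_; }

private:
    // Multiplicative (Fibonacci) hashing: multiply, then shift down to the
    // top 4 bits. The constant is 2^64 divided by the golden ratio.
    static std::size_t home_slot(std::uint64_t key) {
        return static_cast<std::size_t>((key * 0x9E3779B97F4A7C15ULL) >> kShift);
    }

    std::array<std::uint64_t, kCapacity> keys_{};
    std::array<bool, kCapacity> used_{};  // separate occupancy flags so label 0 is a valid key
    std::size_t size_ = 0;
};
```

Whether something like this would beat the approach in this PR is exactly what the profiling would need to show.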
This is a really bad micro benchmark, but I think the network sort is taking 125ns. With 512^3 and a chunk size of 2x2x2, that amounts to 2.1 seconds of run time, which seems about right. A better hash could maybe push that down further, but below 1µs we're really getting into assembly optimization territory.
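Spelling out that estimate, assuming the 512^3 volume is split into non-overlapping 2x2x2 chunks with one 125ns sort per chunk (this is the reading that reproduces the 2.1 second figure):

$$
\frac{512^3}{2 \cdot 2 \cdot 2} = 2^{24} = 16{,}777{,}216 \ \text{chunks}, \qquad 16{,}777{,}216 \times 125\ \text{ns} \approx 2.1\ \text{s}
$$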
As far as I understand, Abseil already uses SSE to check 16 keys at a time. This is one of its highlights: https://abseil.io/about/design/swisstables#metadata-lookup
This looks like a really interesting library. I would like to avoid adding another big dependency (zi_lib is already a headache). Once I do some profiling, I'll give it a look. My suspicion is that this operation is now about 10% of the total MC runtime: big enough to be non-trivial, but small enough that we'd need something like a 2x speedup in this step to see a real improvement overall. It's gonna be pretty tough to beat the key lookup here since it's a linear scan of the array. Insertion is more important.
Replace `std::unordered_set` with a highly efficient sort of 8 elements. The C++ STL `unordered_set` uses closed addressing with chaining, which leads to inefficiencies and takes up a lot of time. In a quick experiment performing MC on connectomics.npy, the following approximate times held:

Thanks to @Vectorized for their Stack Overflow answer on generating arbitrary network sorts in C++: https://stackoverflow.com/questions/19790522/very-fast-sorting-of-fixed-length-arrays-using-comparator-networks
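To illustrate the idea of replacing the set with a fixed-size sort, here is a minimal sketch. It is not the code from this PR: it hard-codes Batcher's odd-even merge sort network for 8 inputs (19 compare-exchange steps) rather than a generated network, and the names `compare_exchange`, `sort8`, and `sorted_unique8` are hypothetical. After sorting, duplicates are adjacent, so a single linear pass with `std::unique` leaves the distinct labels at the front of the array.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

// Compare-exchange: afterwards a <= b.
template <typename T>
inline void compare_exchange(T& a, T& b) {
    if (b < a) std::swap(a, b);
}

// Batcher odd-even merge sort network for exactly 8 elements.
// The sequence of comparisons is fixed and does not depend on the data.
template <typename T>
void sort8(std::array<T, 8>& v) {
    // Sort adjacent pairs.
    compare_exchange(v[0], v[1]); compare_exchange(v[2], v[3]);
    compare_exchange(v[4], v[5]); compare_exchange(v[6], v[7]);
    // Merge the pairs into two sorted runs of 4.
    compare_exchange(v[0], v[2]); compare_exchange(v[1], v[3]);
    compare_exchange(v[4], v[6]); compare_exchange(v[5], v[7]);
    compare_exchange(v[1], v[2]); compare_exchange(v[5], v[6]);
    // Merge the two runs of 4 into one sorted run of 8.
    compare_exchange(v[0], v[4]); compare_exchange(v[1], v[5]);
    compare_exchange(v[2], v[6]); compare_exchange(v[3], v[7]);
    compare_exchange(v[2], v[4]); compare_exchange(v[3], v[5]);
    compare_exchange(v[1], v[2]); compare_exchange(v[3], v[4]);
    compare_exchange(v[5], v[6]);
}

// Sorts the 8 values in place and returns how many distinct values there
// are; the distinct values occupy v[0] .. v[count - 1] afterwards.
template <typename T>
std::size_t sorted_unique8(std::array<T, 8>& v) {
    sort8(v);
    return static_cast<std::size_t>(std::unique(v.begin(), v.end()) - v.begin());
}
```

For example, calling `sorted_unique8` on an array holding the 8 labels of a 2x2x2 block gives the set of labels in that block without any hashing or allocation.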