seung-lab / zmesh

Marching Cubes & Mesh Simplification on multi-label 3D images.
GNU General Public License v3.0

perf: faster marching cubes #26

Closed · william-silversmith closed this 2 years ago

william-silversmith commented 2 years ago

Replace std::unordered_set with a highly efficient sort of 8 elements. The C++ STL unordered_set uses closed addressing with chaining, which means per-node allocation and pointer chasing, and it takes up a lot of time. In a quick experiment performing MC on connectomics.npy, the following approximate times held:

unordered_set: 29.5 sec (1.00x)
std::sort:     20.0 sec (1.47x)
network sort:  18.0 sec (1.63x)

Thanks to @Vectorized for their Stack Overflow answer on generating arbitrary sorting networks in C++: https://stackoverflow.com/questions/19790522/very-fast-sorting-of-fixed-length-arrays-using-comparator-networks
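
For reference, here is a minimal sketch of what an 8-element compare-exchange network looks like. It uses Batcher's odd-even merge construction (19 compare-exchanges) and is illustrative only -- not necessarily the exact network generated for this PR -- and it assumes the 8 values are the uint64_t labels of a 2x2x2 cell:

```cpp
#include <cstdint>
#include <utility>

// Compare-exchange; compilers typically lower this to a conditional move
// rather than an unpredictable branch.
inline void cswap(uint64_t &a, uint64_t &b) {
  if (a > b) std::swap(a, b);
}

// Batcher odd-even merge network for 8 elements: 19 compare-exchanges in a
// fixed order, with no loops and no data-dependent control flow.
inline void sort8(uint64_t v[8]) {
  cswap(v[0], v[1]); cswap(v[2], v[3]); cswap(v[4], v[5]); cswap(v[6], v[7]);
  cswap(v[0], v[2]); cswap(v[1], v[3]); cswap(v[4], v[6]); cswap(v[5], v[7]);
  cswap(v[1], v[2]); cswap(v[5], v[6]);
  cswap(v[0], v[4]); cswap(v[1], v[5]); cswap(v[2], v[6]); cswap(v[3], v[7]);
  cswap(v[2], v[4]); cswap(v[3], v[5]);
  cswap(v[1], v[2]); cswap(v[3], v[4]); cswap(v[5], v[6]);
}

// After sorting, duplicates are adjacent, so counting (or collecting) the
// distinct labels of a cell is a single linear pass.
inline int count_unique8(uint64_t v[8]) {
  sort8(v);
  int n = 1;
  for (int i = 1; i < 8; i++) {
    n += (v[i] != v[i - 1]);
  }
  return n;
}
```

Once sorted, the distinct labels fall next to each other, so deduplication needs no hashing and no allocation -- which is the role the hash set was playing in this inner loop.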

ranlu commented 2 years ago

Have you tried faster hash tables, like the ones in Google's Abseil or Facebook's Folly?

william-silversmith commented 2 years ago

I considered implementing a specialized hash table as an alternative (open addressing, using bit shifts and masks instead of modulo), but this worked so well that I doubt there would be a significant difference. I still need to profile it, though, which takes a bit of annoying setup.
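
For illustration, a hypothetical sketch of that idea (not code from this PR): a small power-of-two table where the modulo reduction becomes a bitwise AND and collisions are resolved by linear probing.

```cpp
#include <cstdint>

// Hypothetical open-addressing set for a handful of uint64_t labels.
// SIZE is a power of two so the slot index is computed with a bitwise AND
// instead of an integer modulo.
struct SmallLabelSet {
  static constexpr uint32_t SIZE = 16;  // power of two, comfortably > 8 inserts
  static constexpr uint32_t MASK = SIZE - 1;
  uint64_t slots[SIZE];
  bool occupied[SIZE] = {};

  static uint32_t slot_of(uint64_t label) {
    // Fibonacci hashing: multiply by 2^64 / golden ratio, keep high bits.
    return (uint32_t)((label * 0x9E3779B97F4A7C15ULL) >> 32) & MASK;
  }

  void insert(uint64_t label) {
    uint32_t i = slot_of(label);
    while (occupied[i] && slots[i] != label) {
      i = (i + 1) & MASK;  // linear probe; wraparound via mask, no modulo
    }
    occupied[i] = true;
    slots[i] = label;
  }

  bool contains(uint64_t label) const {
    uint32_t i = slot_of(label);
    while (occupied[i]) {
      if (slots[i] == label) return true;
      i = (i + 1) & MASK;
    }
    return false;
  }
};
```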

william-silversmith commented 2 years ago

This is a really bad micro benchmark, but I think the network sort is taking about 125 ns per call. With a 512^3 volume and a chunk size of 2x2x2, that amounts to about 2.1 seconds of run time, which seems about right. A better hash could maybe push that down further, but below 1 µs per operation we're really getting into assembly optimization territory.
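
For reference, the back-of-envelope arithmetic behind that estimate, using the numbers above:

```cpp
// A 512^3 volume processed in 2x2x2 cells gives 512^3 / 8 cells, each paying
// roughly 125 ns for the network sort.
#include <cstdio>

int main() {
  const double cells = 512.0 * 512.0 * 512.0 / 8.0;    // ~16.8 million cells
  const double ns_per_sort = 125.0;                     // estimated per-call cost
  std::printf("%.2f s\n", cells * ns_per_sort * 1e-9);  // prints ~2.10 s
}
```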

ranlu commented 2 years ago

As far as I understand, Abseil already uses SSE to check 16 keys at a time; this is one of their highlights: https://abseil.io/about/design/swisstables#metadata-lookup

william-silversmith commented 2 years ago

This looks like a really interesting library. I would like to avoid adding another big dependency (zi_lib is already a headache). Once I do some profiling, I'll give it a look. My suspicion is that this operation is now about 10% of the total MC runtime -- big enough to be non-trivial, but small enough that we'd need something like a 2x improvement in this operation to see a real difference overall. It's going to be pretty tough to beat the key lookup here, since it's just a linear scan of the array. Insertion is more important.