This adds the NewParallel() and corresponding functionality to construct the BBHash using multiple goroutines. Sadly, it only gives a marginal performance benefit, when running with multiple goroutines, in the 20-40% range, even on an M2 Max with 12 cores. The problem is that the algorithm itself has poor cache locality (for random keys); it needs to access the bit vector at random positions, and for large key sets, this means reaching across a wide bit vector that won't fit in cache.
This adds the NewParallel() and corresponding functionality to construct the BBHash using multiple goroutines. Sadly, it only gives a marginal performance benefit, when running with multiple goroutines, in the 20-40% range, even on an M2 Max with 12 cores. The problem is that the algorithm itself has poor cache locality (for random keys); it needs to access the bit vector at random positions, and for large key sets, this means reaching across a wide bit vector that won't fit in cache.