Better parallelization algorithm for gridlink

lgarrison commented 3 years ago

Gridlink currently uses a simple parallel scheme that does two passes over all particles: one to compute cell indices, and another to write the particles into cells. Each thread is actually still reading all particles in the second pass and is just treating those in its domain to avoid race conditions.

We can do better! With a fast, parallel sort, each thread will only have to read particles in its cell domain, and the writes will be contiguous, rather than semi-random, like they are now.

The current method is still pretty fast (see timings in #239), so the challenge is writing a fast enough sort. A parallel radix sort might be the trick. We likely want this to be an argsort, since the particles can be "heavy" (12 or 24 bytes) and the indices are relatively light (8 or maybe 4 bytes if we economize). Worse, the movement of the weights requires a loop over a number of weights only known at runtime, which could bog down any swaps.

We'll want to port any such algorithm to the mock grindlink(s); the current parallel scheme is only for the theory gridlink.

lgarrison commented 3 years ago

The in-place version will need to use swaps instead of argsort. Or we'll need to figure out how to do an efficient in-place permutation. Either way, it should be parallelized.

manodeep commented 3 years ago

Another thing to do would be to find the final sorted order (i.e., account for sort_on_z) while performing this initial partitioning of particles into cells.

manodeep / Corrfunc

Better parallelization algorithm for gridlink #240