manodeep / Corrfunc

⚡️⚡️⚡️Blazing fast correlation functions on the CPU.
https://corrfunc.readthedocs.io
MIT License

Parallel gridlink #239

Closed · lgarrison closed this 3 years ago

lgarrison commented 3 years ago

Hey @manodeep, here's a stab at a parallel implementation of gridlink! It was motivated by a recent example that popped up in my research where the actual pair counting was very fast (due to low sample density and low rmax), but gridlink took about the same amount of time as the counting, due to the large Nmesh, and thus large number of mallocs and single-threaded particle writes. This version does one malloc for the whole grid (in detail, one malloc for each X/Y/Z/W column of particle data), calculates cell assignments in parallel, then uses that information to do parallel writes of the particle data into the appropriate cells.

Because this parallelization requires two reads of the particle data instead of one, I expected that the new version running with one thread might be slower than the old single-threaded code, especially at high particles-per-cell where the malloc overhead is lower. However, that does not appear to be the case in benchmarks. The reads are probably cheap compared to the writes, and it's possible that there are vectorization opportunities, e.g. in the computation of the cell index, that are present in this version that were not present in the old version.
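
Roughly, the parallel write phase looks something like the sketch below (placeholder names, a single coordinate column, and simplified bookkeeping; the code in the PR differs in the details):

```c
/* Simplified sketch of the single-allocation gridlink write pass
 * (placeholder names, one coordinate column only; not the PR's literal code).
 *
 * Inputs assumed to come from a first parallel pass over the particles:
 *   icell[i]      - cell index of particle i
 *   cell_start[c] - exclusive cumulative sum of the per-cell counts
 * Xgrid is a single allocation of length N holding all cells' particles. */
#include <stdint.h>
#include <stdlib.h>
#include <omp.h>

void scatter_sketch(const double *X, int64_t N, int64_t totncells,
                    const int64_t *icell, const int64_t *cell_start,
                    double *Xgrid)
{
    #pragma omp parallel
    {
        const int nthreads = omp_get_num_threads();
        const int tid = omp_get_thread_num();
        /* Each thread owns a contiguous, disjoint range of cells... */
        const int64_t c0 = tid * totncells / nthreads;
        const int64_t c1 = (tid + 1) * totncells / nthreads;
        int64_t *offset = calloc(c1 - c0, sizeof(*offset));

        /* ...reads the whole particle list, but only writes the particles
         * whose cell falls in its own range, so no two threads ever touch
         * the same output slot. */
        for (int64_t i = 0; i < N; i++) {
            const int64_t c = icell[i];
            if (c < c0 || c >= c1) continue;
            Xgrid[cell_start[c] + offset[c - c0]++] = X[i];
        }
        free(offset);
    }
}
```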

Old:

   6.0 ms (Ngrid   10, N     100000, ppc    100,  1 threads)
  54.4 ms (Ngrid   10, N    1000000, ppc   1000,  1 threads)
 556.5 ms (Ngrid   10, N   10000000, ppc  10000,  1 threads)
5166.1 ms (Ngrid   10, N  100000000, ppc 100000,  1 threads)
  68.9 ms (Ngrid   20, N    1000000, ppc    125,  1 threads)
 137.3 ms (Ngrid   40, N    1000000, ppc     15,  1 threads)
 313.0 ms (Ngrid  100, N    1000000, ppc      1,  1 threads)

New:

   4.4 ms (Ngrid   10, N     100000, ppc    100,  1 threads)
  34.5 ms (Ngrid   10, N    1000000, ppc   1000,  1 threads)
 338.9 ms (Ngrid   10, N   10000000, ppc  10000,  1 threads)
3472.7 ms (Ngrid   10, N  100000000, ppc 100000,  1 threads)
  46.6 ms (Ngrid   20, N    1000000, ppc    125,  1 threads)
  75.2 ms (Ngrid   40, N    1000000, ppc     15,  1 threads)
 150.1 ms (Ngrid  100, N    1000000, ppc      1,  1 threads)

New, parallel:

   2.3 ms (Ngrid   10, N     100000, ppc    100, 12 threads)
   8.5 ms (Ngrid   10, N    1000000, ppc   1000, 12 threads)
  70.5 ms (Ngrid   10, N   10000000, ppc  10000, 12 threads)
 721.8 ms (Ngrid   10, N  100000000, ppc 100000, 12 threads)
   9.9 ms (Ngrid   20, N    1000000, ppc    125, 12 threads)
  11.0 ms (Ngrid   40, N    1000000, ppc     15, 12 threads)
  32.8 ms (Ngrid  100, N    1000000, ppc      1, 12 threads)

These are direct benchmarks of gridlink_double() (i.e. calling it directly from C; mean of 10 runs) on a 12-core, 2-socket Intel machine. No sort_on_z, since that part is unchanged between the old and new versions. The thread scaling is far from perfect, but it's a nice little speedup. I expect this will be most helpful in latency-sensitive applications like HOD evaluations. The benchmark data is uniform random, but I've spot-checked with clustered data, which looks similar (and may be an even bigger win, because no more reallocs).

The tests all pass, and I've checked with and without OpenMP and copy_particles. Could you take a look and let me know what you think?

lgarrison commented 3 years ago

I'll note that I think an even faster version would be possible if the threads were doing sequential writes and random reads, instead of random writes and sequential reads. But the sequential write version requires a fast, parallel argsort of the cell indices. I couldn't find one readily available in C, and I think the current version is fast enough that gridlink is unlikely to be a bottleneck.

Also, astropy bot appears to be afk...?

manodeep commented 3 years ago

@lgarrison Think we are almost at a parallel counting sort - that might be a cleaner + faster solution. I will take a look regardless - changing the serial-ness of gridlink would be great. (I have long had goals of creating an OpenMP version of sglib.h, but the amount of macro gymnastics required to support both OpenMP and non-OpenMP has been too big of a hurdle!)

And yup - think astropy-bot is deprecated/non-functional - at least based on the (now archived) astropy-bot channel on the astropy slack. We might have to set up a different solution through GH Actions.

lgarrison commented 3 years ago

Thanks! This approach indeed might just be a hack until we have time to plan a better version.

Does a counting sort actually work for this problem? My understanding was that counting sort worked by counting integer occurrences, sort of "compressing" the information into a histogram, and therefore losing the original indices. I guess a radix argsort might work since that preserves the original keys in the radix buckets, but I couldn't find a good parallel radix argsort in C. Definitely worth exploring though; we could write one ourselves.
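
If we did write one ourselves, a serial sketch of an index-preserving, counting-style argsort over the cell indices might look something like this (hypothetical, not part of this PR; the hard part would be making it fast and parallel):

```c
/* Rough serial sketch of a counting-style argsort over cell indices
 * (hypothetical; not part of this PR). Fills `order` with the particle
 * indices sorted by cell, keeping the original indices available. */
#include <stdint.h>
#include <stdlib.h>

void counting_argsort(const int64_t *icell, int64_t N, int64_t ncells,
                      int64_t *order /* length N */)
{
    int64_t *count = calloc(ncells + 1, sizeof(*count));
    for (int64_t i = 0; i < N; i++)
        count[icell[i] + 1]++;           /* histogram of cell occupancy       */
    for (int64_t c = 0; c < ncells; c++)
        count[c + 1] += count[c];        /* prefix sum -> start of each cell  */
    for (int64_t i = 0; i < N; i++)
        order[count[icell[i]]++] = i;    /* stable placement keeps indices    */
    free(count);
}
```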

lgarrison commented 3 years ago

It seems that astropy bot may technically still be alive, but is in limbo porting between Heroku and GitHub Actions, based on https://github.com/astropy/astropy-bot/issues/122 and https://github.com/astropy/astropy-bot/issues/125

manodeep commented 3 years ago

@lgarrison Can you check what happens if you change the sort to a heap-sort and enable parallelism through tasks (by editing sglib.h)? Think if you add #pragma omp task here and here, a #pragma omp parallel and a #pragma omp single at the top of the first invocation of the heapsort, and a #pragma omp taskwait here, then it might all work out.
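
The general shape I have in mind is something like the sketch below, shown on a plain recursive merge sort purely to illustrate the pragmas (sglib's heapsort macros are obviously a different beast):

```c
/* Illustration of the omp parallel/single/task/taskwait pattern only;
 * not sglib's heapsort macros. `tmp` is scratch space of length n. */
#include <omp.h>

static void msort_task(double *a, double *tmp, long lo, long hi)
{
    if (hi - lo < 2) return;
    const long mid = lo + (hi - lo) / 2;
    #pragma omp task if(hi - lo > 10000)   /* sort the two halves as tasks */
    msort_task(a, tmp, lo, mid);
    #pragma omp task if(hi - lo > 10000)
    msort_task(a, tmp, mid, hi);
    #pragma omp taskwait                   /* both halves must finish before merging */

    long i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    for (k = lo; k < hi; k++) a[k] = tmp[k];
}

void msort(double *a, double *tmp, long n)
{
    #pragma omp parallel
    {
        #pragma omp single                 /* one thread spawns the task tree */
        msort_task(a, tmp, 0, n);
    }
}
```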

lgarrison commented 3 years ago

Just tried it; doesn't seem to sort correctly, even single threaded, unless I put a taskwait inside both loops. Looking at the HEAP_DOWN source, I think the different tasks/loop iterations are racing with each other (e.g. the _l_ variable in some iterations will be equal to the _i_ variable in others).

Single threaded (without any OpenMP), the heap sort runs at about 200 ms for 1e6 particles (not counting the time to compute the indices). Given that the implementation in this PR runs at 35 to 150 ms with 1 thread on 1e6 particles (depending on ppc), I think a sorting-based gridlink will have to use a really, really fast sort. Maybe a radix sort will be fast enough.

It would be interesting to test the speed of a radix argsort versus a radix sort that swaps particles as it goes. My guess is that the argsort version will be faster, just because the data movement is smaller during the sort, and the "heavy" movement is saved until the end.

manodeep commented 3 years ago

@lgarrison Took a quick look through the updates - this is a great effort! Think I mostly understand what's going on and left some comments where I was not sure. I can totally see that the overall data movement is reduced and that this will be faster for most cases. I am assuming that you have checked that the parallel_cumsum implementation works for the edge cases Nthreads > totncells, Nthreads < totncells, and Nthreads == totncells.

Ohh and did you check that there are no memory leaks/memory access violations with the -fsanitize family of flags?

lgarrison commented 3 years ago

I've applied the changes from above and added clarifying comments in places. I double-checked parallel_cumsum for all the cases you mentioned, and it looks good (in detail, the function limits nthreads < ncell so that the algorithm never has to worry about empty threads). I checked -fsanitize=address -fsanitize=leak -fsanitize=undefined, and it passes with no errors.
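
For reference, the general two-pass pattern (per-thread chunk sums, a small scan over the thread totals, then a per-thread fix-up pass) looks roughly like this sketch; it's illustrative only, not a line-for-line copy of the PR's parallel_cumsum:

```c
/* Sketch of a generic two-pass parallel exclusive cumsum (illustrative only,
 * not the PR's actual parallel_cumsum). Works in place (out == in) because
 * in[i] is read before out[i] is written and a barrier separates the passes. */
#include <stdint.h>
#include <stdlib.h>
#include <omp.h>

void parallel_cumsum_sketch(const int64_t *in, int64_t n, int64_t *out)
{
    int nthreads = omp_get_max_threads();
    if (nthreads > n) nthreads = (n > 0) ? (int) n : 1;  /* avoid empty threads */
    int64_t *tsum = calloc(nthreads, sizeof(*tsum));

    #pragma omp parallel num_threads(nthreads)
    {
        const int t = omp_get_thread_num();
        const int64_t lo = t * n / nthreads;
        const int64_t hi = (t + 1) * n / nthreads;

        /* Pass 1: each thread sums its own contiguous chunk */
        int64_t s = 0;
        for (int64_t i = lo; i < hi; i++) s += in[i];
        tsum[t] = s;
        #pragma omp barrier

        /* Small scan over the per-thread totals gives this thread's base offset */
        int64_t base = 0;
        for (int tt = 0; tt < t; tt++) base += tsum[tt];

        /* Pass 2: exclusive scan within the chunk, shifted by the base */
        for (int64_t i = lo; i < hi; i++) {
            const int64_t v = in[i];
            out[i] = base;
            base += v;
        }
    }
    free(tsum);
}
```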

Ready for merge? I'll add a changelog entry.

lgarrison commented 3 years ago

(sorry, the merge commits got a little messy at the end there, might want to squash and merge)

manodeep commented 3 years ago

@lgarrison There are a bunch of things that require my immediate attention. I am planning to come back to this later in the week and give it one final checkup before merging. I did have a few related ideas:

  1. Does the #pragma omp simd for work correctly for the loop in parallel_cumsum? I have never used that pragma but seems like it might be useful
  2. That loop over the entire particle list by all threads is bothering me - do you think there is a way to create thread-local lower and upper bounds for the particles within the thread-specific range of cell-indices during the first pass over all particles?
  3. Is it feasible to extend the parallel gridlink to DDtheta? (Think all other pair-counters use the vanilla gridlink but I could be wrong)

Thanks again!

lgarrison commented 3 years ago

> @lgarrison There are a bunch of things that require my immediate attention. I am planning to come back to this later in the week and give it one final checkup before merging. I did have a few related ideas:

Thanks, no worries!

> 1. Does the `#pragma omp simd for` work correctly for the loop in `parallel_cumsum`? I have never used that pragma but seems like it might be useful

Not sure! I can give it a try. It may be hard to measure a difference though; that part of the code is really fast.

> 2. That loop over the entire particle list by all threads is bothering me - do you think there is a way to create thread-local lower and upper bounds for the particles within the thread-specific range of cell-indices during the first pass over all particles?

Yes, I agree, it's a very unusual way to go about parallelization! It was not at all obvious to me that it would work as well as it seems to. I think it's just a hack until we can write a sorting-based method, which would be much more elegant. Since that's the case, I'm inclined not to optimize it further.

> 3. Is it feasible to extend the parallel gridlink to `DDtheta`? (Think all other pair-counters use the vanilla gridlink but I could be wrong)

It looks like gridlink_mocks_DOUBLE is used by mocks/DDsmu and mocks/DDrppi, and gridlink_mocks_theta_ra_dec_DOUBLE is used by mocks/DDtheta. I think I am out of time right now to port this over to the other gridlink implementations; we may be happier waiting until we come up with a sorting-based method in any case. I will open an issue to track this.

> Thanks again!

No problem! Was fun to write.

manodeep commented 3 years ago

And will you please run the tests to check for invalid memory usage with the following:

make distclean
export CFLAGS="-fsanitize=leak -fsanitize=undefined -fsanitize=bounds -fsanitize=address -fsanitize-address-use-after-scope -fsanitize-undefined-trap-on-error -fstack-protector-all"
export CLINK="-fsanitize=leak -fsanitize=undefined -fsanitize=bounds -fsanitize=address -fsanitize-address-use-after-scope -fsanitize-undefined-trap-on-error -fstack-protector-all"
make tests

If you are feeling particularly cautious then feel free to add -DINTEGRATION_TESTS to CFLAGS and CLINK (heads-up: such tests take 2+ days to complete on a single node on our local supercomputer)

lgarrison commented 3 years ago

Just tested; the core theory and mock tests look clean under the sanitizers.

The Python bindings are proving harder to test with sanitizers. They complain about libasan not being loaded first, and when I fix that with LD_PRELOAD, I end up with a bunch of Python-level errors like:

Direct leak of 425383 byte(s) in 292 object(s) allocated from:
    #0 0x155554a2655f in __interceptor_malloc /home/conda/feedstock_root/build_artifacts/ctng-compilers_1601682258120/work/.build/x86_64-conda-linux-gnu/src/gcc/libsanitizer/asan/asan_malloc_linux.cc:144
    #1 0x555555679551 in _PyObject_Malloc /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Objects/obmalloc.c:1628
    #2 0x555555679551 in PyObject_Malloc /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Objects/obmalloc.c:685

They all look like Python-internal things; I don't see any traces related to Corrfunc. I tested the master branch, and it shows the same errors. So I think the errors are unrelated to this PR.

manodeep commented 3 years ago

Ohh I have never run the memcheck on the python API. That's not great that there are memory leaks (but obviously not relevant to this PR).

@lgarrison All good to squash-merge the PR?

lgarrison commented 3 years ago

Yes, ready!

pllim commented 3 years ago

Re: astropy-bot -- Hello! Please open issues at https://github.com/astropy/astropy-bot for how the bot isn't working for you. I do notice that it is flaky in astropy but it still works most of the time...

lgarrison commented 3 years ago

@pllim Astropy bot did eventually kick in; I see the checks as passing on this PR. I don't see the usual comment from the bot in this thread, but maybe that's because it happened to be down when we opened the PR. We'll keep an eye out on the next PR!

manodeep commented 3 years ago

@lgarrison The gridlink-parallel branch can now be deleted, right?

lgarrison commented 3 years ago

Yep! I'll go ahead and do that.