hpc4cmb / toast

Time Ordered Astrophysics Scalable Tools

Parallelization Design Notes #126

Closed tskisner closed 4 years ago

tskisner commented 6 years ago

This issue is to track thoughts / discussion on parallelization in the codebase. TOAST v1.0 was a pure C++ code that used MPI to distribute the data across processes; within each process, the detectors were distributed over threads, and each thread performed some operation for its assigned detectors (simulating noise, computing detector quaternions, computing pixel indices and weights, accumulating a diagonal covariance, etc.).

When we moved to Python for TOAST 2.0, we could no longer thread over detectors effectively, due to the Global Interpreter Lock (GIL). Since that transition, however, we now build an internal C++ library for math operations. That code bypasses the GIL, which opens the door to using threads internally. Our current design, though, loops over detectors inside the Python code and calls into compiled code for each detector. Any one of those compiled operations might be threaded, but many are not computationally intensive enough to justify the overhead of threading.
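
For context, the current pattern looks roughly like the sketch below. All names here (`pointing_matrix`, the operator class, and the `data` / `tod` attributes) are hypothetical stand-ins, not the actual TOAST API:

```python
import numpy as np

def pointing_matrix(quats):
    # Stand-in for the compiled (pybind11) kernel; the real function
    # would convert quaternions to pixel indices and weights in C++.
    n = quats.shape[0]
    return np.zeros(n, dtype=np.int64), np.ones((n, 3))

class PointingOperator:
    """Hypothetical operator illustrating the current pattern: a Python
    loop over detectors, with one compiled call per detector."""

    def exec(self, data):
        for obs in data.obs:
            tod = obs["tod"]
            for det in tod.detectors:
                quats = tod.read_pointing(det)
                # One trip into compiled code per detector.  If this
                # kernel is cheap, threading inside it may not pay off.
                pixels, weights = pointing_matrix(quats)
                tod.cache.put("pixels_{}".format(det), pixels)
                tod.cache.put("weights_{}".format(det), weights)
```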

There are several aspects (and consequences) to moving threading to a higher level than the current "do stuff with one detector". Here are some options and their impact (particularly relevant to #105 ):

  1. Each TOAST "operator" has access to a set of detectors distributed over an MPI communicator. That operator can choose how it wants to do things under the hood. If we were to create a C/C++ function that took multiple detectors as its inputs, then a specific operator could prepare all the needed inputs for the whole set of detectors and call this compiled function, which would do the threading over detectors (see the first sketch after this list). IMPACT: any operator like this would not be able to use threads effectively if it were run in a chain of operations on a single detector (#105).

  2. Addendum to the previous point. Operators have an exec() method that takes Python objects as arguments. Through the use of pybind11, one could write operators themselves in pure C++ (with the caveat of needing to unpack things a bit to access the distributed data streams). This would suffer the same penalty as the previous point: such operators, chained together and working on one detector, would not be using threads.

  3. Python does support "threads" (either with the threading module or concurrent.futures). However, such threads are effectively battling continuously for the GIL whenever they are executing Python code. When such a thread calls compiled code, it releases the GIL, and all threads can run in parallel until they return to Python code. So one option is to keep the compiled code "low-level" and thread across detectors in Python, inside the operator classes (see the second sketch after this list). Such threads will see serial performance until they enter compiled code. IMPACT: for operators run on a collection of detectors simultaneously, this solution may be less performant than the previous options, depending on the cost of crossing into and out of compiled code for each detector rather than looping in C++. When executing the compiled code for a single detector, the performance should be the same. This option also keeps open the possibility of chaining single-detector operations that call compiled code with those that do not. Obviously, the threads will only run in parallel while they are inside compiled code.
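
A minimal sketch of option 1, again with hypothetical names: `pointing_matrix_multi` stands in for a compiled function that would use OpenMP to thread over the detector axis of batched inputs.

```python
import numpy as np

def pointing_matrix_multi(quats):
    # Stand-in for a compiled kernel that threads (e.g. with OpenMP)
    # over the leading (detector) axis of the stacked inputs.
    ndet, nsamp = quats.shape[0], quats.shape[1]
    return np.zeros((ndet, nsamp), dtype=np.int64), np.ones((ndet, nsamp, 3))

class BatchedPointingOperator:
    """Hypothetical operator for option 1: gather inputs for all local
    detectors, then make a single threaded call into compiled code."""

    def exec(self, data):
        for obs in data.obs:
            tod = obs["tod"]
            dets = list(tod.detectors)
            # Stack per-detector quaternions into one contiguous array so
            # the kernel can split the detector axis across threads.
            quats = np.stack([tod.read_pointing(d) for d in dets])
            pixels, weights = pointing_matrix_multi(quats)
            for i, det in enumerate(dets):
                tod.cache.put("pixels_{}".format(det), pixels[i])
                tod.cache.put("weights_{}".format(det), weights[i])
```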

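And a sketch of option 3, using concurrent.futures to run the per-detector loop in Python threads. The threads serialize on the GIL except while inside the compiled kernel; names are again hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def pointing_matrix(quats):
    # Stand-in for a compiled kernel; a real pybind11 binding would
    # release the GIL here (py::gil_scoped_release), letting these
    # calls overlap across threads.
    n = quats.shape[0]
    return np.zeros(n, dtype=np.int64), np.ones((n, 3))

class ThreadedPointingOperator:
    """Hypothetical operator for option 3: Python threads over detectors.
    Threads only run concurrently while inside compiled code."""

    def __init__(self, nthread=4):
        self.nthread = nthread

    def _one_det(self, tod, det):
        quats = tod.read_pointing(det)
        pixels, weights = pointing_matrix(quats)
        tod.cache.put("pixels_{}".format(det), pixels)
        tod.cache.put("weights_{}".format(det), weights)

    def exec(self, data):
        for obs in data.obs:
            tod = obs["tod"]
            with ThreadPoolExecutor(max_workers=self.nthread) as pool:
                futures = [pool.submit(self._one_det, tod, d) for d in tod.detectors]
                for f in futures:
                    f.result()  # propagate any worker exceptions
```
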
tskisner commented 6 years ago

TOAST threading plan

tskisner commented 4 years ago

This discussion is somewhat stale. As of April 2020, we do have several operators that use C++ under the hood to thread over detectors (this code is called from Python via pybind11). In practice this is most critical for a handful of operators. Moving forward, some operators will go a step further and use CUDA operations when available.