Open nilsbecker opened 9 months ago
This may be a case of premature optimization, though. In more realistic applications than the toy problems I am considering now, the actual particle updates will probably become the dominant computational cost: the expensive evaluation of the ABC distance can happen in the particle threads, and the population-wide statistics required during reweighting will probably be inexpensive in comparison. It may be best to wait for proof that this really is a bottleneck in realistic problems.
Letting the user perform more computations in parallel is a good thing, but I'd have to come up with an API which is not too complex/clunky.
This is a possible performance enhancement for custom `Smc_inference` resampling schemes.

Current situation: when using multicore with `~nthreads > 1` in the `run` functions, the population is distributed over the available cores to run the model in parallel. That's fine. After that is finished, all particles are synchronized and reweighting takes place over the full population.

Limitation: in cases where the reweighting itself takes a substantial fraction of the computing time, it would be desirable to parallelize it as well. This is not easy with the provided single-particle, fine-grained parallelism.
Desirable situation: the particles of the population run in parallel, but within chunks, so that some population statistics can be gathered per chunk; only afterwards are these aggregated at the global synchronization barrier provided by the resampling step.
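To make the desired shape concrete, here is a minimal sketch of what such a chunk-aware hook could look like. The record type and field names are invented for illustration; nothing like this exists in `Smc_inference` today:

```ocaml
(* Hypothetical sketch only: a per-chunk statistic that the inference
   loop could thread through the particle updates.  None of these names
   are library API; they just make the desired shape explicit. *)
type ('particle, 'acc) chunked_stat = {
  init : unit -> 'acc;               (* one fresh accumulator per chunk *)
  step : 'acc -> 'particle -> 'acc;  (* sequential update within a chunk *)
  merge : 'acc -> 'acc -> 'acc;      (* combined at the resampling barrier *)
}
```

Each chunk would thread its own `'acc` through its particles sequentially, and the resampling barrier would fold `merge` over the per-chunk results.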
Example: estimating a variance (matrix) over the population for ABC step size tuning.
Example: sorting the population according to some order relation.
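For the variance example, the chunk-then-merge pattern boils down to per-chunk partial sums. Below is a self-contained sketch using plain OCaml 5 `Domain`s, with a `float array` standing in for the particle population (the scalar case; the matrix version would accumulate outer products instead):

```ocaml
(* Sketch: per-chunk partial sums (count, sum, sum of squares) computed
   in parallel with OCaml 5 domains, then merged into a single variance
   estimate at the barrier.  [pop] stands in for the particle values. *)
let chunked_variance ~nchunks (pop : float array) =
  let n = Array.length pop in
  let size = (n + nchunks - 1) / nchunks in
  let stats_of_chunk k () =
    let lo = k * size and hi = min n ((k + 1) * size) in
    let c = ref 0 and s = ref 0.0 and s2 = ref 0.0 in
    for i = lo to hi - 1 do
      incr c;
      s := !s +. pop.(i);
      s2 := !s2 +. (pop.(i) *. pop.(i))
    done;
    (!c, !s, !s2)
  in
  let domains =
    List.init nchunks (fun k -> Domain.spawn (stats_of_chunk k))
  in
  let c, s, s2 =
    List.fold_left
      (fun (c, s, s2) d ->
        let c', s', s2' = Domain.join d in
        (c + c', s +. s', s2 +. s2'))
      (0, 0.0, 0.0) domains
  in
  let nf = float_of_int c in
  let mean = s /. nf in
  s2 /. nf -. (mean *. mean)   (* population variance *)
```

The sorting example has the same shape: sort each chunk inside its own domain, then merge the sorted runs at the barrier.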
My particular case: calculating a quantile of a population, which could be parallelized using https://github.com/SGrondin/tdigest. Here, a number `m` (probably best: `m = nthreads`) of so-called digests can be updated in parallel and later merged efficiently to estimate a total quantile. One way to do that would be to update from within each particle run, but one would need to ensure that access to each digest is sequential. To ensure this, one would group the particles into `m` work queues and associate each queue with one digest, which is updated only from particles running within that queue, necessarily sequentially. This may be achievable with the `moonpool` library?
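Here is a sketch of the `m`-queue idea. Since I haven't pinned down the exact tdigest signatures, `DIGEST` below is an assumed stand-in for a mergeable quantile summary in the spirit of SGrondin/tdigest, and plain `Domain`s stand in for moonpool's work queues:

```ocaml
(* Assumed interface for a mergeable quantile summary; the names are
   placeholders, not the tdigest library's verified API. *)
module type DIGEST = sig
  type t
  val create : unit -> t
  val add : t -> float -> t           (* must be called sequentially *)
  val merge : t list -> t             (* cheap merge of m digests *)
  val quantile : t -> float -> float  (* e.g. q = 0.5 for the median *)
end

module Make_quantile (D : DIGEST) = struct
  (* Queue [k] owns one digest and sees indices k, k + m, k + 2m, ...,
     so access to that digest is strictly sequential: no locking. *)
  let estimate ~m ~q (values : float array) =
    let n = Array.length values in
    let queue k () =
      let d = ref (D.create ()) in
      let i = ref k in
      while !i < n do
        d := D.add !d values.(!i);
        i := !i + m
      done;
      !d
    in
    let domains = List.init m (fun k -> Domain.spawn (queue k)) in
    let digests = List.map Domain.join domains in
    D.quantile (D.merge digests) q
end
```

With moonpool one would instead pin each digest to one of `m` runners; the invariant is the same either way: one digest, one sequential stream of updates.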
Alternative: one could instead implement the parallelism separately within the resampling step, e.g. use `P.iter` to extract the total population into a global array at the beginning of resampling, and then apply manual chunking and parallel computation to that array. Probably doable, but maybe more cumbersome?
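A minimal sketch of this alternative, with the population interface abstracted out since I am only assuming `P.iter` exposes something like a per-particle visitor:

```ocaml
(* Sketch of the alternative: sequentially extract the population into
   a flat array at the start of resampling, then run any chunked
   parallel computation on it.  [iter_pop] abstracts whatever P.iter
   actually provides (an assumption), and [stat_of_array] could be the
   chunked_variance sketch above. *)
let stat_in_resampling
    ~(iter_pop : (float -> unit) -> unit)
    ~(stat_of_array : float array -> 'a) =
  let acc = ref [] in
  iter_pop (fun v -> acc := v :: !acc);  (* sequential extraction *)
  let arr = Array.of_list !acc in
  stat_of_array arr                      (* parallel chunked part *)
```

The cost is an extra O(population) copy per resampling step, which should be negligible whenever the reweighting itself is expensive enough to be worth parallelizing.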