Open nilsbecker opened 9 months ago
This may be a case of premature optimization, though. In more realistic applications than the toy problems I am considering now, the actual particle updates will probably become the dominant computational cost: the expensive evaluation of the ABC distance can happen in the particle threads, and the population-wide statistics required during reweighting will probably be inexpensive in comparison. It may be best to wait for proof that this really is a bottleneck in realistic problems.
Letting the user perform more computations in parallel is a good thing, but I'd have to come up with an API which is not too complex/clunky.
This is a possible performance enhancement for custom `Smc_inference` resampling schemes.

Current situation: when using multicore with `~nthreads > 1` in the `run` functions, the population is distributed over the available cores to run the model in parallel. That's fine. After that is finished, all particles are synchronized and reweighting takes place over the full population.

Limitation: in cases where the reweighting itself takes a substantial fraction of the computing time, it would be desirable to parallelize it as well. This is not easy with the provided single-particle, fine-grained parallelism.
Desirable situation: the particles of the population run in parallel, but within chunks, so that some population statistics can be gathered per chunk; only afterwards are these aggregated at the global synchronization barrier provided by the resampling step.
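To make the desired shape concrete, here is a minimal sketch of what such a chunk-aware hook could look like. The record type and field names are invented for illustration; nothing like this exists in `Smc_inference` today:

```ocaml
(* Hypothetical sketch only: a per-chunk statistic that the inference
   loop could thread through the particle updates.  None of these names
   are library API; they just make the desired shape explicit. *)
type ('particle, 'acc) chunked_stat = {
  init : unit -> 'acc;               (* one fresh accumulator per chunk *)
  step : 'acc -> 'particle -> 'acc;  (* sequential update within a chunk *)
  merge : 'acc -> 'acc -> 'acc;      (* combined at the resampling barrier *)
}
```

Each chunk would thread its own `'acc` through its particles sequentially, and the resampling barrier would fold `merge` over the per-chunk results.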
Example: estimating a variance (matrix) over the population for ABC step size tuning.
Example: sorting the population according to some order relation.
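For the variance example, the chunk-then-merge pattern boils down to per-chunk partial sums. Below is a self-contained sketch using plain OCaml 5 `Domain`s, with a `float array` standing in for the particle population (the scalar case; the matrix version would accumulate outer products instead):

```ocaml
(* Sketch: per-chunk partial sums (count, sum, sum of squares) computed
   in parallel with OCaml 5 domains, then merged into a single variance
   estimate at the barrier.  [pop] stands in for the particle values. *)
let chunked_variance ~nchunks (pop : float array) =
  let n = Array.length pop in
  let size = (n + nchunks - 1) / nchunks in
  let stats_of_chunk k () =
    let lo = k * size and hi = min n ((k + 1) * size) in
    let c = ref 0 and s = ref 0.0 and s2 = ref 0.0 in
    for i = lo to hi - 1 do
      incr c;
      s := !s +. pop.(i);
      s2 := !s2 +. (pop.(i) *. pop.(i))
    done;
    (!c, !s, !s2)
  in
  let domains =
    List.init nchunks (fun k -> Domain.spawn (stats_of_chunk k))
  in
  let c, s, s2 =
    List.fold_left
      (fun (c, s, s2) d ->
        let c', s', s2' = Domain.join d in
        (c + c', s +. s', s2 +. s2'))
      (0, 0.0, 0.0) domains
  in
  let nf = float_of_int c in
  let mean = s /. nf in
  s2 /. nf -. (mean *. mean)   (* population variance *)
```

The sorting example has the same shape: sort each chunk inside its own domain, then merge the sorted runs at the barrier.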
My particular case: calculating a quantile of a population, which could be parallelized using https://github.com/SGrondin/tdigest. Here, a number `m` (probably best: `m = nthreads`) of so-called digests can be updated in parallel and later merged efficiently to estimate a total quantile. One way to do that would be to update from within each particle run, but one would need to ensure that access to each digest is sequential. To ensure this, one would group the particles into `m` work queues and associate each queue with one digest, which is updated only from particles running within that queue, necessarily sequentially. This may be achievable with the `moonpool` library?
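Here is a sketch of the `m`-queue idea. Since I haven't pinned down the exact tdigest signatures, `DIGEST` below is an assumed stand-in for a mergeable quantile summary in the spirit of SGrondin/tdigest, and plain `Domain`s stand in for moonpool's work queues:

```ocaml
(* Assumed interface for a mergeable quantile summary; the names are
   placeholders, not the tdigest library's verified API. *)
module type DIGEST = sig
  type t
  val create : unit -> t
  val add : t -> float -> t           (* must be called sequentially *)
  val merge : t list -> t             (* cheap merge of m digests *)
  val quantile : t -> float -> float  (* e.g. q = 0.5 for the median *)
end

module Make_quantile (D : DIGEST) = struct
  (* Queue [k] owns one digest and sees indices k, k + m, k + 2m, ...,
     so access to that digest is strictly sequential: no locking. *)
  let estimate ~m ~q (values : float array) =
    let n = Array.length values in
    let queue k () =
      let d = ref (D.create ()) in
      let i = ref k in
      while !i < n do
        d := D.add !d values.(!i);
        i := !i + m
      done;
      !d
    in
    let domains = List.init m (fun k -> Domain.spawn (queue k)) in
    let digests = List.map Domain.join domains in
    D.quantile (D.merge digests) q
end
```

With moonpool one would instead pin each digest to one of `m` runners; the invariant is the same either way: one digest, one sequential stream of updates.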
Alternative: one could instead implement the parallelism separately within the resampling step, e.g. use `P.iter` to extract the total population into a global array at the beginning of resampling, and then apply manual chunking and parallel computation to that array. Probably doable, but maybe more cumbersome?
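A minimal sketch of this alternative, with the population interface abstracted out since I am only assuming `P.iter` exposes something like a per-particle visitor:

```ocaml
(* Sketch of the alternative: sequentially extract the population into
   a flat array at the start of resampling, then run any chunked
   parallel computation on it.  [iter_pop] abstracts whatever P.iter
   actually provides (an assumption), and [stat_of_array] could be the
   chunked_variance sketch above. *)
let stat_in_resampling
    ~(iter_pop : (float -> unit) -> unit)
    ~(stat_of_array : float array -> 'a) =
  let acc = ref [] in
  iter_pop (fun v -> acc := v :: !acc);  (* sequential extraction *)
  let arr = Array.of_list !acc in
  stat_of_array arr                      (* parallel chunked part *)
```

The cost is an extra O(population) copy per resampling step, which should be negligible whenever the reweighting itself is expensive enough to be worth parallelizing.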