Closed: rory-smith closed this issue 5 years ago
Both your intuitions are roughly correct, with some small caveats.
The parallelism scaling with the number of proposals is actually slightly sub-linear. This is because nested sampling requires serial updates of the form L_proposed > L_worst. When you make a batch of parallel proposals, L_worst is the same for all proposed points, but as you accept new points serially L_worst increases (since each accepted point replaces the current worst). This eventually leads to some points in the batch being rejected, lowering the efficiency. I forget the exact expected scaling, but could derive it for you if you're interested. I think it's also in the PolyChord paper.
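If it helps, here's a toy Monte Carlo sketch of that effect (my own simplified model, not dynesty code): each proposal is tagged by the prior volume it encloses, and each serial acceptance shrinks the threshold volume by roughly nlive/(nlive + 1), so later members of a parallel batch are more likely to land below the updated threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_accepts(nlive, queue_size, ntrials=2000):
    """Toy model of serial acceptance within a parallel batch: proposals
    are drawn uniformly inside the contour of the original worst point
    (volume normalized to 1), and each acceptance shrinks the threshold
    volume by ~nlive/(nlive + 1)."""
    total = 0
    for _ in range(ntrials):
        volumes = rng.random(queue_size)   # proposals, in volume terms
        threshold = 1.0
        for v in volumes:
            if v < threshold:              # still inside the updated contour
                total += 1
                threshold *= nlive / (nlive + 1.0)
    return total / ntrials

for q in (1, 4, 16, 64):
    acc = mean_accepts(nlive=500, queue_size=q)
    print(f"queue_size={q:3d}: ~{acc:5.1f} accepted per batch "
          f"(ideal would be {q})")
```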
As for the bound updates, those are indeed generally deterministic and done all at once, so there is no benefit to parallelizing them directly. Parallelism instead enters through the "bootstrap" argument, which repeats the decomposition for random subsets of live points to derive how much the bounds should be enlarged. This is disabled by default because it adds a lot of overhead and generally leads to very conservative bounds, but when it is enabled, that's the process that gets parallelized.
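To illustrate the shape of that computation, here's a hedged sketch with a toy bounding sphere standing in for the real ellipsoid decomposition (none of these names are dynesty's):

```python
import numpy as np
from multiprocessing import Pool

def _one_realization(args):
    """One bootstrap draw: fit a toy bounding sphere to a resampled set
    of live points and return the radial factor needed for it to also
    cover the points left out of the resample."""
    pts, seed = args
    r = np.random.default_rng(seed)
    n = len(pts)
    idx = r.integers(0, n, size=n)                 # resample with replacement
    held_out = np.setdiff1d(np.arange(n), idx)
    center = pts[idx].mean(axis=0)
    radius = np.linalg.norm(pts[idx] - center, axis=1).max()
    if held_out.size == 0:
        return 1.0
    d = np.linalg.norm(pts[held_out] - center, axis=1).max()
    return max(1.0, d / radius)                    # expansion to cover them

def bootstrap_expansion(points, nboot=20, pool=None, seed=0):
    """Each realization is independent, so the map over them is
    embarrassingly parallel -- this is the part a pool can speed up."""
    pts = np.asarray(points, dtype=float)
    seeds = np.random.default_rng(seed).integers(0, 2**32, size=nboot)
    mapper = pool.map if pool is not None else map
    factors = list(mapper(_one_realization, [(pts, s) for s in seeds]))
    return max(factors)  # conservative: cover the worst realization

if __name__ == "__main__":
    live = np.random.default_rng(1).normal(size=(500, 3))
    with Pool(4) as pool:
        print(bootstrap_expansion(live, nboot=50, pool=pool))
```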
Thanks for the explanation! I took a look at the PolyChord paper and the scaling makes sense.
Slightly off topic, but I've been experimenting with parallelized dynesty and I see markedly different behaviour in how the output is reported depending on whether I use `unif` or `rwalk` sampling. Typically, when using `unif` sampling I get a steady stream of output with a roughly constant call time once the prior bounds start to be evaluated (by call time, I just mean dividing ncall by the approximate run time at a particular printed ncall). However, with `rwalk` the output is steadily printed up to the point where the first prior bound is constructed, and then the code appears to "hang" for a period in which no output appears. After a while, the output for many hundreds or thousands of iterations appears all at once. Importantly, the call time does appear to be higher in this mode of running. I was trying to understand whether anything should be significantly more expensive when running with `rwalk`, or if there's potentially a communication bottleneck/communication blocking/etc., something of that type.
> the output for many hundreds or thousands of iterations appears all at once
That seems excessive, but the idea that output appears in batches is not crazy. The way parallelization works is pretty simplistic: you pipe all proposals to a `pool.map` instead of a normal `map`. The number of proposals you evaluate is just based on the `queue_size`, which is either taken from the `pool.size` attribute (if I remember correctly) or can be specified by the user. This means that longer proposals all first get evaluated in parallel before getting pulled off the queue in serial and accepted/rejected. The latter step is pretty fast, so it essentially looks like you update in chunks.
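In code, the pattern looks roughly like this (a minimal sketch with made-up names, not dynesty's actual internals):

```python
from multiprocessing import Pool

def loglike(theta):
    """Stand-in likelihood; the real one is whatever the user supplied."""
    return -0.5 * sum(t * t for t in theta)

def fill_and_drain(proposals, logl_worst, pool):
    """Evaluate a whole queue of proposals in parallel with pool.map,
    then pull them off the queue serially and accept/reject each one."""
    logls = pool.map(loglike, proposals)       # parallel: fill the queue
    accepted = []
    for theta, logl in zip(proposals, logls):  # serial: drain the queue
        if logl > logl_worst:                  # in a real run, logl_worst rises
            accepted.append((theta, logl))     # as live points get replaced
    return accepted

if __name__ == "__main__":
    with Pool(4) as pool:
        batch = [(0.1 * i, -0.2 * i) for i in range(8)]
        print(fill_and_drain(batch, logl_worst=-1.0, pool=pool))
```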
> the output is steadily printed up to the point where the first prior bound is constructed, and then the code appears to "hang" for a period in which no output appears
By default, dynesty samples uniformly from the prior until the condition specifying the construction of the first bound is satisfied. At that point, you switch over to `rwalk`, which is why there's a delay.
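For reference, that switchover point can be tuned via `first_update` (a hedged usage sketch; the threshold values here are just illustrative, not recommendations):

```python
import numpy as np
import dynesty

ndim = 3

def loglike(x):
    return -0.5 * np.sum(x**2)

def prior_transform(u):
    return 10.0 * u - 5.0  # map the unit cube to [-5, 5]

# first_update sets when the first bound gets built, i.e. when sampling
# switches from uniform draws over the prior to the chosen sampler.
sampler = dynesty.NestedSampler(
    loglike, prior_transform, ndim,
    bound='multi', sample='rwalk',
    first_update={'min_ncall': 2000, 'min_eff': 10.0},
)
```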
I was trying to understand if there should be anything significantly more expensive when running with rwalk
As described in the paper and documentation, `rwalk` generally proposes new positions using a fixed number of `walks`, leading to an overall sampling efficiency per proposal that is often lower than `unif` for many simple problems.
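As a rough cost accounting (hedged, and reusing `loglike`/`prior_transform`/`ndim` from the sketch above): each accepted `rwalk` point costs about `walks` likelihood calls, whereas a `unif` proposal costs a single call.

```python
import dynesty

# `walks` is the length of the short MCMC chain behind each rwalk
# proposal (default 25), so per-call efficiency is ~1/25 = 4% even
# before any rejections against L_worst.
sampler = dynesty.NestedSampler(loglike, prior_transform, ndim,
                                sample='rwalk', walks=25)
```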
> communication bottleneck
Some of this could be due to issues with `stderr` printing as well; there's a PR that should hopefully fix this.
Hi @joshspeagle,
I'm trying to better understand the expected performance of dynesty when using a parallel pool and I'd be grateful if you could clarify a couple of points.
Regarding proposing live points: I understand that with a pool of N CPUs, it should be possible to draw batches of N proposals in parallel within the `update_interval`. Assuming that the `update_interval` is a multiple of N, the reduction in cost should scale linearly with N. This is what I've observed in practice when timing the number of calls to the likelihood function and ignoring the cost of boundary updates; see the sketch below for the kind of setup I have in mind. I just wanted to double-check that this argument is robust.
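For context, the setup looks roughly like this (a sketch with placeholder likelihood and prior; `pool` and `queue_size` are dynesty's actual arguments):

```python
from multiprocessing import Pool
import numpy as np
import dynesty

ndim = 3

def loglike(x):
    return -0.5 * np.sum(x**2)

def prior_transform(u):
    return 10.0 * u - 5.0  # unit cube -> [-5, 5]

if __name__ == "__main__":
    # With queue_size matched to the pool size, batches of N proposals
    # get evaluated in parallel between bound updates.
    with Pool(8) as pool:
        sampler = dynesty.NestedSampler(loglike, prior_transform, ndim,
                                        pool=pool, queue_size=8)
        sampler.run_nested()
```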
Regarding updates to the prior bounds: using the default option `update_bound=True`, I'm curious how the prior bounds get updated in parallel, for, say, the `multi` bounding distribution. After reading Section 4.1.3 of https://arxiv.org/pdf/1904.02180.pdf, I got the impression that the updates to the bounds would be deterministic and that the decomposition-selection is serial (due to the recursion). Hence, I'd expect that updating the bounds in parallel doesn't really scale down the cost of the boundary updates. Is this the case, or is there another bit of parallelism that's exploited?

Thanks and best,
Rory