joshspeagle / dynesty

Dynamic Nested Sampling package for computing Bayesian posteriors and evidences
https://dynesty.readthedocs.io/
MIT License

Understanding parallelism/queue_size #312

Closed: segasai closed this issue 2 years ago

segasai commented 3 years ago

I've been looking at the parallelization implementation and I'm somewhat struggling to see the point of queue_size.

Basically, the only time get_point(logl > logLstar) is called many times is in this loop: https://github.com/joshspeagle/dynesty/blob/18a953e4138c842b3cbee971ccf4f2dc9a81da65/py/dynesty/sampler.py#L403. The loop is executed many times only if propose_point + evolve_point do not actually return a point with L > loglstar, and I think that will only happen when the sampler is 'unif'. I don't think there will be any actual speedup if sampler != unif (the cores will still fill the queue, but those points will be unused as far as I can see).

Am I missing something?

(Isn't the right parallelization strategy to fill the queue with samples satisfying L > Llive[0], L > Llive[1], L > Llive[2], etc., where Llive are the sorted likelihoods of the live points?)

joshspeagle commented 3 years ago

The logic behind the parallelization strategy is old and could be updated, but it follows the basic idea of simulating a "serial" nested sampling operation: you want to draw a live point at iteration i subject to the constraint L_i > order(L_live_i, 1) (where order(X, K) grabs the K-th smallest element of X), and then repeat. When doing this operation in parallel, the problem is that you have some non-zero probability that the newly sampled likelihood will satisfy order(L_live_i, 1) < L_i < order(L_live_i, 2). So if you were doing things serially, you would end up with L_i = order(L_live_{i+1}, 1), and therefore your L_i > order(L_live_i, 2) sample would be missing some of the prior volume. From a traditional viewpoint, the only solution to this is to propose a bunch of points subject to L > order(L_live_i, 1), then pretend they were proposed serially so you can update the likelihood bound at each "iteration". See the PolyChord paper for details and a plot showing the expected parallel inefficiencies that result from this strategy.
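To make that concrete, here is a toy sketch of the "propose a batch against the current bound, then pretend it arrived serially" idea; the helper name is hypothetical and this is not dynesty's actual implementation (it only tracks accepted log-likelihoods, not the full dead-point bookkeeping):

```python
def replay_batch_serially(live_logl, batch_logl):
    """Toy sketch: replay a batch of proposals, all drawn with
    L > min(live_logl), as if they had been made one at a time."""
    live = sorted(live_logl)      # current live-point log-likelihoods
    accepted = []
    for logl in batch_logl:       # pretend the proposals arrived serially
        if logl > live[0]:        # beats the *updated* worst live point
            accepted.append(logl)
            live[0] = logl        # worst live point is replaced...
            live.sort()           # ...and the threshold moves up
        # else: the proposal only beat the old threshold and is wasted,
        # which is the parallel inefficiency quantified in PolyChord.
    return accepted

# e.g. replay_batch_serially([-3.0, -2.0, -1.0], [-2.5, -2.7, -0.5])
# returns [-2.5, -0.5]: the middle proposal is lost to the updated bound.
```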

joshspeagle commented 3 years ago

On a related note, however...

In the more expansive view of NS that we now have, the above framework is overly restrictive. There is actually a lot of flexibility for how points can be added in...as long as we keep proper track of them. In particular, under the current proposal scheme, any point that fails to satisfy the current "serial update" likelihood thresholds could instead be retroactively added to the run by adjusting the effective number of live points back when it was first proposed. In essence, adding in queue_size points is functionally equivalent to a batch update with queue_size live points in a dynamic nested sampling run (so increasing from K -> K + queue_size), so fully efficient parallelism just turns every nested sampling run into a dynamic one.
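A rough way to see the equivalence in terms of the usual prior-volume shrinkage (this is standard nested sampling bookkeeping, not anything specific to dynesty's internals): with $K$ live points each dead point shrinks the log prior volume by $1/K$ on average, and retroactively keeping the queue_size extra proposals above the same threshold raises the effective live-point count over that stretch,

$$ \mathbb{E}[\Delta \ln X] = -\frac{1}{K} \;\longrightarrow\; -\frac{1}{K + \texttt{queue\_size}}, $$

which is exactly the behaviour of a dynamic nested sampling batch update with $K \to K + \texttt{queue\_size}$.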

Likewise, under your proposal strategy, the same would hold true but using the likelihood thresholds based on the set of live points when the batch is proposed (so it would be a bit more dispersed).

I don't know how difficult it would be to implement this improved parallelism scheme (even in just the dynamic nested sampler) though.

segasai commented 3 years ago

Ok, I see the flaw in my idea, but the original question still stands. AFAICS queue_size is a no-op for anything but uniform sampling, isn't it?

segasai commented 3 years ago

Yes, one of my first thoughts when I started to look at dynesty was to improve the parallelization. I think the overall infrastructure is now in better shape, and one can start looking at this. And I agree with your point about changing the number of live points to fully parallelize.

joshspeagle commented 3 years ago

> AFAICS queue_size is a no-op for anything but uniform sampling, isn't it?

It should just pull points off the queue that have been proposed using whatever strategy the code relies on, subject to L > L_live[0]. This means filling the queue for other methods is slower (it's just a pool.map operation), but the behaviour should be the same once points are on the queue, no?
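For reference, a stripped-down model of the queue behaviour being described; the class and helper names here (ToyQueueSampler, propose_and_evolve) are hypothetical, and the real logic lives in dynesty's sampler.py (linked further down):

```python
class ToyQueueSampler:
    """Toy model of the fill-then-pop queue behaviour (not dynesty's code)."""

    def __init__(self, propose_and_evolve, queue_size, pool_map=map):
        # propose_and_evolve(loglstar) -> (point, logl); placeholder callable
        self.propose_and_evolve = propose_and_evolve
        self.queue_size = queue_size
        self.pool_map = pool_map      # e.g. multiprocessing.Pool().map
        self.queue = []

    def _fill_queue(self, loglstar):
        # Propose queue_size candidates in parallel, all against the
        # *current* threshold (the single pool.map operation).
        args = [loglstar] * self.queue_size
        self.queue = list(self.pool_map(self.propose_and_evolve, args))

    def get_point_value(self, loglstar):
        # Pop from the queue; only refill when it is empty.  Leftover
        # points proposed against an older (lower) loglstar are still
        # handed back here, which is the subtlety discussed below.
        if not self.queue:
            self._fill_queue(loglstar)
        return self.queue.pop(0)
```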

segasai commented 3 years ago

What I'm saying (and I think the test I've done confirms it) is that after the first boundary is constructed, unless the sampler is 'unif', _get_point_value is guaranteed to return a point above loglstar. Therefore https://github.com/joshspeagle/dynesty/blob/18a953e4138c842b3cbee971ccf4f2dc9a81da65/py/dynesty/sampler.py#L418 will exit the loop. Because the next call of _new_point will be with a higher loglstar, the whole queue of values is wasted, with the exception of the first value.

I.e., with the default configuration, when ndim > 10 (and therefore a non-uniform sampler is used) the code will be faster only up to the first update (when sampling switches from uniform to the sampler based on live points).

joshspeagle commented 3 years ago

> _get_point_value is guaranteed to return a point above loglstar

The first time. But then loglstar should be updated. Also, the queue sticks around until it is fully emptied; the other points shouldn't be wasted, even if loglstar is not properly updated. _get_point_value tries to pop the queue, and just refills it if it's empty.

https://github.com/joshspeagle/dynesty/blob/18a953e4138c842b3cbee971ccf4f2dc9a81da65/py/dynesty/sampler.py#L383-L396

segasai commented 3 years ago

Ah, I see. Sorry, I didn't realize that there is no check on whether the queue was filled during a previous iteration with a lower loglstar threshold. Got it now. I'll close the issue for now (and will think a bit more about parallelization).

segasai commented 3 years ago

On parallelization: I've just done a quick check of getting rid of the Nthread worst points at each iteration, i.e. see the second function (doin2) below.

```python
import numpy as np


def doin(N, nit):
    # simple (serial) nested sampling: track only the log prior volumes
    x = np.log(np.random.uniform(size=N))   # initial log volumes of N live points
    x = np.sort(x).tolist()
    ret = []
    for i in range(nit):
        pos = np.argmax(x)        # worst live point = largest volume
        maxv = x[pos]
        ret.append(maxv)
        del x[pos]
        newv = maxv + np.log(np.random.uniform())   # shrink inside the removed shell
        x.append(newv)
    return ret


def doin2(N, nit, nthreads):
    # parallel version: drop the nthreads worst points at each iteration
    x = np.log(np.random.uniform(size=N))   # initial log volumes
    x = x.tolist()
    ret = []
    for i in range(nit):
        x = sorted(x)
        drops = x[-nthreads:]     # nthreads largest volumes (worst points)
        ret.extend(drops)
        del x[-nthreads:]
        newv = np.array(drops) + np.log(np.random.uniform(size=nthreads))
        x.extend(newv.tolist())
    return ret
```
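For completeness, one way to reproduce the comparison described below using doin/doin2 above (nit is chosen arbitrarily here, and matplotlib is assumed for plotting):

```python
import matplotlib.pyplot as plt

nlive, nthreads, nit = 33, 32, 300      # values from the plot below; nit arbitrary
serial = doin(nlive, nit * nthreads)    # one dead point per iteration
par = doin2(nlive, nit, nthreads)       # nthreads dead points per iteration

plt.plot(serial, '.', label='serial')
plt.plot(sorted(par, reverse=True), '.', label='parallel (sorted)')
plt.xlabel('dead point index')
plt.ylabel('log volume')
plt.legend()
plt.show()
```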

Here is the log-volume vs. iteration plot for Nlive=33, nthreads=32:

[figure: log volume vs. iteration; left panel shows the dead points in the order received, right panel shows them sorted]

On the left are the points in the order they were received; on the right are the sorted ones. The slope is still -1/33 as expected, with the exception of the very edge, where we run out of points because some points "lag behind".

That doesn't seem that hard to implement, other than requiring constant resorting of the logl history.

The only problem I see, though, is that for the uniform sampler, sampling constrained by order(L, x) with x ~ Nlive may be inefficient.