joshspeagle / dynesty

Dynamic Nested Sampling package for computing Bayesian posteriors and evidences
https://dynesty.readthedocs.io/
MIT License

Parallelization issue with "unif" sampler #425

Closed MikhailBeznogov closed 1 year ago

MikhailBeznogov commented 1 year ago

Hello,

I have noticed a strange behavior of dynesty when parallelized (either with multiprocessing pool or MPI).

Before the first_update criterion is met, the sampler is always unif and it is indeed parallelized. However, after the first_update the situation changes. If the requested sampler is unif, then, after the first_update, it no longer works in parallel. To be more specific, the multiprocessing pool is running, but only one process at a time is actually fully loaded; the other processes in the pool are idle. The number of likelihood evaluations per second also corresponds to a single-threaded computation.

If the requested sampler is other than unif (I have explicitly tried rwalk and rslice), then, after the first_update, the sampling is performed in parallel (and, judging by the CPU load, even more efficiently than before the first_update).

Here are some examples. The first screenshot shows CPU load with the unif sampler before and after the first_update (left black vertical line marks the start of the computation, the right one marks the first_update).

[Screenshot: unif_crop]

The second screenshot shows the same for the rslice sampler; the black vertical line marks the first_update. The difference can be clearly seen.

[Screenshot: rslice_crop]

Since the likelihood in my case takes only 1-2 ms to evaluate, I have also tried the same with huge "over-subscription" by setting queue_size to much more than the actual number of threads (1600 vs 32). The results for the unif and rslice samplers are demonstrated in the next two screenshots.

[Screenshots: unif_os_crop, rslice_os_crop]

As one can see, for the unif sampler the results are qualitatively the same, although the performance has increased roughly threefold (from ~10 it/s just before the first_update to ~30 it/s). The same goes for the rslice sampler, which becomes even more parallel-efficient.

I have also checked the behavior of the unif sampler when parallelized via MPI (without "over-subscription"). The CPU load in that case is always 100%, but the drop in performance after the first_update clearly indicates that the situation is the same as with the multiprocessing pool. Here is a part of the output "around" the first_update:

5503it [02:38,  7.10it/s, batch: 0 | bound: 0 | nc: 1863 | ncall: 1090741 | eff(%):  0.504 
5505it [02:39,  6.49it/s, batch: 0 | bound: 0 | nc: 2568 | ncall: 1093355 | eff(%):  0.503 
5506it [02:39,  4.75it/s, batch: 0 | bound: 0 | nc: 3290 | ncall: 1096645 | eff(%):  0.502 
5507it [02:39,  4.49it/s, batch: 0 | bound: 0 | nc: 1842 | ncall: 1098487 | eff(%):  0.501 
5508it [02:40,  3.25it/s, batch: 0 | bound: 0 | nc: 4285 | ncall: 1102772 | eff(%):  0.499 
5509it [02:45,  1.40s/it, batch: 0 | bound: 1 | nc: 742 | ncall: 1103514 | eff(%):  0.499 
5510it [02:58,  4.44s/it, batch: 0 | bound: 1 | nc: 2159 | ncall: 1105673 | eff(%):  0.498 
5511it [03:02,  4.20s/it, batch: 0 | bound: 1 | nc: 611 | ncall: 1106284 | eff(%):  0.498 
5512it [03:03,  3.30s/it, batch: 0 | bound: 1 | nc: 166 | ncall: 1106450 | eff(%):  0.498 
5513it [03:10,  4.52s/it, batch: 0 | bound: 1 | nc: 1273 | ncall: 1107723 | eff(%):  0.497

The loss of performance after the first_update is clearly seen (both in the number of iterations per second and in the number of likelihood evaluations per second).

Consequently, using the unif sampler is very slow (unless bound is set to none and the first_update is not executed, but this is extremely inefficient in combination with the unif sampler).

P.S. All tests were done with the latest version, 2.1.0; first_update={'min_eff':0.50} in all tests, slices=100 for the rslice sampler, bootstrap = 50. The tests above were performed with bound = 'multi', but the situation with other types of bounds is similar.

segasai commented 1 year ago

Hi,

I'll start with this example, which shows that the uniform sampler does utilise all the cores if you have a heavy-to-compute function:

import numpy as np
import multiprocessing as mp
import dynesty

nlive = 1000
printing = True

ndim = 10
gau_s = 0.01

def loglike_gau(x):
    # busy loop to make the likelihood artificially expensive
    for i in range(1000000):
        1 + 1
    return (-0.5 * np.log(2 * np.pi) * ndim - np.log(gau_s) * ndim -
            0.5 * np.sum((x - 0.5)**2) / gau_s**2)

def prior_transform_gau(x):
    # uniform prior on the unit cube
    return x

LOGZ_TRUTH_GAU = 0

def test_pool_samplers():
    # this is to test how the samplers are dealing with queue_size>1
    rstate = np.random.default_rng(2)

    with mp.Pool(10) as pool:
        sampler = dynesty.NestedSampler(loglike_gau,
                                        prior_transform_gau,
                                        ndim,
                                        nlive=nlive,
                                        sample='unif',
                                        pool=pool,
                                        queue_size=10,
                                        rstate=rstate,
                                        first_update={
                                            'min_eff': 90,
                                            'min_ncall': 10
                                        })
        sampler.run_nested(print_progress=printing)
        # logZ for this Gaussian should be consistent with 0 within 5 sigma
        assert (abs(LOGZ_TRUTH_GAU - sampler.results['logz'][-1]) <
                5. * sampler.results['logzerr'][-1])

if __name__ == '__main__':
    test_pool_samplers()

Regarding what's happening in your case: first, there is not enough info at the moment to diagnose it (one needs a reproducible example). But the key difference of uniform sampling is that the proposal of points within the ellipsoids is done in a single thread; only afterwards are the likelihoods evaluated in parallel.
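
A schematic sketch of that split (this is not dynesty's actual code; loglike and propose are made-up stand-ins): candidate points are generated one by one in the main process, and only the likelihood calls are farmed out to the pool.

import numpy as np
from multiprocessing import Pool

ndim = 7
gau_s = 0.01

def loglike(x):
    # toy stand-in for a fast (~1 ms) likelihood
    return -0.5 * np.sum((x - 0.5)**2) / gau_s**2

def propose(rng):
    # stand-in for drawing a candidate from the bounding ellipsoid;
    # in the scheme described above this step runs in the main process only
    return rng.uniform(size=ndim)

if __name__ == '__main__':
    rng = np.random.default_rng(0)
    with Pool(8) as pool:
        batch = [propose(rng) for _ in range(1000)]  # serial, main process
        logls = pool.map(loglike, batch)             # parallel, workers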

One possibility for what you're seeing is that the bounding ellipsoid you get for your problem is extreme (i.e. very elongated, much longer than the side of the cube). In that case most of the points proposed within the ellipsoid will fall outside the cube, the code will keep retrying, and all of that happens in a single thread before the likelihood function is ever evaluated.
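
A quick illustration of that effect (purely illustrative numbers, not taken from this issue): draw points from a Gaussian whose first axis is much longer than the unit cube and count how many survive the cube cut.

import numpy as np

rng = np.random.default_rng(1)
ndim = 7
scales = np.full(ndim, 0.1)
scales[0] = 10.0  # one hypothetical axis ~10x longer than the cube
pts = rng.normal(0.5, scales, size=(100_000, ndim))
inside = np.all((pts >= 0) & (pts <= 1), axis=1)
# with these numbers only ~4% of the draws survive, and every rejected
# draw is handled serially before any likelihood is evaluated
print(f"acceptance fraction: {inside.mean():.3f}")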

Another possibility is that your function is just too heavy to pickle and that dominates the overheads.

You can test the first hypothesis by lowering the threshold for the warning here https://github.com/joshspeagle/dynesty/blob/2d49cf9104e417a44ada7f2f1d013cbfee0acf01/py/dynesty/nestedsamplers.py#L705

You can test the second hypothesis by using dynesty.pool.Pool, which eliminates the pickling overhead.
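
For reference, a minimal sketch of that route (the toy loglike and prior_transform below are placeholders, not the actual functions from this issue):

import numpy as np
import dynesty
import dynesty.pool as dypool

ndim = 7

def loglike(x):
    # placeholder likelihood
    return -0.5 * np.sum((x - 0.5)**2) / 0.01**2

def prior_transform(u):
    # uniform prior on the unit cube
    return u

if __name__ == '__main__':
    # the functions are shipped to the workers once at pool start-up,
    # so they are not re-pickled with every likelihood call
    with dypool.Pool(8, loglike, prior_transform) as pool:
        sampler = dynesty.NestedSampler(pool.loglike, pool.prior_transform,
                                        ndim, nlive=500, sample='unif',
                                        pool=pool, queue_size=8)
        sampler.run_nested(print_progress=False)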

Finally, if you cannot share a reproducible example, you should at least share the exact way you call dynesty (with all the information, such as nlive, ndim, etc.) and ideally the full output from the problematic unif run.

MikhailBeznogov commented 1 year ago

Hello,

Thank you for your help. I will address your questions below.

  1. Here is how I call dynesty with "over-subscription" (ndim=7, npoints=1000); Chi2_Tot and Prior are my own likelihood and prior-transform functions:

    from multiprocessing import Pool
    import dynesty

    with Pool() as MP_pool:
        sampler = dynesty.DynamicNestedSampler(Chi2_Tot, Prior, ndim, bound='multi',
                                               sample='unif',
                                               first_update={'min_eff': 0.50},
                                               update_interval=500.5,
                                               bootstrap=50, enlarge=1.0,
                                               pool=MP_pool, queue_size=1600)
        sampler.run_nested(n_effective=10000, dlogz_init=0.05, nlive_init=npoints,
                           nlive_batch=500)
  2. I do not think that pickling is an issue, as dynesty works efficiently in parallel with the other samplers (see the rslice examples in my opening post). Moreover, emcee and my own tests, where I evaluate the likelihood in parallel using the map of a multiprocessing pool, also work fine and with high parallel efficiency (a minimal sketch of the latter follows this list). The former requires enough (>1000) walkers and the latter requires setting chunksize to at least 100 to be really parallel-efficient, but I suppose that is because the likelihood takes only ~1 ms to evaluate and the multiprocessing pool inevitably introduces some overhead.
  3. I think that very elongated ellipsoids might indeed be the cause of the issue. The posterior distributions of some of the parameters have very heavy tails going to the limits of the model parameters' ranges. If this is the case, is there any way to circumvent the issue? I really prefer the unif sampler as it is more robust and does not require estimating the proper number of walks or slices.
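
For completeness, a minimal sketch of the plain-pool test mentioned in point 2 (chi2 here is a stand-in, not the actual Chi2_Tot):

import numpy as np
from multiprocessing import Pool

def chi2(x):
    # stand-in for the ~1 ms likelihood
    return -0.5 * np.sum((x - 0.5)**2) / 0.01**2

if __name__ == '__main__':
    points = np.random.default_rng(0).uniform(size=(100_000, 7))
    with Pool(8) as pool:
        # a large chunksize amortises the per-task overhead of the pool;
        # with chunksize=1 the workers spend most of their time idle
        logls = pool.map(chi2, points, chunksize=100)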

I will send a working example privately.

segasai commented 1 year ago

Hi,

I don't quite agree with your comment number 2. The reason is that when you use rslice or rwalk with, say, walks=50, each thread will execute at least 50 likelihood calls per pickle. When using the uniform sampler it will always be 1 call per pickle; because of that, rslice/rwalk will look better in terms of parallelisation.

But certainly option 3 is quite possible. And I think there are a few possible improvements in that area: one is doing the ellipsoidal sampling in parallel (it will have the overhead of sending over the ellipsoidal bounds), the other is detecting the crazy ellipsoids and maybe reshaping them somewhat.

segasai commented 1 year ago

An update on my side. I had a bit of time to look into this. And a couple of points

MikhailBeznogov commented 1 year ago

Hello,

Thanks for the update.

If fast likelihood functions cause the sampler to lose parallel efficiency, perhaps adding an option to vectorize them (i.e., evaluate the likelihood at more than one point per call) would help? For example, this is implemented in JohannesBuchner/UltraNest and seems to be efficient if the number of requested points per call is large enough. If the likelihood does not support vectorization itself, it can easily be "wrapped" by mapping:

import numpy as np

def Chi2_Tot_Vect(points):
    # evaluate the scalar likelihood at each point and return an array
    return np.array(list(map(Chi2_Tot, points)))

segasai commented 1 year ago

I think #427 addresses some of the issues with updating bounds, so I am reasonably convinced that things are now functioning as they should. But it is clear that in this case

segasai commented 1 year ago

I'm closing this issue for the time being. A different parallelization scheme for the unif sampler would still be good, but I don't think there is a bug per se.

MikhailBeznogov commented 1 year ago

Hello,

Sorry for not replying earlier.

Yes, I agree, it is not a bug, just a specific feature of implementation.

Thank you for your help.