Hi,
I'll start with this example, which shows that the uniform sampler does utilise all the cores if you have a heavy-to-compute likelihood function:
```python
import numpy as np
import dynesty
import multiprocessing as mp

nlive = 1000
printing = True
ndim = 10
gau_s = 0.01

def loglike_gau(x):
    # busy-loop to make the likelihood artificially expensive
    for i in range(1000000):
        1 + 1
    return (-0.5 * np.log(2 * np.pi) * ndim - np.log(gau_s) * ndim -
            0.5 * np.sum((x - 0.5)**2) / gau_s**2)

def prior_transform_gau(x):
    return x

LOGZ_TRUTH_GAU = 0

def test_pool_samplers():
    # this is to test how the samplers are dealing with queue_size > 1
    rstate = np.random.default_rng(2)
    with mp.Pool(10) as pool:
        sampler = dynesty.NestedSampler(loglike_gau,
                                        prior_transform_gau,
                                        ndim,
                                        nlive=nlive,
                                        sample='unif',
                                        pool=pool,
                                        queue_size=10,
                                        rstate=rstate,
                                        first_update={
                                            'min_eff': 90,
                                            'min_ncall': 10
                                        })
        sampler.run_nested(print_progress=printing)
    assert (abs(LOGZ_TRUTH_GAU - sampler.results['logz'][-1]) <
            5. * sampler.results['logzerr'][-1])

if __name__ == '__main__':
    test_pool_samplers()
```
Regarding what's happening in your case: there is not enough information at the moment to diagnose it (one needs a reproducible example). But the key difference of uniform sampling is that the proposal of points within ellipsoids is done in a single thread; only afterwards are the likelihoods evaluated in parallel.
One possibility for what you're seeing is that the bounding ellipsoid you get for your problem is extreme (i.e. very elongated, much longer than the side of the unit cube). In that case most of the points proposed within the ellipsoid will fall outside the cube, the code will try again, and so on, and all of that happens in a single thread before the likelihood function is ever evaluated.
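To make that concrete, here is a minimal sketch of my own (not dynesty internals; the axis-aligned ellipsoid and its dimensions are made up purely for illustration) that estimates how rarely a uniform draw from a very elongated ellipsoid lands inside the unit cube:

```python
import numpy as np

rng = np.random.default_rng(0)
ndim = 10

# Toy, axis-aligned "ellipsoid" centred on the unit cube: one semi-axis
# much longer than the cube edge, the rest short.  Real dynesty bounds
# are arbitrary ellipsoids; this only illustrates the effect.
semi_axes = np.full(ndim, 0.1)
semi_axes[0] = 5.0  # 10x the cube edge along one direction

# Draw points uniformly from the ellipsoid: uniform direction on the
# sphere, radius with r**ndim uniform, then stretch by the semi-axes.
n = 100_000
v = rng.standard_normal((n, ndim))
v /= np.linalg.norm(v, axis=1, keepdims=True)
r = rng.uniform(size=(n, 1)) ** (1.0 / ndim)
pts = 0.5 + v * r * semi_axes

inside = np.all((pts >= 0.0) & (pts <= 1.0), axis=1)
print(f"fraction of proposals inside the unit cube: {inside.mean():.3f}")
```

With numbers like these only a small fraction of proposals survives, and all the rejected ones are generated serially.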
Another possibility is that your likelihood function is simply too heavy to pickle and that the pickling dominates the overhead.
You can test the first hypothesis by lowering the threshold for the warning here: https://github.com/joshspeagle/dynesty/blob/2d49cf9104e417a44ada7f2f1d013cbfee0acf01/py/dynesty/nestedsamplers.py#L705
You can test the second hypothesis by using `dynesty.pool.Pool`, which eliminates the pickling overhead.
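For reference, a minimal sketch of that approach (the Gaussian log-likelihood and the worker/queue counts here are placeholder choices of mine, not from this thread):

```python
import numpy as np
import dynesty
import dynesty.pool as dypool

ndim = 10

def loglike(x):
    # stand-in Gaussian log-likelihood; replace with your own
    return -0.5 * np.sum((x - 0.5) ** 2) / 0.01 ** 2

def prior_transform(u):
    return u

if __name__ == '__main__':
    # The pool sends loglike/prior_transform to each worker once at
    # startup, so individual likelihood calls no longer pay a per-call
    # pickling cost for the function itself.
    with dypool.Pool(10, loglike, prior_transform) as pool:
        sampler = dynesty.NestedSampler(pool.loglike, pool.prior_transform,
                                        ndim, nlive=1000, sample='unif',
                                        pool=pool, queue_size=10)
        sampler.run_nested()
```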
Finally, if you cannot share a reproducible example, you should at least share the exact way you call dynesty (with all the settings, such as nlive, ndim, etc.) and ideally the full output of the problematic `unif` run.
Hello,
Thank you for your help. I will address your questions below.
Here is the exact way I call `dynesty` with "over-subscription" (`ndim=7`, `npoints=1000`):
```python
with Pool() as MP_pool:
    sampler = dynesty.DynamicNestedSampler(Chi2_Tot, Prior, ndim, bound='multi',
                                           sample='unif',
                                           first_update={'min_eff': 0.50},
                                           update_interval=500.5,
                                           bootstrap=50, enlarge=1.0,
                                           pool=MP_pool, queue_size=1600)
    sampler.run_nested(n_effective=10000, dlogz_init=0.05, nlive_init=npoints,
                       nlive_batch=500)
```
`dynesty` works efficiently in parallel with other samplers (see the `rslice` examples in my opening post). Moreover, `emcee` and my tests where I evaluate the likelihood in parallel using `map` of a multiprocessing pool also work fine, with high parallel efficiency. The former requires enough (>1000) walkers and the latter requires setting `chunksize` to at least 100 to be really parallel-efficient, but I suppose that is because the likelihood takes only ~1 ms to evaluate and a multiprocessing pool inevitably introduces some overhead.
Nevertheless, I would prefer to use the `unif` sampler, as it is more robust and does not require estimating the proper number of `walks` or `slices`. I will send a working example privately.
Hi,
I don't quite agree with your comment number 2. The reason is that when you use rslice or rwalk with, say, walks=50, each worker executes at least 50 likelihood calls per pickle, while with the uniform sampler it is always one call per pickle. Because of that, rslice/rwalk will look better in terms of parallelisation.
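As a back-of-the-envelope illustration (the millisecond figures below are assumptions for the sake of the example, not measurements):

```python
# Illustrative numbers only: assume each likelihood call takes ~1 ms and
# each dispatched task costs ~1 ms of pickling/IPC overhead.
t_like = 1e-3    # seconds per likelihood evaluation
t_task = 1e-3    # seconds of per-task pickle/IPC overhead (assumed)

def useful_fraction(calls_per_task):
    """Fraction of worker time spent on likelihoods rather than overhead."""
    work = calls_per_task * t_like
    return work / (work + t_task)

print(f"unif  (1 call per task): {useful_fraction(1):.0%}")   # ~50%
print(f"rwalk (walks=50):        {useful_fraction(50):.0%}")  # ~98%
```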
But certainly option 3 is quite possible, and I think there are a few possible improvements in that area. One is doing the ellipsoidal sampling in parallel (at the cost of sending the ellipsoidal bounds to the workers); another is detecting the pathological ellipsoids and reshaping them somewhat.
An update on my side: I had a bit of time to look into this and have a couple of points.
Hello,
Thanks for the update.
If fast likelihood functions cause the sampler to lose parallel efficiency, perhaps adding an option to vectorize them (i.e., evaluate the likelihood at more than one point per call) would help? For example, this is implemented in JohannesBuchner/UltraNest and seems to be efficient if the number of points requested per call is large enough. If the likelihood does not support vectorization itself, it can easily be "wrapped" by mapping:
```python
def Chi2_Tot_Vect(points):
    # evaluate the scalar likelihood at each point in the batch
    return np.array(list(map(Chi2_Tot, points)))
```
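A quick usage sketch (with a made-up `Chi2_Tot` so the snippet is self-contained): the wrapper maps the scalar likelihood over the first axis of an `(npoints, ndim)` array and returns an `(npoints,)` array.

```python
import numpy as np

def Chi2_Tot(x):
    # stand-in scalar likelihood for illustration
    return -0.5 * np.sum((x - 0.5) ** 2)

def Chi2_Tot_Vect(points):
    return np.array(list(map(Chi2_Tot, points)))

batch = np.random.default_rng(1).uniform(size=(256, 7))  # 256 points, ndim=7
print(Chi2_Tot_Vect(batch).shape)  # -> (256,)
```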
I think #427 addresses some of the issues with updating bounds, so I am reasonably convinced that things are now functioning as they should. But it is clear that in this case…
I'm closing this issue for the time being. A different parallelization scheme for the `unif` sampler would still be good to have, but I don't think there is a bug per se.
Hello,
Sorry for not replying earlier.
Yes, I agree: it is not a bug, just a specific feature of the implementation.
Thank you for your help.
Hello,
I have noticed a strange behavior of `dynesty` when parallelized (either with a multiprocessing pool or MPI). Before the `first_update` criterion is met, the sampler is always `unif` and it is indeed parallelized. However, after the `first_update` the situation changes. If the requested sampler is `unif`, then after the `first_update` it no longer works in parallel. To be more specific, the multiprocessing pool is running, but only one process at a time is actually fully loaded; the other processes in the pool are idle. The number of likelihood evaluations per second also corresponds to a single-threaded computation.

If the requested sampler is other than `unif` (I have explicitly tried `rwalk` and `rslice`), then after the `first_update` the sampling is performed in parallel (and, judging by the CPU load, even more efficiently than before the `first_update`).

Here are some examples. The first screenshot shows the CPU load with the `unif` sampler before and after the `first_update` (the left black vertical line marks the start of the computation, the right one marks the `first_update`). The second screenshot shows the same for the `rslice` sampler; the black vertical line marks the `first_update`. The difference can be clearly seen.

Since the likelihood in my case takes only 1-2 ms to evaluate, I have also tried the same with huge "over-subscription" by setting `queue_size` to much more than the actual number of threads (1600 vs 32). The results for the `unif` and `rslice` samplers are shown in the next two screenshots. As one can see, for the `unif` sampler the results are qualitatively the same, although the performance has increased roughly 3 times (from ~10 it/s just before the `first_update` to ~30 it/s). The same goes for the `rslice` sampler, which becomes even more parallel-efficient.

I have also checked the behavior of the `unif` sampler when parallelized via MPI (without "over-subscription"). The CPU load in that case is always 100%, but the drop in performance after the `first_update` clearly indicates that the situation is the same as with the multiprocessing pool. Here is a part of the output "around" the `first_update`: [output not preserved in this extract]

The loss of performance after the `first_update` is clearly seen (both in the number of iterations per second and in the number of likelihood evaluations per second). Consequently, using the `unif` sampler is very slow (unless `bound` is set to `'none'` and the `first_update` is never executed, but that is extremely inefficient in combination with the `unif` sampler).

P.S. All tests were done with the latest version, 2.1.0; `first_update={'min_eff': 0.50}` in all tests, `slices=100` for the `rslice` sampler, and `bootstrap=50`. The tests above were performed with `bound='multi'`, but the situation with other types of bounds is similar.