joshspeagle / dynesty

Dynamic Nested Sampling package for computing Bayesian posteriors and evidences
https://dynesty.readthedocs.io/
MIT License

Convergence is slow in high dimensions #97

Closed dtjohnson closed 6 years ago

dtjohnson commented 6 years ago

Thanks for the great library!

I'm having issues with convergence in a problem with ~150 dimensions. I've created a toy problem in this notebook that exhibits the same behavior as my real problem.

In this toy problem we're looking at a data signal that is the sum of Gaussians centered at all positions from 0-150, with unknown amplitudes but known widths. The total signal is subject to noise of a known magnitude. The majority of the Gaussians have their amplitude set exactly to 0, indicating the absence of that component. The posterior gives the probability of the amplitudes at each position. Since most of these amplitudes are 0, the posterior peaks at an edge in the hypercube.
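Schematically, the setup looks something like this (the names and numbers below are illustrative, not the notebook's exact values):

```python
import numpy as np

# Illustrative stand-in for the toy problem: a signal that is a sum of
# Gaussians of known width, one centred at each integer position, with
# unknown amplitudes and known Gaussian noise. All values are placeholders.
n = 150                                   # number of positions / dimensions
x = np.arange(n, dtype=float)             # positions where the signal is measured
width = 2.0                               # known width of each Gaussian
sigma_noise = 0.1                         # known noise level

# basis[:, j] is a unit-amplitude Gaussian centred at position j
basis = np.exp(-0.5 * ((x[:, None] - x[None, :]) / width) ** 2)

rng = np.random.default_rng(0)
true_a = np.zeros(n)                      # most amplitudes are exactly zero
true_a[rng.choice(n, size=10, replace=False)] = 1.0
data = basis @ true_a + rng.normal(0.0, sigma_noise, size=n)

def loglike(a):
    """Gaussian log-likelihood with a known noise level."""
    resid = data - basis @ a
    return -0.5 * np.sum((resid / sigma_noise) ** 2)

def prior_transform(u):
    """Uniform amplitude priors on [0, 1]: the unit cube maps to itself."""
    return u
```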

The challenge is that I'm getting extremely slow convergence with dynesty. I've tried numerous settings and haven't had any luck. The best I've found is the slice sampler, which takes about 8 minutes to converge with only nlive=1.
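The dynesty call itself is roughly the following (reusing the illustrative names above; nlive and dlogz here are placeholders rather than the notebook's exact settings):

```python
import dynesty

# Rough sketch of the invocation; the notebook also experiments with much
# smaller nlive values and with the other samplers ('rwalk', 'hslice', ...).
sampler = dynesty.NestedSampler(loglike, prior_transform, n,
                                nlive=100, sample='slice')
sampler.run_nested(dlogz=0.01)
results = sampler.results
```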

A random walk should work well in high dimensions, but it doesn't seem that your implementation adapts the step size at each accept/reject. Because of that, the rwalk sampler has extremely poor efficiency and stalls.

I implemented a very naive nested sampling algorithm in the same notebook, à la Sivia/Skilling, that adjusts the random-walk scale at each sample, and it converges in only 5 seconds.
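The scale adaptation I mean is essentially the textbook Sivia & Skilling scheme; simplified, it looks like this (the constants and boundary handling differ slightly from the notebook):

```python
import numpy as np

def explore(start, loglstar, loglike, nsteps=20, step=0.1):
    """Naive constrained random walk that grows the step size after each
    accepted move and shrinks it after each rejection (Sivia & Skilling 2006).
    """
    u = np.array(start, dtype=float)
    accept = reject = 0
    for _ in range(nsteps):
        trial = u + step * np.random.normal(size=u.size)
        trial = np.clip(trial, 0.0, 1.0)      # force proposals back into the unit cube
        if loglike(trial) > loglstar:         # stay above the likelihood threshold
            u, accept = trial, accept + 1
        else:
            reject += 1
        # adapt the scale toward ~50% acceptance
        if accept > reject:
            step *= np.exp(1.0 / accept)
        elif accept < reject:
            step /= np.exp(1.0 / reject)
    return u, step
```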

Is there something I'm missing?

joshspeagle commented 6 years ago

I've created a toy problem in this notebook that exhibits the same behavior as my real problem.

Thanks for the notebook. It was helpful to illustrate the issue you are describing.

Since most of these amplitudes are 0, the posterior peaks at an edge in the hypercube.

It's good to note that your problem is poorly suited to nested sampling (and to MCMC as well), since having the likelihood slam up against the edge of the prior makes it difficult to sample in almost any case: assuming you're proposing symmetrically, you'll quickly have ~1-(1/2)^(n_zeros) of your samples wasted querying points outside the unit cube, which is probably what's happening with most configurations. The multivariate slice sampler slice is probably doing alright because it proposes along the principal axes and is not sensitive to the choice of bounds. I would naively recommend a prior defined in log space, e.g. a log-normal, to "spread out" the probability near zero and avoid this issue.
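As a concrete (if naive) example, a prior transform along these lines would replace the hard edge at zero with a log-normal that concentrates mass near zero without a boundary (the scale parameters here are placeholders, not recommendations):

```python
import numpy as np
from scipy.stats import lognorm

def prior_transform(u):
    """Map the unit cube to independent log-normal amplitude priors.

    s (the log-space sigma) and scale (exp of the log-space mean) are
    placeholder values chosen only for illustration.
    """
    return lognorm.ppf(u, s=1.0, scale=np.exp(-2.0))
```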

A random walk should work well in high dimensions, but it doesn't seem that your implementation adapts the step size at each accept/reject. Because of that, the rwalk sampler has extremely poor efficiency and stalls.

Unfortunately, the lack of step-size adaptation isn't the root of the problem. While your problem is particularly pathological, random walks in general scale very poorly with dimension; see, e.g., Neal (2012). Slice sampling (slice, rslice, and hslice) scales better, but ultimately gradients are what you really need in very high dimensions.

I implemented a very naive nested sampling algorithm in the same notebook, à la Sivia/Skilling, that adjusts the random-walk scale at each sample, and it converges in only 5 seconds. Is there something I'm missing?

Looking over your notebook, it looks like you violate detailed balance at the edges by thresholding to 0 and 1. While this makes it straightforward to run, it means you're actually changing the proposal distribution based on your location, on top of the Sivia & Skilling (2006) adaptation, so you're not proposing fully "random" samples from the constrained prior. While this indeed leads to rapid convergence (since all proposals are valid, you're not wasting any of them querying invalid points), it's not formally correct. This probably doesn't matter for this particular problem, because there's such a large difference between the zero and non-zero values, but that's what's going on.
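To see the issue concretely: for a live point sitting near the boundary, thresholding piles a large fraction of proposals exactly onto the edge, so the effective proposal depends on where the point sits (the values below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
u = np.full(150, 0.01)                     # a live point hugging the lower edge
step = 0.1
trial = u + step * rng.normal(size=u.size)

# Thresholding maps everything that lands below 0 exactly onto the boundary;
# a point in the interior and a point at the edge no longer propose
# symmetrically, which is what breaks detailed balance.
clipped = np.clip(trial, 0.0, 1.0)
print(np.mean(clipped == 0.0))             # a sizeable fraction piles up at 0
```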

On a side note, you've also implicitly fixed the proposal to be symmetric; dynesty tries to determine the proposal axes adaptively at runtime using the distribution of live points, but defaults to this behavior when only one live point is provided (as in your notebook).

joshspeagle commented 6 years ago

Unrelated to your specific issue: if you'd like, I'm happy to try and add the Sivia & Skilling proposal into dynesty as a separate sampling scheme.

dtjohnson commented 6 years ago

Hey! My distribution is no more pathological than the beta distribution. :)

Good catch on the detailed balance. I changed to use np.mod(trial, 1.0). It does slow down convergence on my naive implementation for edge points, as you predicted.

I'm familiar with HMC and NUTS. Slice sampling is new to me. I do need to dig into it and your particular implementation to understand them better. I've used NUTS in PyMC3 and Sampyl and found it to be a pain in the neck to get Theano and Autograd, respectively, to play nice beyond simple problems. I'm curious how you are calculating gradients, but I did switch to hslice in this problem and found it to be much faster than slice.

While the fact that many of the values are hard 0's is an interesting challenge, the divergence in performance between hslice and my naive implementation (with the wraparound correction) is actually more severe if you set them all to non-zero values. For example:

true_a = np.full(n, 0.5)

Now the posterior peaks at the center of the unit cube but the performance of hslice is an order of magnitude slower than the naive implementation. Furthermore, when I increase n to 500, the naive implementation finds a solution but hslice crashes with an overflow error.

/Users/djohnson/anaconda3/lib/python3.6/site-packages/dynesty/dynesty.py:216: UserWarning: Beware: `nlive <= 2 * ndim`!
  warnings.warn("Beware: `nlive <= 2 * ndim`!")
iter: 1009 | bound: 0 | nc: 317 | ncall: 281813 | eff(%):  0.358 | loglstar:   -inf < -9232.623 <    inf | logz: -9932.674 +/- 22.026 | dlogz:  5.391 >  0.001            /Users/djohnson/anaconda3/lib/python3.6/site-packages/dynesty/sampler.py:709: RuntimeWarning: overflow encountered in double_scalars
  math.exp(loglstar_new - logz_new) * loglstar_new)
iter: 1010 | bound: 0 | nc: 292 | ncall: 282105 | eff(%):  0.358 | loglstar:   -inf < -9222.561 <    inf | logz: -9923.332 +/-    nan | dlogz: 10.728 >  0.001            /Users/djohnson/anaconda3/lib/python3.6/site-packages/dynesty/sampler.py:713: RuntimeWarning: invalid value encountered in double_scalars
  dh = h_new - h
iter: 1017 | bound: 0 | nc: 286 | ncall: 284138 | eff(%):  0.358 | loglstar:   -inf < -8964.442 <    inf | logz: -9670.066 +/-    nan | dlogz: 38.261 >  0.001            /Users/djohnson/anaconda3/lib/python3.6/site-packages/dynesty/sampler.py:708: RuntimeWarning: overflow encountered in double_scalars
  lzterm = (math.exp(loglstar - logz_new) * loglstar +
iter: 1023 | bound: 0 | nc: 283 | ncall: 285860 | eff(%):  0.358 | loglstar:   -inf < -8799.145 <    inf | logz: -9508.928 +/-    nan | dlogz: 37.530 >  0.001            
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<timed exec> in <module>()

~/anaconda3/lib/python3.6/site-packages/dynesty/sampler.py in run_nested(self, maxiter, maxcall, dlogz, logl_max, add_live, print_progress, print_func, save_bounds)
    828                                      logl_max=logl_max,
    829                                      save_bounds=save_bounds,
--> 830                                      save_samples=True)):
    831             (worst, ustar, vstar, loglstar, logvol, logwt,
    832              logz, logzvar, h, nc, worst_it, boundidx, bounditer,

~/anaconda3/lib/python3.6/site-packages/dynesty/sampler.py in sample(self, maxiter, maxcall, dlogz, logl_max, save_bounds, save_samples)
    707             logz_new = np.logaddexp(logz, logwt)
    708             lzterm = (math.exp(loglstar - logz_new) * loglstar +
--> 709                       math.exp(loglstar_new - logz_new) * loglstar_new)
    710             h_new = (math.exp(logdvol) * lzterm +
    711                      math.exp(logz - logz_new) * (h + logz) -

OverflowError: math range error

I've also seen this same overflow error with only ~150 dimensions when I incorporate multiple sets of data on the sample population.

joshspeagle commented 6 years ago

My distribution is no more pathological than the beta distribution.

Depends on the coefficients. My point was not that your case is somehow "special": there are many distributions that are "pathological" for nested sampling (and/or standard Metropolis-Hastings MCMC with Gaussian proposals) because of the way the priors are set up or the way sampling proceeds.

I changed to use np.mod(trial, 1.0).

This should be equivalent to assuming periodic boundary conditions. I hope to have that integrated into dynesty soon following #94.

I'm curious how you are calculating gradients, but I did switch to hslice in this problem and found it to be much faster than slice.

I'm not, actually -- hslice implements a modified version of the algorithm proposed/outlined in this paper. I haven't updated the docs to include a proper description yet, which is my fault for being slow.

Now the posterior peaks at the center of the unit cube but the performance of hslice is an order of magnitude slower than the naive implementation.

This is because hslice (like the other slice sampling approaches) doesn't necessarily use a "fixed" number of trials at each iteration, unlike the random walk approaches. The exact dependencies can be somewhat complicated, but in general the autocorrelation between samples for hslice should be of order unity.

when I increase n to 500, the naive implementation finds a solution but hslice crashes with an overflow error.

Looks like more issues with the log-based math. I've tried to get most of these sorted out within dynesty, but evidently a few remain. I'll see what I can do.
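For reference, the failing term multiplies exp(loglstar - logz_new) by loglstar, and that exponent can be large and positive while the running logz estimate is still crude. A guard along these lines, working in log space and clipping the exponent, avoids the hard crash (this is only a sketch of the kind of fix needed, not dynesty's actual implementation):

```python
import numpy as np

def exp_times(a, b):
    """Compute exp(a) * b without overflowing by staying in log space.

    The sign of b is carried separately and the exponent is clipped to stay
    below the largest double-precision value. Sketch only.
    """
    if b == 0.0:
        return 0.0
    log_mag = a + np.log(abs(b))           # log of |exp(a) * b|
    return np.sign(b) * np.exp(min(log_mag, 709.0))

# e.g.:
#   lzterm = (exp_times(loglstar - logz_new, loglstar)
#             + exp_times(loglstar_new - logz_new, loglstar_new))
```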

joshspeagle commented 6 years ago

Also, for this problem you don't need any autograd software at all: the gradients are all analytic. I've implemented a simple Hamiltonian sampler for a similar problem, which I can link to if you're interested. It's a better choice if you don't mind putting the gradients in by hand.
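For a model like the toy one sketched earlier in this thread, for instance, the gradient of the Gaussian log-likelihood with respect to the amplitudes is a single matrix-vector product (reusing the illustrative basis/data/sigma_noise names from that sketch):

```python
def grad_loglike(a):
    """Analytic gradient of the toy Gaussian log-likelihood w.r.t. the amplitudes."""
    resid = data - basis @ a
    return basis.T @ resid / sigma_noise**2
```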

dtjohnson commented 6 years ago

Unfortunately, while the math is easy in this toy problem, my real problem is much more complicated. Computing the gradients by hand isn't realistic. I'll take a look at that paper for hslice.

joshspeagle commented 6 years ago

Okay, sounds good. I'll see if I can figure out what's going wrong with the log-based math and plan to leave this open until that's resolved.