joshspeagle / dynesty

Dynamic Nested Sampling package for computing Bayesian posteriors and evidences
https://dynesty.readthedocs.io/
MIT License
347 stars 76 forks

Performance of Hamiltonian slice sampling compared to random walk #146

Closed fbartolic closed 5 years ago

fbartolic commented 5 years ago

Hi Josh, thanks for releasing this fantastic package!

I'm using dynesty to fit a model defined in PyMC3 and comparing its performance to HMC. I find that Dynamic Nested Sampling with the rwalk sampler works pretty well on an 8-dimensional model with a multi-modal posterior. However, when I pass the gradients to dynesty and use the hslice sampler, the model doesn't really converge and the sampling efficiency drops from about 3% with rwalk to about 0.3% with hslice.
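
In case it's useful, this is roughly how I'm switching between the two samplers (heavily simplified; the actual log-likelihood, prior transform, and gradient come from the PyMC3 model, so the functions below are just placeholders):

```python
import numpy as np
import dynesty

ndim = 8  # placeholder matching the model's dimensionality

def loglike(x):
    # placeholder for the real (multi-modal) log-likelihood from PyMC3
    return -0.5 * np.sum(x**2)

def prior_transform(u):
    # placeholder map from the unit cube to the prior
    return 20.0 * u - 10.0

def gradient(x):
    # placeholder gradient of loglike; only used by 'hslice'
    # (check the dynesty docs / compute_jac for whether this should be taken
    #  with respect to the unit cube or the target-space parameters)
    return -x

# random-walk sampling (no gradients needed)
rwalk_sampler = dynesty.DynamicNestedSampler(
    loglike, prior_transform, ndim, sample='rwalk')

# Hamiltonian slice sampling, passing the gradient
hslice_sampler = dynesty.DynamicNestedSampler(
    loglike, prior_transform, ndim, sample='hslice', gradient=gradient)
```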

Is this behaviour expected? Naively, I'd expect that using the gradient information would improve sampling performance rather than degrade it. The exact same model converges easily with NUTS in PyMC3 (albeit to only one of the modes).

Sorry I can't provide a minimal example; the code is a bit hacky with too many dependencies.

joshspeagle commented 5 years ago

This seems about right to me. Nested Sampling has a really hard time using gradient information because by construction it can't use the curvature of the space to plan subsequent proposals (which is what makes HMC so amazing) since that breaks the "uniform sampling" condition. This means proposals essentially involve moving in a straight line for a while before "bouncing" off the edge (the only step that uses the gradient). This is incredibly inefficient, especially since you need to have the bounce point be close to the edge to minimize terrible proposals (which requires quite small timesteps -- I think the default in dynesty is like 20/1 moves/bounce).
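
To make that concrete, here's a toy sketch of one such reflective move (purely illustrative, not the actual dynesty implementation): the particle drifts in a straight line at a fixed step size, and the gradient is only consulted when a trial point falls outside the L > L_worst region, where it's used to reflect the velocity off the boundary.

```python
import numpy as np

def reflective_move(x, loglike, grad_loglike, logl_worst, rng,
                    dt=0.05, nsteps=100):
    """Toy 'Hamiltonian slice'-style move: straight-line drift inside the
    hard constraint logL(x) > logl_worst, with specular reflections off the
    boundary. Illustrative only."""
    v = rng.standard_normal(x.size)
    v /= np.linalg.norm(v)                  # random initial direction
    ncall = 0
    for _ in range(nsteps):
        x_trial = x + dt * v                # straight-line step (no gradient)
        ncall += 1
        if loglike(x_trial) > logl_worst:   # still inside the constrained region
            x = x_trial
        else:                               # crossed the boundary: bounce
            n = grad_loglike(x_trial)       # the only place the gradient enters
            n = n / np.linalg.norm(n)
            v = v - 2.0 * np.dot(v, n) * n  # reflect the velocity
    return x, ncall
```

Every trial step costs a likelihood call whether or not it lands inside the constraint, which is where the huge ncall overhead comes from.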

While the scaling with dimensionality is much better even with this really dumb scheme, the constant pre-factor out front related to these inefficient proposals essentially kills you when running on most reasonably-sized problems.

If convergence means "it runs too slowly", then that sounds like the expected behavior. If convergence means "even letting it run for a similar number of iterations as rwalk I find weird behavior", that's more concerning. I didn't vet my implementation nearly as thoroughly as the other approaches since it was by construction such an edge case, so if it is having real performance issues I can look into officially deprecating it.

fbartolic commented 5 years ago

Thanks for clarifying! I'm still trying to wrap my head around the differences between Nested Sampling and HMC.

I mean it runs too slowly. Here's the output of both runs:

'rwalk': iter: 37635 | batch: 8 | bound: 479 | nc: 25 | ncall: 1115285 | eff(%): 3.374 | loglstar: 1028.106 < 1036.613 < 1035.031 | logz: 1000.262 +/- 0.354 | stop: 0.877

'hslice': iter: 31883 | batch: 6 | bound: 311 | nc: 685 | ncall: 19756374 | eff(%): 0.161 | loglstar: 1028.503 < 1037.027 < 1035.168 | logz: 999.567 +/- 0.358 | stop: 0.970

~1M likelihood calls for rwalk vs ~20M for hslice.
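
As far as I can tell, the eff(%) column is just iterations divided by total likelihood calls:

```python
print(100 * 37635 / 1115285)    # ~3.374  (rwalk)
print(100 * 31883 / 19756374)   # ~0.161  (hslice)
```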

joshspeagle commented 5 years ago

I'm still trying to wrap my head around the differences between Nested Sampling and HMC.

The fundamental difference for sampling is what the gradient can be used for. HMC, because it just needs to satisfy detailed balance (as an MCMC algorithm), uses the gradient to propose new positions along a given smooth trajectory. This allows it to take advantage of the curvature of the space. Nested Sampling, however, samples under a stronger constraint: generating samples uniformly with L > L_worst. Any attempt to use the gradient to propose positions based on the shape of the posterior fundamentally violates this assumption, making it hard to incorporate. I'm sure there's a clever implementation out there that somehow gets around this problem, but it makes gradients only useful for "bouncing" within this uniform hyper-volume.
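
If it helps, here's a crude way to see that contrast in code (purely illustrative; this is neither NUTS nor dynesty's machinery): HMC consults the gradient at every leapfrog step to steer where the trajectory goes, whereas a nested-sampling proposal has to come out uniform inside {x : logL(x) > logL_worst}, so the gradient can't be used to bias where the point lands.

```python
import numpy as np

def hmc_trajectory(x, p, grad_logpost, dt=0.1, nsteps=20):
    """HMC-style leapfrog integration: the gradient shapes every step."""
    p = p + 0.5 * dt * grad_logpost(x)
    for _ in range(nsteps - 1):
        x = x + dt * p
        p = p + dt * grad_logpost(x)
    x = x + dt * p
    p = p + 0.5 * dt * grad_logpost(x)
    return x, p

def nested_proposal(loglike, logl_worst, lo, hi, ndim, rng):
    """Nested-sampling-style proposal: the new point must be uniform within
    the hard constraint logL(x) > logl_worst. The gradient plays no role in
    choosing where the point lands (only in things like boundary bounces)."""
    while True:
        x = rng.uniform(lo, hi, size=ndim)
        if loglike(x) > logl_worst:
            return x
```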

I mean it runs too slowly.

Ah, okay. Please let me know if it appears to give answers that are wildly off. Otherwise, I would recommend avoiding the method unless you're in >40 dimensions. ;)