joshspeagle / dynesty

Dynamic Nested Sampling package for computing Bayesian posteriors and evidences
https://dynesty.readthedocs.io/
MIT License

rwalk/rstagger issues #320

Closed: segasai closed this issue 3 years ago

segasai commented 3 years ago

I've been running a few more checks for rwalk/rstagger, specifically the dimensionality test:

python -c 'import test_highdim as T; T.do_gaussians(sample="rwalk")'

Here is the bias vs dimension for different samplers (all other parameters at their defaults):

[updated image]

I'm seeing that for the rslice/unif samplers things are fine, but for rwalk the bias creeps in with dimension. (To see whether this is caused by my recent changes, I've also checked commit 9b8d3f1620fbc06b178aa30cd529967540447902, shown in red.) And I think it shows we need to increase the walks parameter.

Also, rstagger seems broken, and I can't see the same problem in 9b8d3f1620fbc06b178aa30cd529967540447902, so it's something broken by a more recent change.

segasai commented 3 years ago

A somewhat connected question:

What is the rationale behind this logic in update_rwalk https://github.com/joshspeagle/dynesty/blob/b030ff76246b9b9a62c3e4da1285d8667d553fe5/py/dynesty/nestedsamplers.py#L152

It seems pretty involved without much justification.

segasai commented 3 years ago

#321 restores the previous behaviour of rstagger, but it is still pretty bad:

[image]

I think there is something not right, from a statistical point of view, in rstagger. I think it's incorrect in general, as it doesn't have the right invariant distribution. In fact, here is a test program that is supposed to sample uniformly within a diamond using the stagger approach:

import numpy as np
import dynesty.sampling as S

def logl(x0):
    # log-likelihood: 1 inside a diamond-like region bounded by two
    # parabolas, -1 outside of it
    x, y = x0
    a = 10
    sl = 0.01
    x1 = np.abs(x)
    if x1 > a:
        return -1

    if np.abs(y) < (sl * (x1 - a)**2):
        return 1
    else:
        return -1

def priort(x):
    # prior transform: map the unit cube onto the square [-10, 10]^2
    return x * 20 - 10

logls = 0  # likelihood threshold (loglstar): the walk must stay where logl > 0
axes = np.eye(2) * .1  # proposal ellipsoid axes
scale = 1
rstate = np.random.default_rng()
kwargs = {}
u = [.5, .5]  # starting point in the unit cube
print(logl(priort(np.r_[0.5, 0.5])))  # sanity check: the starting point is inside the region
ret = []
for i in range(10000):
    # run the stagger-mode random walk from u and keep the returned unit-cube position
    u = S.generic_random_walk(u,
                              logls,
                              axes,
                              scale,
                              priort,
                              logl,
                              rstate,
                              kwargs,
                              stagger=True,
                              facc_target=.5)[0]
    ret.append(u)
ret = np.array(ret)

[image]

It's clearly non-uniform (while it is uniform in non-stagger mode).

joshspeagle commented 3 years ago

What is the rationale behind this logic in update_rwalk

https://github.com/joshspeagle/dynesty/blob/b030ff76246b9b9a62c3e4da1285d8667d553fe5/py/dynesty/nestedsamplers.py#L158-L161

Oh geez, that is a bit complicated isn't it.

My thinking at the time was probably that the code should try to target a particular acceptance fraction facc by adjusting the scale factor, but we don't want to (1) overshoot the optimal scale, and (2) we want the adjustments to be symmetric about 0.5. So norm tries to handle the symmetry while also making sure the reduction is in relative volume rather than just absolute scale (hence the ncdim dependence). Then I apply the update. Then, as a final check, the scale is prevented from getting larger than sqrt(ncdim), based on the size of the unit cube.
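For reference, here is a minimal sketch of that kind of rule (the names and constants are illustrative, not the exact code at that commit):

import math

def update_scale_sketch(scale, naccept, nreject, facc_target=0.5, ncdim=10):
    # observed acceptance fraction from the last batch of proposals
    facc = naccept / max(naccept + nreject, 1)
    # dividing by ncdim makes the adjustment act on the relative volume
    # of the proposal ellipsoid rather than on its absolute scale
    norm = max(facc_target, 1.0 - facc_target) * ncdim
    scale *= math.exp((facc - facc_target) / norm)
    # final check: keep the scale below the size of the unit cube
    return min(scale, math.sqrt(ncdim))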

joshspeagle commented 3 years ago

And I think it shows we need to increase the walks parameter.

Ideally this should have some dimensional dependence. I know we've talked about this before, so do we want to just introduce that?

I think there is something not right, from a statistical point of view, in rstagger. I think it's incorrect in general, as it doesn't have the right invariant distribution. In fact, here is a test program that is supposed to sample uniformly within a diamond using the stagger approach

Oh yikes. Might be time to just yoink that code, since it is never used anyway and clearly is not giving good behaviour...

segasai commented 3 years ago

And I think it shows we need to increase the walks parameter.

Ideally this should have some dimensional dependence. I know we've talked about this before, so do we want to just introduce that?

It's already there: https://github.com/joshspeagle/dynesty/blob/b030ff76246b9b9a62c3e4da1285d8667d553fe5/py/dynesty/dynesty.py#L117 But it seems that's not enough. (TBH, it's not obvious to me why, in the case of a single Gaussian, 20+ndim walks is not enough.) I'd have thought that almost perfect mixing should happen after ~2 steps (2 because the distance from the side of the sphere is twice the radius).

I think there is something not right, from a statistical point of view, in rstagger. I think it's incorrect in general, as it doesn't have the right invariant distribution. In fact, here is a test program that is supposed to sample uniformly within a diamond using the stagger approach

Oh yikes. Might be time to just yoink that code, since it is never used anyway and clearly is not giving good behaviour...

I'm fine with that, unless there is a paper that implements that algorithm and proves detailed balance (the pseudocode in Sivia & Skilling for sampling within a rectangle isn't convincing enough for me).

segasai commented 3 years ago

My thinking at the time was probably that the code should try to target a particular acceptance fraction facc by adjusting the scale factor, but we don't want to (1) overshoot the optimal scale, and (2) we want the adjustments to be symmetric about 0.5. So norm tries to handle the symmetry while also making sure the reduction is in relative volume rather than just absolute scale (hence the ncdim dependence). Then I apply the update. Then, as a final check, the scale is prevented from getting larger than sqrt(ncdim), based on the size of the unit cube.

Thanks for that description. I'll think more about this code. I think it is incorrect, for example, because scale is an extra factor applied on top of axes, so one should be limiting the resulting ellipsoid (i.e. scale*axes), similar to how it was done for slice sampling.

But there is another separate issue which may need to be addressed (and which I think is more relevant in low dimensions, when walks or slices are small). I think the update_* methods are in danger of relying on small-number statistics and making very noisy updates (which I think leads to a bunch of warnings). A better strategy may be either 1) to limit scale updates to, say, the range (0.25, 2), or 2) to accumulate the results between updates.
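To illustrate option 1, a minimal sketch (with made-up clamp bounds and names, not anything currently in dynesty) could bound the per-update change:

import math

def bounded_scale_update(scale, facc, facc_target=0.5, lo=0.25, hi=2.0):
    # multiplicative adjustment towards the target acceptance fraction...
    factor = math.exp(facc - facc_target)
    # ...but never let a single (possibly very noisy) update change the
    # scale by more than a factor of hi or less than a factor of lo
    factor = min(max(factor, lo), hi)
    return scale * factor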

segasai commented 3 years ago

Just checked that rwalk didn't become worse in one of the many changes over the last few months. It's definitely better than the default behaviour in 1.1:

[image]

joshspeagle commented 3 years ago

That's a relief to see! 😅

segasai commented 3 years ago

While thinking about the scaling and the cause of the bias here, I've tried rwalk but with update_rwalk doing nothing other than setting scale to 1 (because it's not really clear to me why scale needs to be different from 1).

And to my surprise this is the result (black is the default):

[image]

I can't really understand why this is happening...

joshspeagle commented 3 years ago

So in the case where the points are distributed as a Gaussian ellipsoid, you can show that the bounding ellipsoid should, in general, have the same shape as the covariance matrix up to some scaling factor. And to target a constant acceptance fraction for random-walk MCMC proposals on a Gaussian, you need to scale the covariance by a factor of ~1/sqrt(d). I had put in a flexible scale both to account for incorrect ellipsoid distributions (failing to split/oversplitting) and because I expected something like ~1/sqrt(d) behavior might show up.

segasai commented 3 years ago

I can see some rationale in it, although it's not obvious how the optimal scaling for sampling a Gaussian distribution with a Gaussian kernel from Roberts et al. 1998 would apply to a uniform random walk within a uniform distribution. But this factor aside, it is still unclear to me why things break down so catastrophically when you just force scale to 1.

segasai commented 3 years ago

Okay, it looks like in 10-D the scale reaches a value of ~0.4-0.5, which leads to a few times more accepted points (15 vs 3, or something like that). So it seems that setting scale to 1 significantly reduces the acceptance rate, making the chain more correlated, and that leads to the increased bias.

segasai commented 3 years ago

Bumping the walks from 20+ndim to 40+2*ndim didn't change the bias:

[image]

So it must be something else. I kinda struggle to see what else it could be, though.

joshspeagle commented 3 years ago

Could it be due to biased sampling in the beginning, perhaps, where scale is slowly adjusting from 1 to ~0.5 at a rate that gets slower with dimensionality? I haven't tried starting with scale set to, say, ~1/sqrt(ncdim) right from the get-go.

segasai commented 3 years ago

Could it be due to biased sampling in the beginning, perhaps, where scale is slowly adjusting from 1 to ~0.5 at a rate that gets slower with dimensionality? I haven't tried starting with scale set to, say, ~1/sqrt(ncdim) right from the get-go.

It doesn't seem to help either if I preset scale to 1/sqrt(ndim). Weird.

segasai commented 3 years ago

(edited many times) Okay, now I am 100% convinced that the rwalk implementation is also incorrect, because it does not obey detailed balance. Here is an illustration. Imagine our proposal kernel is a ball of radius D, and we have two points x and y within a distance D of each other. Define A(x) as the volume of the part of the ball of radius D centered on x that lies within our volume of interest. Now let's assume we do rwalk with walks=1, i.e. one step.

Since the code is written to try until acceptance (while ncalls < walks or naccept == 0), it means that P(y|x) = 1/A(x) (i.e. the proposal is uniform within the ball, subject to the volume constraint), and similarly P(x|y) = 1/A(y). Because our target distribution is \pi(x) = \pi(y) = 1/Volume, detailed balance is clearly broken whenever A(x) != A(y), i.e. next to the edges.

If the algorithm is written so that it is allowed to return the current point, the balance is immediately restored. Similarly, if the kernel parameters are such that the naccept == 0 branch is never executed (the walks are high enough and the kernel good enough), detailed balance is also preserved, because that is equivalent to a fixed number of individually detailed-balance-preserving steps.
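To spell out the broken condition explicitly: with V the total volume of the region (so \pi(x) = \pi(y) = 1/V), detailed balance requires

\pi(x) P(y|x) = \pi(y) P(x|y), i.e. 1/(V A(x)) = 1/(V A(y)),

which can only hold when A(x) = A(y); near an edge A(x) != A(y), so the chain is not reversible there unless one of the fixes above is applied.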

The reason we start to observe the bias at high dimensions is that, at the start of a run, the first ellipsoids are easily larger than the cube, so the fraction of the ellipsoid volume that is chopped off by the cube edge is large, and the incorrect sampling there therefore matters.

In terms of a solution, I don't really know if there is one that will guarantee returning a new, different point. I can think of doubling approaches, i.e. if we don't succeed in 'walks' steps, we try a new point and double 'walks'. But I'm worried about inventing a new statistical scheme without a mathematical proof of reversibility (slice sampling is quite intricate exactly because of the detailed-balance constraint). Another potential solution (a la rstagger): if we run for 'walks' iterations without success, we reduce the scale and try again, i.e. propose a new starting point and do another sample_rwalk loop. But this also looks like a band-aid, without a guarantee of detailed balance.
Yet another solution may be to accept returning the same point we started with, and then essentially allow losing live points in the process, but that seems quite dangerous.

joshspeagle commented 3 years ago

Okay, good to know the "fix" to avoid returning the same point is what's breaking things here. I think I put that in originally to deal with the fact that the starting point could have L=L*, but since that's now been rectified upstream I think we can just axe that and do the simple (correct) thing. All the tuning of the scale factor can then occur outside the actual random walk step.

segasai commented 3 years ago

But how do we deal with two live points sharing an identical logl, etc.? I don't think that's feasible (unless we start reducing the number of live points).

joshspeagle commented 3 years ago

I think it should be okay provided it doesn't happen all that often. Since this is much more likely to happen early on in the run and near edges though, I'd want to see how it behaves in tests before drawing any conclusions.

segasai commented 3 years ago

I don't think it is a trivial change, i.e. I don't really know what we are supposed to do with the volumes there. And we are supposed to somehow remove two points when they become the worst ones; do we remove them one by one, or both simultaneously? I think it's a major change that needs some theoretical work. Also, the ellipsoidal code is not prepared to take repeated points. And it's a pretty fundamental assumption that the likelihoods of all points can be ordered using the < operator. Dealing with this issue seems similar to dealing with the issue of plateaus, and I think some of the solutions there, from what I remember, indeed required reducing the number of live points.

joshspeagle commented 3 years ago

Can we get around this by just adding a small \epsilon to the final point (whatever it is), re-evaluating the likelihood, and then returning the new x + \epsilon point, where \epsilon is drawn from a re-scaled version of the current ellipsoid (e.g., a 1e-5 scaled covariance)?

segasai commented 3 years ago

Not really; it won't be correct, it's more of a hack. I think the proper ordering of distinct points is crucial, as it determines the volume of the bulk of the posterior, so IMO one has to be careful with that. I think the most likely resolution is to figure out how to deal with repeats exactly, and I think it has to be done by decreasing the number of effective live points. I can easily check what happens with the shrinking fraction in this case using my toy nested sampling setup. Sure, there is a possibility that all the live points will die out, but maybe one can replenish them somehow.

joshspeagle commented 3 years ago

If you can somehow get that to work with the static nested sampling case and preserve some backwards compatibility, go for it. Otherwise, I would prefer to just find something that can be implemented more easily and is "good enough" to ameliorate most of the bias, even if it's not 100% correct.

segasai commented 3 years ago

I think the danger, in my opinion, is that we have a well-quantified bias in the case of a Gaussian, but most people use nested sampling when they have something quite complicated, where who knows what's happening and what the effect of the incorrect sampling is. And with the current behaviour of rwalk, I'd only trust it if the posterior is far enough from the edges of the prior and if the contours of the posterior are close enough to an ellipsoid. (rslice doesn't have any of these issues, so IMO it should potentially be the main sampler.)

segasai commented 3 years ago

Anyway, with the test change of allowing the same point to be returned:

https://github.com/joshspeagle/dynesty/compare/master...segasai:test_branch?expand=1

Importantly, that also required running only a single iteration of propose_ball_point() rather than running it until it returns a point.

So with that change, to my surprise, things just work without a clear breakdown:

[Screenshot from 2021-08-31 19-58-43]

This will make sampling drastically less efficient in the beginning as I disabled all the tunings. That also leaves the possibility of ellipsoid breakdown, as some ellipsoids could become poorly defined if there are too many repeated points.

joshspeagle commented 3 years ago

And with the current behaviour of rwalk, I'd only trust it if the posterior is far enough from the edges of the prior and if the contours of the posterior are close enough to an ellipsoid. (rslice doesn't have any of these issues, so IMO it should potentially be the main sampler.)

I'm happy switching over to rslice as the main sampler for problems with >15-D, given it doesn't have this problem and is more robust, although given your results I would want to verify that the performance of rwalk on more pathological cases doesn't implode.

This will make sampling drastically less efficient in the beginning as I disabled all the tunings.

So this just keeps a fixed scale for everything, you mean?

segasai commented 3 years ago

And with the current behaviour of rwalk, I'd only trust it if the posterior is far enough from the edges of the prior and if the contours of the posterior are close enough to an ellipsoid. (rslice doesn't have any of these issues, so IMO it should potentially be the main sampler.)

I'm happy switching over to rslice as the main sampler for problems with >15-D, given it doesn't have this problem and is more robust, although given your results I would want to verify that the performance of rwalk on more pathological cases doesn't implode.

TBH, my current view is that I don't really understand why the patched rwalk gives the correct result now, because it shouldn't ))

This will make sampling drastically less efficient in the beginning as I disabled all the tunings.

So this just keeps a fixed scale for everything, you mean?

No, I meant that within a single sample_rwalk there is no tuning. I.e. it strictly does this loop 'walks' times:

1) propose one point in the ellipsoid; if it is outside the cube, 'continue'
2) if the point is inside the ellipsoid and the logl condition is satisfied, move the current point to the new location

and then returns the current location.
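For concreteness, a minimal sketch of that loop (the helper logic and names here are hypothetical, not the actual dynesty implementation):

import numpy as np

def fixed_rwalk(u, loglstar, axes, scale, prior_transform, loglikelihood, rng, walks=25):
    # strictly 'walks' iterations; the starting point may be returned
    # unchanged if nothing is ever accepted
    u = np.asarray(u, dtype=float)
    n = len(u)
    for _ in range(walks):
        # 1) propose one point uniformly within the scaled ellipsoid around u
        dr = rng.standard_normal(n)
        dr *= rng.uniform() ** (1.0 / n) / np.linalg.norm(dr)  # uniform in the unit ball
        u_prop = u + scale * np.dot(axes, dr)
        # proposal outside the unit cube: skip it and keep going
        if np.any(u_prop < 0) or np.any(u_prop > 1):
            continue
        # 2) move only if the likelihood constraint is satisfied
        if loglikelihood(prior_transform(u_prop)) > loglstar:
            u = u_prop
    # return the current location (possibly the starting point)
    return u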

segasai commented 3 years ago

Just for completeness, I attach the result based on the submitted patch:

[Screenshot from 2021-08-31 20-54-52]

I think I'll stop on this issue for the moment. I'd say for the time being I'd consider rslice trustworthy, and rwalk more experimental, given these changes.