joshspeagle / dynesty

Dynamic Nested Sampling package for computing Bayesian posteriors and evidences
https://dynesty.readthedocs.io/
MIT License

Sampling gets stuck after repeated instances of "UserWarning: Random walk proposals appear to be extremely inefficient." #160

Closed. asb5468 closed this 4 years ago

asb5468 commented 4 years ago

Hello, I'm using dynesty for an 8-dimensional hierarchical inference problem. I have launched a set of 100 runs with the same likelihood and priors (different data), but most of the runs seem to run indefinitely after throwing "UserWarning: Random walk proposals appear to be extremely inefficient. Adjusting the scale-factor accordingly." I tried restarting with reflective boundary conditions, but those runs have also gotten stuck. None of the remaining ~60 runs have been able to accept new points for over a week (the runs that did finish did so in about a day), and they got stuck at dlogZ values ranging from ~0.15 to 7. I tried digging into where this happens in the code, and it looks like the scale factor is reduced each time this warning comes up. To me it seems like this would be isolating the sampler into a false maximum, since if it were narrowing in on the true peak, a new point would have been accepted or the dZ threshold would have been passed. Is there anything I can do to bypass this problem? My likelihood isn't abnormally peaked, and the posteriors in the trace plots look reasonable but wider than I'd expect for a finished run. Thank you!
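
For context, a minimal sketch of the kind of setup being described; the likelihood, priors, and the choice of which parameters are treated as reflective are placeholders, not the poster's actual 8-dimensional hierarchical model:

```python
import numpy as np
import dynesty

ndim = 8  # matches the dimensionality mentioned above

def prior_transform(u):
    # Placeholder: map the unit cube to flat priors on [-10, 10] in every dimension.
    return 20.0 * u - 10.0

def loglike(x):
    # Placeholder Gaussian log-likelihood standing in for the hierarchical model.
    return -0.5 * np.sum(x**2)

# Random-walk sampling with reflective boundary conditions on (hypothetically)
# the last two parameters, as in the restarted runs described above.
sampler = dynesty.NestedSampler(loglike, prior_transform, ndim,
                                bound='multi', sample='rwalk',
                                reflective=[6, 7])
sampler.run_nested(dlogz=0.1)
results = sampler.results
```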

joshspeagle commented 4 years ago

I tried digging into where this happens in the code, and it looks like the scale factor is reduced each time this warning comes up. To me it seems like this would be isolating the sampler into a false maximum, since if it were narrowing in on the true peak, a new point would have been accepted or the dZ threshold would have been passed. Is there anything I can do to bypass this problem?

You are 100% correct that repeated instances of this behavior would cause the sampler to get stuck in a false maximum. The original intent was that this generally happens either close to an edge or in cases where the covariance structure has been misestimated, so shrinking the proposal allows a sample to be proposed from the current live point and sampling can hopefully proceed as normal with the other points. In practice, however, you sometimes get behavior like what you are seeing.
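
Schematically, the adaptive behaviour being described looks something like the toy function below. This is only an illustration of the idea, not dynesty's actual implementation; the shrink factor and target acceptance rate are made up for the example.

```python
import math

def adapt_scale(scale, naccept, nwalks, shrink=0.5, target=0.5):
    """Toy illustration (not dynesty's code) of the scale adaptation described
    above: if a random walk of `nwalks` steps accepts nothing, shrink the
    proposal scale so the next walk stays closer to the current live point;
    otherwise nudge the scale toward a target acceptance fraction."""
    if naccept == 0:
        # The "extremely inefficient" branch behind the warning: repeated
        # shrinking like this can isolate the sampler around one live point,
        # which is the failure mode reported in this thread.
        return scale * shrink
    return scale * math.exp(naccept / nwalks - target)
```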

In terms of getting around this, in theory using another sampling method (e.g., 'unif') might resolve it, but that's not really much of a fix. This issue could actually be evidence of a bug elsewhere: there have been some hard-to-squash issues with estimating covariances, so this might just be "my covariance always makes me propose nans so I can't accept anything and get stuck in an endless loop". Knowing whether or not that's happening would be useful here, if you can pull out some of that information for those runs. The other fix would be on my end: I could try to rewrite some of the internals of the code to allow for "restarts", so that after failing to propose a new position from a random live point you just pick a new one at random and try again. I think such a change to the internals would be possible.
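
For reference, switching the proposal method is just a constructor argument; a short sketch reusing the placeholder names from the earlier example:

```python
# 'unif' proposes points uniformly within the bounding ellipsoids instead of
# random-walking away from a live point, so it avoids the scale-factor logic
# discussed above (at the cost of efficiency in higher dimensions).
sampler = dynesty.NestedSampler(loglike, prior_transform, ndim,
                                bound='multi', sample='unif')
sampler.run_nested(dlogz=0.1)
```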

asb5468 commented 4 years ago

Thanks for the quick reply! I did take a look at the values of the likelihood and parameters that it was trying, and the log-likelihoods were finite in a narrow range with a scatter of around 4. The parameters weren't very near the true values that I used when simulating the data. I think this means it's not getting stuck in a wonky part of the parameter space that produces nans. In the meantime I'll try with uniform sampling!

joshspeagle commented 4 years ago

Okay, good to know. If this problem continues to arise I am happy to troubleshoot fixes. Best of luck!

nikhil-sarin commented 4 years ago

Hiya, I am also experiencing the same issue, with a similar diagnosis to @asb5468's. I'll try the 'unif' method, but an internal fix to randomly choose a different point would be great!

joshspeagle commented 4 years ago

Okay, I should definitely look into this sooner rather than later then. It seems more people have been having this issue after the overhaul associated with the PRs refactoring the periodic and reflective boundary conditions, so I should check that everything is working as intended. I'll try to get to this ASAP.

joshspeagle commented 4 years ago

Okay, I don't see any obvious bugs in the rwalk method, but if either of you has a copy of the results object saved from one of these stalled runs that you can send me, that would be extremely helpful for diagnosing any problems I might have missed in the actual proposal (new eyes tend to catch things) or any problems that might be occurring upstream.

Regardless, introducing a new feature to allow for retries (with an associated warning) should be a good addition. Functionally, I should be able to add a quick check here when new points are proposed that detects whether an evaluation "failed" (based on a new flag set if, say, the method tries and fails to find a new point after N attempts, with N = 5 or so), removes it from the queue if so, and then restarts the whole process of proposing new points once the queue is empty. This should work well with parallelized proposals, since any failed proposals can simply be removed without requiring a full replacement of the associated queue. Sound like a reasonable implementation?
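
A rough sketch of the retry flow being proposed, with hypothetical helper names (`fill_queue`, `propose_from`); this is not the code that eventually landed in dynesty, just an illustration of the logic described above:

```python
import warnings

MAX_PROPOSAL_ATTEMPTS = 5  # the "N" discussed above; purely illustrative

def fill_queue(live_points, propose_from, rng):
    """Hypothetical sketch (not dynesty's eventual implementation) of the
    retry behaviour proposed above: try to propose a new point starting from
    a randomly chosen live point, and if the proposal "fails" N times in a
    row, warn and restart from a different live point instead of stalling."""
    while True:
        start = live_points[rng.integers(len(live_points))]
        for _ in range(MAX_PROPOSAL_ATTEMPTS):
            point, ok = propose_from(start, rng)  # user-supplied proposal step
            if ok:
                return [point]  # a one-element queue holding the successful proposal
        # Every attempt from this live point failed: drop it and retry the
        # whole process from another randomly chosen live point.
        warnings.warn("Proposal failed repeatedly; retrying from a new live point.")
```

Returning after the first success keeps the sketch short; the parallel-proposal queue described above would instead hold several points and only replace the ones whose proposals failed.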

asb5468 commented 4 years ago

Yep, I was thinking of a similar workaround, but I guess N should be chosen so that the logic doesn't backfire in the original cases you envisioned for reducing the scale factor of the ellipse. I've attached a compressed pickled resume file (it can be unpickled with Python 2.7) for one of my runs that got stuck like this. Let me know if you have trouble opening it! power_law_99_hyper_dyn_resume.pickle.zip

joshspeagle commented 4 years ago

Okay, I've played around with the results for a bit. I think part of the problem here is that some of the parameters appear to remain totally unconstrained, and those that are constrained appear to be hitting the edge. I've just plotted two of them below.

[two parameter plots attached]

This looks like it's messing with the ellipsoid decomposition and proposals, as seen in the distribution of live points.

[plot of the live-point distribution attached]

The scale factor then attempts to adjust to compensate.

[plot of the scale factor over the run attached]

What really appears to be the issue is that either a few of the decompositions are terrible or the associated live points are on top of an island of absolutely amazing log-likelihood (most likely the former). So while things run OK most of the time, you indeed get these humongous spikes in the number of likelihood calls per iteration where it stalls:

[plot of likelihood calls per iteration attached]
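
For anyone wanting to reproduce this kind of check, a hedged sketch of the diagnostic. It assumes the unpickled object exposes a dynesty Results-style 'ncall' record; the attached file reportedly needs Python 2.7 to unpickle, so treat this as the general procedure rather than a drop-in recipe.

```python
import pickle
import numpy as np
import matplotlib.pyplot as plt

# Load the saved run; the filename is the attachment from the comment above.
with open('power_law_99_hyper_dyn_resume.pickle', 'rb') as f:
    res = pickle.load(f)

# Likelihood calls per iteration: large spikes mark the iterations where the
# random-walk proposals stalled, as described above.
ncall = np.asarray(res['ncall'])
plt.plot(ncall)
plt.xlabel('iteration')
plt.ylabel('likelihood calls per iteration')
plt.show()
```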

So I can definitely confirm the problem. I think this is a result of the ellipsoid decomposition not performing well on this problem (and/or being unstable, or both), leading to occasionally awful proposals that the code then attempts to compensate for. Introducing retries should definitely resolve this, so I'll try to patch that in soon. Alternatively, it might also be worth playing around with vol_dec and vol_check (to make the decompositions more or less aggressive), since you might either want much more aggressive clustering (to capture this substructure) or more conservative clustering (to just probe the large-scale structure).
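
For concreteness, a sketch of how those knobs were passed: in the dynesty versions current when this comment was written, vol_dec and vol_check were constructor arguments (later releases dropped them). The values below are illustrative only, and the exact semantics should be checked against the installed version's docstring.

```python
# Reusing the placeholder loglike / prior_transform / ndim from the earlier sketch.
# vol_dec and vol_check control how readily the 'multi' bound splits its
# bounding ellipsoids; moving them away from their defaults (0.5 and 2.0 at
# the time, if memory serves) makes the decomposition more or less aggressive.
sampler = dynesty.NestedSampler(loglike, prior_transform, ndim,
                                bound='multi', sample='rwalk',
                                vol_dec=0.3, vol_check=4.0)  # illustrative values
sampler.run_nested(dlogz=0.1)
```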

Thanks a lot for sending those results my way!

joshspeagle commented 4 years ago

I'm closing this for now. Hopefully the changes in #163, along with resolving #78 and #140, will also fix this.

3fon3fonov commented 2 years ago

I am afraid I have to re-open this. With dynesty 1.2 (dynamic nested sampling, rwalk, multi bound, posterior fraction = 1.0, dlogz stop = 0.1), I am far too often seeing the following (example):

```
22895it [32:30,  3.07it/s, batch: 0 | bound: 280 | nc: 148 | ncall: 600342 | eff(%):  3.810 | loglstar:   -inf < -13670.159 <    inf | logz: -13713.353 +/-  8.960 | dlogz: 401.066 >  0.100]../../exostriker/lib/dynesty_1_2/sampling.py:243: UserWarning: Random walk proposals appear to be extremely inefficient. Adjusting the scale-factor accordingly.
  warnings.warn("Random walk proposals appear to be "
../../exostriker/lib/dynesty_1_2/sampling.py:243: UserWarning: Random walk proposals appear to be extremely inefficient. Adjusting the scale-factor accordingly.
  warnings.warn("Random walk proposals appear to be "
../../exostriker/lib/dynesty_1_2/sampling.py:221: UserWarning: Random number generation appears to be extremely inefficient. Adjusting the scale-factor accordingly.
  warnings.warn("Random number generation appears to be "
22990it [44:01,  3.23s/it, batch: 0 | bound: 283 | nc: 25 | ncall: 611052 | eff(%):  3.759 | loglstar:   -inf < -13655.916 <    inf | logz: -13698.388 +/-  8.960 | dlogz: 445.853 >  0.100]
```

As you can see, it works, then it hits the warning, but after some time it starts sampling again. Often, however, it emits the UserWarning endlessly and the sampler gets stuck. This mostly happens at the final stages, when it constructs the posteriors. The only workaround I have found is to note the ncall shortly before it stalls and restart with maxcall set to that value, but this is very time-consuming. Some of my setups run for days on a 40-CPU machine, so you can imagine the frustration when I have to restart a run.
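
For reference, a hedged sketch of a less manual version of that workaround: dynesty 1.2 added checkpointing via run_nested(checkpoint_file=...) and a restore() class method, so a stalled job can be killed and resumed instead of watching ncall and resetting maxcall by hand. The model pieces below are placeholders, and the exact argument names should be checked against the installed version's documentation.

```python
import numpy as np
import dynesty

ndim = 7  # illustrative; the actual dimensionality isn't given above

def prior_transform(u):
    return 10.0 * u - 5.0  # placeholder flat priors

def loglike(x):
    return -0.5 * np.sum(x**2)  # placeholder likelihood

sampler = dynesty.DynamicNestedSampler(loglike, prior_transform, ndim,
                                       bound='multi', sample='rwalk')

# Dynamic run weighted entirely toward the posterior (pfrac = 1.0), with an
# initial-stage stopping threshold of 0.1 and periodic checkpoints.
sampler.run_nested(dlogz_init=0.1, wt_kwargs={'pfrac': 1.0},
                   checkpoint_file='dynesty.save')

# After killing a stuck job, resume from the last checkpoint:
sampler = dynesty.DynamicNestedSampler.restore('dynesty.save')
sampler.run_nested(resume=True)
```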

Also, for my problem(s) I find that rwalk is the fastest and most efficient sampling option, thus I would really appreciate help with this.

segasai commented 2 years ago

@3fon3fonov You are clearly not using the latest released version, as version 1.2.3 does not have that warning.

3fon3fonov commented 2 years ago

@segasai, I must admit this is true! I must have unintentionally reverted to an older dynesty version; I realized this after I posted above. Updating to 1.2.3 seems to be working fine now, but I am still experimenting. Let's see how this goes :) Otherwise, sorry for the spam!