joshspeagle / dynesty

Dynamic Nested Sampling package for computing Bayesian posteriors and evidences
https://dynesty.readthedocs.io/
MIT License

Thoughts on reusing/improving the bounding information #232

Closed by segasai 2 years ago

segasai commented 3 years ago

Hi Josh,

This is definitely not a bug, just some thoughts on a sampling issue I ran into recently.

The problem that I've been struggling with over the last few months features a very complicated posterior in a high-ish (11) dimensional space, with both very narrow and very broad features as well as multiple modes. A constant struggle I have with this is missing parts of the posterior, due either to missed modes or to incorrect approximation by the bounding ellipsoids, etc. The default way of dealing with this is to use a large number of live points, but obviously the code doesn't scale well beyond a few thousand. An alternative would be to do multiple runs and then merge them, but as far as I understand that's not really correct if the different runs discover/miss different areas of the posterior.
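For concreteness, the multiple-runs-plus-merge workaround looks roughly like this (a minimal sketch with a toy Gaussian likelihood; `dynesty.utils.merge_runs` combines the weighted samples, but it cannot recover modes that every individual run missed):

```python
import numpy as np
import dynesty
from dynesty import utils as dyutils

ndim = 3

def loglike(x):
    # toy isotropic Gaussian likelihood
    return -0.5 * np.sum(x**2)

def prior_transform(u):
    # map the unit cube onto a [-10, 10]^ndim uniform prior
    return 20.0 * u - 10.0

runs = []
for _ in range(4):
    sampler = dynesty.NestedSampler(loglike, prior_transform, ndim, nlive=500)
    sampler.run_nested(print_progress=False)
    runs.append(sampler.results)

# merge_runs stitches the independent runs into one set of weighted samples;
# as noted above, this is only strictly valid when every run explores the
# same set of posterior modes.
combined = dyutils.merge_runs(runs)
```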

Another annoying feature of doing multiple nested runs is that the bounding information from one set of runs is completely ignored by future runs, which seems like a waste. Also, as far as I understand, the rejected samples from a run could be used to refine the bounding information, since the number of rejected samples is >> the number of accepted samples. At the very least it would be good to be able to verify, using all the samples in the run (accepted and rejected), that the bounds are correct, or ideally to adjust the bounds.

I don't have a concrete plan for this, but it would be nice to have:

  1. The list of all likelihood evaluations from a run (points plus likelihood values) available to check after the run. I don't think dynesty preserves that; doing so would probably require storing it on disk.
  2. A way of verifying the bounds using all the samples and bounds from the run. I think the functionality for that kind of exists, but I'm not sure it scales well to a million points or is easily accessible. It would also be a good general sanity check of the bounds.
  3. After-the-fact updating of the bounds using the large number of samples collected in (1).
  4. Running dynesty with bounds already defined/updated from (3). I don't quite know if that's possible already, but I think that's what batch runs are doing, AFAIU.
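A user-side workaround for (1), while nothing is built in, is to wrap the likelihood so that every evaluation (accepted or rejected) gets logged. A rough, untested sketch, reusing the toy `loglike`/`prior_transform`/`ndim` placeholders from the earlier snippet (the class and file name are illustrative, not a dynesty feature):

```python
import numpy as np
import dynesty

class RecordingLogLike:
    """Wrap a likelihood so every evaluation (accepted or rejected) is kept."""

    def __init__(self, loglike, path='logl_history.npz', flush_every=100000):
        self.loglike = loglike
        self.path = path
        self.flush_every = flush_every
        self.points, self.logls = [], []

    def __call__(self, x):
        logl = self.loglike(x)
        self.points.append(np.asarray(x, dtype=float).copy())
        self.logls.append(logl)
        if len(self.logls) % self.flush_every == 0:
            self.flush()  # periodically dump the accumulated history to disk
        return logl

    def flush(self):
        np.savez(self.path, points=np.asarray(self.points),
                 logl=np.asarray(self.logls))

# note: this only works in serial mode; with a multiprocessing pool each
# worker would accumulate its own private copy of the history
wrapped = RecordingLogLike(loglike)
sampler = dynesty.NestedSampler(wrapped, prior_transform, ndim, nlive=500)
```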

It would be interesting to see what you think of this and whether trying to implement some of this functionality would be useful.

S
joshspeagle commented 3 years ago

This is a good suggestion and would definitely be a step towards improving stability and performance. For context, the current strategy is to construct the bounding distributions from scratch each time using a "recursive splitting k-means" approach. This ensures repeatability by starting from the same place each time, but obviously throws out information from the last set of bounding ellipsoids. One of the biggest barriers to incorporating previous information into the current scheme is that it only includes steps to further split bounds, never to merge them.
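To make the split-only behaviour concrete, the logic is roughly the following (a toy illustration only, not dynesty's actual implementation; cluster sizes and thresholds here are made up):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def fit_ellipsoid(points):
    """Center, covariance, and a log-volume proxy for a point cloud."""
    ctr = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)
    logvol = 0.5 * np.linalg.slogdet(cov)[1]  # up to a dimension-only constant
    return ctr, cov, logvol

def recursive_split(points, min_points=30, shrink=0.7):
    """Keep splitting a cloud with 2-means while the split shrinks total volume."""
    if len(points) < 2 * min_points:
        return [points]
    _, _, parent_logvol = fit_ellipsoid(points)
    _, labels = kmeans2(points, 2, minit='points')
    groups = [points[labels == k] for k in (0, 1)]
    if any(len(g) < min_points for g in groups):
        return [points]
    child_logvols = [fit_ellipsoid(g)[2] for g in groups]
    # accept the split only if the two children together are markedly smaller
    if np.logaddexp(*child_logvols) < parent_logvol + np.log(shrink):
        return [leaf for g in groups
                for leaf in recursive_split(g, min_points, shrink)]
    return [points]
```

Note that the recursion only ever splits clusters; there is no step that merges neighbouring ellipsoids back together, which is the barrier mentioned above.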

With respect to the things you're suggesting:

  1. A record of all the samples was something that we considered saving since the earliest stages of development, due to the possibility of using importance nested sampling (from Feroz et al. 2013), but ended up deciding against for some of the reasons you mention. This could be implemented as an option that users would have to enable, with the record then saved to disk.
  2. In theory, I agree that functionality where you use (1) all likelihood evaluations rather than just the set of current live points and (2) the previous set(s) of bounds would be nice. There are many instances where starting from the previous bounds, especially at later stages of sampling, is a great strategy. The ability to use the rejected/past accepted points to further refine the bounds should also lead to overall stability improvements. The biggest obstacle here would be making sure this process lets the bounds maintain or even reduce their overall complexity, to avoid "shredding" the posterior over time (e.g., by only ever increasing the number of ellipsoids during a run). I think you're right that there are some basic procedures that should allow for this, and ways for this to be integrated into the objects/framework already present in the code, but I'm not sure how much work it would be to implement. A rough after-the-fact containment check is sketched below.
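As a starting point for that check, something along these lines might work. It is a rough sketch that assumes a run saved with `save_bounds=True`, so that the `Results` object carries `'bound'`, `'bound_iter'` and `'samples_u'`, and that each saved bound exposes a `.contains(u)` method (as the ellipsoid-based bounds do); it only uses the dead points, not every rejected evaluation:

```python
import numpy as np

def bound_miss_fractions(results):
    """For each saved bound, the fraction of strictly later (higher-likelihood)
    dead points that fall outside it; persistently large values hint that a
    bound clipped off part of the constrained prior."""
    bounds = results['bound']                       # saved bounding objects
    bound_iter = np.asarray(results['bound_iter'])  # bound index per dead point
    samples_u = np.asarray(results['samples_u'])    # dead points (unit cube)
    fracs = []
    for j, bnd in enumerate(bounds):
        # points proposed from bound j itself are inside it by construction,
        # so only look at points proposed from later bounds
        later = samples_u[bound_iter > j]
        if len(later) == 0:
            fracs.append(0.0)
            continue
        fracs.append(np.mean([not bnd.contains(u) for u in later]))
    return np.asarray(fracs)
```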
joshspeagle commented 3 years ago

Just noting that the recent fix merged as a part of #248 tangentially touches this issue (but doesn't resolve it).

segasai commented 3 years ago

Yes, I also thought that the new add_batch can potentially discover new posterior features using all of the history accumulated up to that point. My only worry is its possibly quadratic behaviour: add_batch relies on the whole history, which grows over time.
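For reference, the usage pattern being discussed looks roughly like this (a sketch only, assuming the `maxbatch`/`nlive_new` parameter names of recent dynesty versions and the toy `loglike`/`prior_transform`/`ndim` placeholders from above):

```python
from dynesty import DynamicNestedSampler

dsampler = DynamicNestedSampler(loglike, prior_transform, ndim)
dsampler.run_nested(nlive_init=500, maxbatch=0)   # baseline run only
for _ in range(10):
    # each call re-examines the full accumulated run when deciding where to
    # place the new batch of live points, so the per-batch cost grows with
    # the length of the history
    dsampler.add_batch(nlive_new=250)
results = dsampler.results
```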

segasai commented 2 years ago

I'll close this and keep #237 as the main issue for the bounding-ellipsoid discussion.