Closed segasai closed 2 years ago
This is a good suggestion and overall would definitely be a good step towards improving stability and performance. For context, the current strategy is to construct the bounding distributions from scratch each time using a "recursive splitting k-means" approach. This ensures repeatability by starting from the same place each time, but obviously throws out information from the previous set of bounding ellipsoids. One of the biggest barriers to incorporating previous information into the current scheme is that it only includes steps to further split bounds, not to merge them.
With respect to the things you're suggesting:
Just noting that the recent fix merged as a part of #248 tangentially touches this issue (but doesn't resolve it).
Yes, I also thought that the new add_batch can potentially discover new posterior features using all of the history accumulated up to that point. My only worry is its potentially quadratic behaviour: add_batch relies on the whole history, which grows over time.
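The quadratic worry can be made concrete with a toy cost model (this is a hypothetical sketch, not dynesty code; `total_history_work` is an invented name). If each new batch triggers a pass over the entire accumulated history, the total work grows quadratically in the number of samples rather than linearly:

```python
def total_history_work(batch_sizes):
    """Toy model: each batch reprocesses the full accumulated history.

    Returns the total number of sample-visits, which grows
    quadratically with the number of batches of fixed size.
    """
    total = 0
    history = 0
    for b in batch_sizes:
        history += b      # history grows by one batch
        total += history  # cost of one pass over the whole history
    return total
```

For example, ten batches of 100 samples cost 5500 sample-visits under this model, versus 1000 if each batch were processed only once.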
I'll close this and keep #237 as the main issue for the bounding-ellipsoids discussion.
Hi Josh,
This is definitely not a bug, just some thoughts on sampling issues I've run into recently.
The problem I've been struggling with over the last few months features a very complicated posterior in a high-ish (11) dimensional space, with both very narrow and very broad features as well as multiple modes. A constant struggle is missing parts of the posterior, either through missed modes or through incorrect approximation by the bounding ellipsoids. The default way of dealing with this is to use a large number of live points, but obviously the code doesn't really scale well beyond a few thousand. An alternative would be doing multiple runs and then merging them, but as far as I understand that's not really correct if the different runs discover/miss different areas of the posterior.
Also, an annoying feature of doing multiple nested runs is that the bounding information from one set of runs is completely ignored by future runs, which seems like a waste. And as far as I understand, the rejected samples from a run could also be used to refine the bounding information, since the number of rejected samples is much larger than the number of accepted ones. At the very least it would be good to be able to verify, using all the samples from a run (accepted and rejected), that the bounds are correct, or ideally to adjust them.
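One way to picture the verification step: given a set of bounding ellipsoids (centre plus inverse covariance), check what fraction of high-likelihood samples they actually cover. This is a minimal self-contained sketch, not dynesty's implementation; the function names `in_ellipsoid` and `bound_coverage` are invented for illustration:

```python
def in_ellipsoid(x, center, inv_cov):
    """True if point x satisfies (x - c)^T A (x - c) <= 1
    for an ellipsoid with centre c and shape matrix A (inverse covariance)."""
    d = [xi - ci for xi, ci in zip(x, center)]
    n = len(d)
    q = sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))
    return q <= 1.0

def bound_coverage(points, ellipsoids):
    """Fraction of points lying inside at least one ellipsoid.

    A coverage noticeably below 1 for samples above the current
    likelihood threshold would flag bounds that miss posterior mass.
    """
    covered = sum(
        1 for p in points
        if any(in_ellipsoid(p, c, a) for c, a in ellipsoids)
    )
    return covered / len(points)
```

With a unit circle at the origin (`inv_cov` = identity), the point (0.5, 0) is inside and (2, 0) is outside, so those two points give a coverage of 0.5. Scaling this naively to a million points and many ellipsoids is O(points × ellipsoids × dim²), which is exactly the scaling concern raised above.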
I don't have a concrete plan here, but it would be nice to have:

1) The list of all likelihood evaluations (points and likelihood values) from a run, available for inspection afterwards. I don't think dynesty preserves that; doing so would probably require storing it on disk.
2) A way of verifying the bounds using all the samples and the bounds from the run. I think that functionality kind of exists, but I'm not sure it scales well to a million points or is easily accessible. That would also be a good general sanity check of the bounds.
3) After-the-fact updating of the bounds using the large number of samples collected in 1).
4) Running dynesty with the already-defined bounds updated in 3). I don't quite know if that's possible already, but I think that's what batch runs are doing, AFAIU.
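Point 1) can be approximated today without any dynesty changes by wrapping the log-likelihood so that every evaluation is recorded. A minimal sketch (the class name `LoggedLikelihood` and the JSON-lines on-disk format are my own choices, not a dynesty API):

```python
import json

class LoggedLikelihood:
    """Wrap a log-likelihood callable so every evaluation is recorded.

    Keeps (point, logl) pairs in memory and optionally appends each
    evaluation as a JSON line to a file, so the full evaluation history
    (accepted and rejected points alike) survives the run.
    """

    def __init__(self, loglike, path=None):
        self.loglike = loglike
        self.history = []  # list of (point, logl) pairs
        self.path = path

    def __call__(self, x):
        logl = self.loglike(x)
        self.history.append((list(x), logl))
        if self.path is not None:
            # append-only log on disk; one JSON record per evaluation
            with open(self.path, "a") as f:
                f.write(json.dumps({"x": list(x), "logl": logl}) + "\n")
        return logl
```

The wrapper is passed to the sampler in place of the raw likelihood; afterwards, `history` (or the on-disk log) holds every evaluation for the kind of bound verification described in 2). The main caveat is memory/disk growth for millions of evaluations, which is why an on-disk format is probably the right default.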
It would be interesting to hear what you think of this, and whether trying to implement some of this functionality would be useful.