Closed: segasai closed this 3 years ago
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| py/dynesty/dynamicsampler.py | 22 | 23 | 95.65% |
| Total: | 23 | 24 | 95.83% |

| Totals | |
|---|---|
| Change from base Build 839449013: | -0.01% |
| Covered Lines: | 3364 |
| Relevant Lines: | 4679 |
Okay, I just merged in the other two PRs. I'll get to reviewing this one first thing in the morning. Thanks!
Okay, I've checked through this. It's not quite right, but I think there might be some changes that can get around this thanks to the much cleaner internals.
One of the biggest constraints with the initial set of live points is that they need to be uniformly distributed in the prior volume, subject to the condition L > Lmin. With normal nested sampling this isn't a big deal: since Lmin = 0, you just sample from the prior to start. For each batch that starts with Lmin > 0, however, this can be non-trivial. The way I tried to get around this in the previous case was in 2 steps:

1. Recover the `nbase` live points of the base run at the L > Lmin boundary; the live points there now are uniformly distributed within the prior subject to L > Lmin, as desired.
2. Use those to propose the new `nbatch` live points. Since these were proposed using a set of points that were uniformly distributed within the prior volume, they also should be uniformly distributed.

The new commit just takes the first `nbase` dead points from the combined run. As the dead points are sorted in order of increasing log-likelihood, they will be preferentially distributed closer to the Lmin boundary rather than uniformly within the prior volume. Although these are only used to propose new live points, their use in constructing new bounds and the non-uniform starting positions may bias the distribution of the new `nbatch` live points.
The fix is that we should be able to actually just randomly draw a subset of `nbatch` points from the full collection of saved dead points, based on their estimated position with respect to the total interior log-volume (as opposed to just taking the first `nbatch` entries). Given we're hoping to resample points to be ~uniformly distributed (doesn't have to be perfect, just more accurate than the current implementation in this PR), I think the weights should just be proportional to the corresponding differential volume dX_i that each point occupies (with X=0 and X=1 appended on the edges).
Does that explanation and fix sound correct to you?
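As a minimal sketch of this weighting (not dynesty's actual code; `saved_logvol` here is a stand-in for the saved run's estimated log prior volumes, ordered by increasing log-likelihood), the differential volumes with X=1 and X=0 appended at the edges could look like:

```python
import numpy as np

def dx_weights(saved_logvol):
    """Weights proportional to the differential prior volume dX_i each dead
    point occupies, with X = 1 and X = 0 appended at the edges."""
    x = np.exp(saved_logvol)               # estimated prior volumes, decreasing
    x = np.concatenate(([1.0], x, [0.0]))  # append the edges
    dx = 0.5 * (x[:-2] - x[2:])            # symmetric shell width around each point
    return dx / dx.sum()                   # normalized sampling weights
```

A subset of `nbatch` starting points could then be drawn with something like `np.random.choice(len(w), size=nbatch, p=w)`.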
Oh, also with respect to the `vol` comments:
Those were intended to prevent the bounding distribution from being too small by imposing an implicit minimum volume associated with each live point. The pre-existing scheme set that size based on the number of effective iterations that would have passed, so your proposed fix based on `subset[0]` would probably work just fine. (It wouldn't quite work if we defined `subset` based on the random weighted sub-sampling scheme above, though.)
I've personally found that many of the other bounding checks that are now present mean that, in practice, passing this argument is effectively pointless, so we could just choose to set it to 0 and avoid this whole hassle. If we do want to keep things internally consistent, though, we could probably just pull directly from the saved prior volume corresponding to `subset[0]`, now that those are easy to access and that's more or less what we want anyway.
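A rough illustration of that last option (purely a sketch under assumed names: `saved_logvol`, `subset`, and `nlive_new` are placeholders, and dividing by the number of new live points is a guess at a sensible per-point floor, not dynesty's actual scheme):

```python
import numpy as np

def point_volume(saved_logvol, subset, nlive_new):
    """Implicit minimum volume per live point, taken from the saved prior
    volume at the first selected point rather than from an effective
    iteration count. Returning 0 instead would disable the floor entirely."""
    vol_at_boundary = np.exp(saved_logvol[subset[0]])  # prior volume at the L > Lmin boundary
    return vol_at_boundary / nlive_new
```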
> Does that explanation and fix sound correct to you?
Yes, I see the point. I didn't understand the uniformity requirement. I agree that some weighted subsampling from the saved run should give that. I can't quickly verify what those weights should be (I need to think carefully).
Regarding the minimum volume, okay, let's keep it as `subset[0]`. But it's probably better to get rid of it at some point. So far, all of the minimum-volume code I've seen has looked like black magic to me (i.e. the logic behind it wasn't quite clear).
> I agree that some weighted subsampling from the saved run should give that. I can't quickly verify what those weights should be (I need to think carefully).
Right. Each point is nominally located at a particular log-volume log X_i, so it should be possible to reweight and sample from them, but it wasn't immediately obvious to me either.
> Regarding the minimum volume, okay, let's keep it as `subset[0]`. But it's probably better to get rid of it at some point.
Agreed. It's a holdover from early development building off `nestle` that doesn't do much anymore and has the potential to cause more issues than it solves.
It looks like if you have a posterior sample subject to the L > Lmin condition, we need to resample with weights 1/L_i (i.e. log-weights of -logL) to get a uniform distribution within the prior, restricted to L > Lmin. However, since the points in the run are not a sample of the posterior until they are weighted by logwt, presumably to obtain our required uniform sample from the points in the run we need to use log-weights of logwt - logl. Does that look correct?
I also wonder if we need to worry about repeats in that sample.
Since logwt is (to first order) logL + logdX, that would give a final log-weight of roughly logdX, or weights proportional to dX, which lines up with my general expectations. So `logwt - logl` seems correct to me.
If we want to avoid repeats, something like `np.random.choice(idx, p=wt, replace=False)` would probably be good enough.
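Putting those pieces together, a minimal sketch of this scheme (under assumed names: `saved_logl`, `saved_logwt`, `logl_min`, and `nbatch` are stand-ins for the saved run's arrays and the batch parameters, not dynesty's actual variables):

```python
import numpy as np

def draw_batch_starting_points(saved_logl, saved_logwt, logl_min, nbatch,
                               rstate=np.random):
    """Draw nbatch dead points ~uniformly distributed in the prior volume
    subject to L > Lmin, by resampling with weights exp(logwt - logl) ~ dX."""
    # Keep only the dead points above the likelihood boundary.
    above = np.nonzero(saved_logl > logl_min)[0]
    # log-weight = logwt - logl ~ log dX (posterior weight with the
    # likelihood factor divided out).
    logw = saved_logwt[above] - saved_logl[above]
    wt = np.exp(logw - logw.max())
    wt /= wt.sum()
    # Weighted draw; avoid repeats when there are enough points to do so.
    replace = len(above) < nbatch
    return rstate.choice(above, size=nbatch, p=wt, replace=replace)
```

With heavily skewed weights, `replace=False` can still fail if too few entries have non-zero probability, so a fallback to `replace=True` (as above) or simple de-duplication might be needed in practice.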
Okay, I've submitted the fix. I also set the point volume to zero, because I couldn't really justify any particular number in my head.
Okay, this looks good and passes all the tests, so I'm happy to merge it in.
This is a possible fix for #244, but the correctness needs to be verified (the tests do pass). Specifically, I don't understand the logic behind the volume determination. Also, the previous logic was, I think, too complicated, relying on `base_ids`; I didn't see the reason for that, so I just selected all the points above the lowest boundary from `saved_run`.