Agreed on both counts.
Most of these failures disappear if I put in bootstrap, so I think we should consider using bootstrap (only) when the sampler is unif, because only then is it essential that the boundary is correct. Alternatively, I was thinking of computing an expansion factor that depends on the dimension and the number of points.
After that I looked at the code and saw this https://github.com/joshspeagle/dynesty/blob/08b2b8d46a7b173bc8049e9d865701025631e142/py/dynesty/dynamicsampler.py#L442 where we allegedly already do this. But because of this https://github.com/joshspeagle/dynesty/blob/08b2b8d46a7b173bc8049e9d865701025631e142/py/dynesty/dynesty.py#L484 and the fact that the default is bootstrap=0, this was never activated... So I'll change the default bootstrap to None, which will allow us to set it automatically (i.e. None will mean automatic).
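A minimal sketch of what that could look like (the helper name and the heuristic are illustrative assumptions on my part, not dynesty's actual logic):

```python
# Hypothetical sketch only: resolve bootstrap=None to an automatic value.
def resolve_bootstrap(bootstrap, sample):
    """Return the number of bootstrap realizations to use.

    None means "decide automatically": enable bootstrapping only for the
    uniform sampler, where a correct boundary is essential; an explicit
    integer (including 0, which disables it) is passed through unchanged.
    """
    if bootstrap is not None:
        return bootstrap
    return 20 if sample == 'unif' else 0
```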
I don't think this should be closed yet. I think it is not a blocker issue anymore, but I've run the test suite in a loop over seeds from 0 to 99, and here are the results:
```
 3 FAILED tests/test_dyn.py::test_dyn
 4 FAILED tests/test_ellipsoid.py::test_sample
 1 FAILED tests/test_ellipsoid.py::test_sample_q
 9 FAILED tests/test_gau.py::test_bounding_sample[none-rslice]
11 FAILED tests/test_gau.py::test_bounding_sample[none-rstagger]
11 FAILED tests/test_gau.py::test_bounding_sample[none-rwalk]
10 FAILED tests/test_gau.py::test_bounding_sample[none-slice]
 5 FAILED tests/test_gau.py::test_dynamic
 2 FAILED tests/test_misc.py::test_exc
 7 FAILED tests/test_periodic.py::test_periodic
```
I don't think any of those are actual bugs, although it's weird that so many tests involving the `none` bound or `test_periodic` fail. Ideally we shouldn't get any failures here... Also, `test_gau` (which was the first test written) uses a logz tolerance of 1 rather than `logz_err`, so it's not really a full test. It'd be nice to fix these things eventually.
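For reference, the stricter check would look roughly like this (a sketch; `res` is a dynesty results object, `logz_true` is the analytic evidence for the test problem, and the 5-sigma threshold is illustrative):

```python
import numpy as np

def check_logz(res, logz_true, nsigma=5):
    # compare the final evidence estimate against the truth in units of
    # its own estimated uncertainty, instead of a fixed tolerance of 1
    logz = res.logz[-1]
    logzerr = res.logzerr[-1]
    assert np.abs(logz - logz_true) < nsigma * logzerr
```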
- The failures with `none-[method]` in `test_gau.py` are likely due to the fact that the Gaussian has large off-diagonal elements; since the proposals don't take advantage of the auto-correlation structure, they likely are strongly correlated and likely to over-estimate the evidence. Increasing the number of `walks` or `slices` probably should resolve those (see the sketch after this list). Alternately, I don't think those combinations are all that useful since `none` should ideally never be used with those settings (except implicitly by setting `ncdim`); we could consider removing them.
- I suspect the `test_dyn` and `test_periodic` tests would be resolved if the default number of bootstraps was increased. We're currently using 5 IIRC, whereas in Buchner (2014) the recommended number of bootstraps is substantially higher (on order 50). I think `dynesty` used to use 20-ish by default?
- I agree the `test_dynamic` in `test_gau` looks like it should use `logz_err` rather than just a tolerance of 1. It also calls `simulate_run`, which should be removed and deprecated. Also, the tests for `jitter_run` and `resample_run` should probably be changed to instead use the two functions to estimate `logz_err`, instead of just passing the random realization to `check_results_gau`.
- What is happening with the failed `test_exc` and `test_ellipsoid` tests? Any ideas?
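For concreteness, bumping up `walks` for the random-walk sampler (or `slices` for the slice samplers) looks like this; the value 50 and the toy correlated-Gaussian problem are illustrative, not the actual test setup:

```python
import numpy as np
from dynesty import NestedSampler

ndim = 3
# a deliberately correlated Gaussian likelihood (large off-diagonal terms)
C = np.full((ndim, ndim), 0.95) + 0.05 * np.eye(ndim)
Cinv = np.linalg.inv(C)

def loglike(x):
    return -0.5 * x @ Cinv @ x

def ptform(u):
    return 20.0 * u - 10.0  # flat prior on [-10, 10]^ndim

# walks sets the number of random-walk steps per proposed point;
# more steps decorrelate the proposals at extra cost
sampler = NestedSampler(loglike, ptform, ndim,
                        bound='none', sample='rwalk', walks=50)
sampler.run_nested()
print(sampler.results.logz[-1], sampler.results.logzerr[-1])
```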
> - The failures with `none-[method]` in `test_gau.py` are likely due to the fact that the Gaussian has large off-diagonal elements; since the proposals don't take advantage of the auto-correlation structure, they likely are strongly correlated and likely to over-estimate the evidence. Increasing the number of `walks` or `slices` probably should resolve those. Alternately, I don't think those combinations are all that useful since `none` should ideally never be used with those settings (except implicitly by setting `ncdim`); we could consider removing them.
I guess the goal of adding those was to have at least some code coverage for all of these combinations, but we may consider moving these tests to some simpler problem...
> - I suspect the `test_dyn` and `test_periodic` tests would be resolved if the default number of bootstraps was increased. We're currently using 5 IIRC, whereas in Buchner (2014) the recommended number of bootstraps is substantially higher (on order 50). I think `dynesty` used to use 20-ish by default?
I kinda thought 20 was overkill, but maybe we need to set it that high.
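For context, the bootstrap expansion estimate being discussed works roughly like this (a sketch for a single ellipsoid; `fit_ellipsoid` is a hypothetical stand-in for dynesty's internal fit, assumed to return an object whose `distance(x) <= 1` means x lies inside):

```python
import numpy as np

def bootstrap_expand(points, fit_ellipsoid, nboot=50, rstate=None):
    # Estimate how much a fitted ellipsoid must be expanded so that points
    # left out of a bootstrap resample are still covered; more bootstrap
    # realizations make the max over realizations more reliable, which is
    # why 5 can be too few while ~50 (Buchner 2014) works better.
    rstate = rstate or np.random.default_rng()
    npts = len(points)
    expand = 1.0
    for _ in range(nboot):
        idx = rstate.integers(npts, size=npts)         # resample with replacement
        left_out = np.setdiff1d(np.arange(npts), idx)  # points never drawn
        if len(left_out) == 0:
            continue
        ell = fit_ellipsoid(points[idx])
        expand = max(expand, max(ell.distance(points[i]) for i in left_out))
    return expand
```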
> - I agree the `test_dynamic` in `test_gau` looks like it should use `logz_err` rather than just a tolerance of 1. It also calls `simulate_run`, which should be removed and deprecated. Also, the tests for `jitter_run` and `resample_run` should probably be changed to instead use the two functions to estimate `logz_err`, instead of just passing the random realization to `check_results_gau`.
I agree these need to be updated. I initially coded that in a bit of a rush, plus I wanted to speed up the tests, as initially they were doing multiple runs. Also, `resample_run` is partially there for code coverage.
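Something like the following is probably what's meant by using the two functions to estimate `logz_err` (a sketch; `res` is a dynesty results object, nrepeat=100 is illustrative, and I'm assuming the `rstate` keyword of `jitter_run`/`resample_run`):

```python
import numpy as np
from dynesty import utils as dyutils

def logz_scatter(res, func=dyutils.jitter_run, nrepeat=100, seed=0):
    # estimate the logz uncertainty from the scatter over many jitter_run
    # (or resample_run) realizations, instead of checking a single random
    # realization against the truth
    rstate = np.random.default_rng(seed)
    logzs = [func(res, rstate=rstate).logz[-1] for _ in range(nrepeat)]
    return np.mean(logzs), np.std(logzs)
```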
> - What is happening with the failed `test_exc` and `test_ellipsoid` tests? Any ideas?

`test_ellipsoid` is harmless, I think.
It fails like this:

```
>       assert ((pval > 0.003) & (pval < 0.997))
E       assert (0.00016418660411632424 > 0.003 & 0.00016418660411632424 < 0.997)
```

This should fail by chance in ~0.6% of cases, which is not that far from the ~2.5% failure rate we see, so I was planning to relax the p-value thresholds.
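The arithmetic behind those numbers (the 2.5% presumably being the 4 + 1 ellipsoid-test failures over 100 seeds and two tests — my reading, not stated explicitly above):

```python
# under the null hypothesis pval is uniform on [0, 1], so the two-sided
# window (0.003, 0.997) fails purely by chance with probability
p_fail = 0.003 + (1 - 0.997)      # = 0.006, i.e. ~0.6% per test
# observed: 4 failures of test_sample + 1 of test_sample_q over 100 seeds
observed = (4 + 1) / (100 * 2)    # = 0.025, i.e. 2.5%
print(p_fail, observed)
```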
On `test_exc`, it's just that this code

```python
import numpy as np

class MyException(Exception):   # defined in tests/test_misc.py
    pass

def loglike_exc(x):
    r2 = np.sum(x**2)
    if r2 < 0.01:               # exception region: a tiny ball around the origin
        raise MyException('ooops')
    return -0.5 * r2
```

didn't trigger the exception, so we need to bump up the 0.01 threshold.
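To see why it can fail to trigger, here is a quick estimate of how likely a sample is to land in the exception region (a sketch; the flat prior on [-10, 10]^ndim and ndim=3 are assumptions for illustration, not necessarily the actual test setup):

```python
import numpy as np

def p_hit(thresh, ndim=3, half_width=10.0, nsamp=1_000_000, seed=0):
    # Monte Carlo estimate of the prior-mass fraction with r2 < thresh;
    # during a run the live points concentrate near the peak, so this is
    # a lower bound on how often the region is actually visited
    rng = np.random.default_rng(seed)
    x = rng.uniform(-half_width, half_width, size=(nsamp, ndim))
    return np.mean(np.sum(x**2, axis=1) < thresh)

print(p_hit(0.01))  # ~5e-7 analytically: almost never sampled early in a run
print(p_hit(1.0))   # ~5e-4: bumping the threshold makes triggering far more likely
```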
As a final push to close all the issues, I've made PR #316, which, on top of other fixes, includes my diagnostic code for the volume performance. This code first shows that there have been drastic improvements in the performance of the multi-ellipsoidal approximation thanks to https://github.com/joshspeagle/dynesty/pull/286
Here is the performance without bootstrap (this shows the fraction of points missed by our multi-ellipsoidal representation, in 2d and 10d, as a function of nlive).
This looks sensible-ish to me. But what's weird is the low-dimensional (2d) case: bootstrap (even bootstrap=20) fails to work for ~100 live points. See here. And this is an example of the problematic case: blue are the 104 live points, and red are other samples from the shell.
And for some reason here, even with bootstrap=20, we are missing ~40% of the volume. This is the code:

```python
import numpy as np
import test_volume  # the diagnostic module added in #316

rstate1 = np.random.default_rng(1)
rstate2 = np.random.default_rng(2)
A, volume_in, live_points, extra_samples, nell = test_volume.doit(
    104, 'shell', 2, 'multi', bootstrap=20,
    rstate_data=rstate1, rstate_dyn=rstate2)
```

In this case we are 'shredding' the 104 points into 15 ellipsoids. I think overall the new ellipsoidal splitting works very well, but I'm still puzzled why we are missing so much volume here and why bootstrapping doesn't fix it...
Further testing revealed that if I bump bootstrap up to 100, that does resolve the issue with the shell shown above (and that shows that the distribution of bootstrap factors may be extremely asymmetric), which led to another solution that I tried; it improves the failure rate by a factor of a few. It's cross-validation instead of bootstrap: i.e. split the points into `bootstrap` groups, fit using all but one group, and then test on the left-out group. I think that's better because at least all the points are guaranteed to appear in the test dataset. In my case, the default bootstrap=5 for the shell in 2d leads to 8% of cases missing more than 10% of the volume, while the cv approach seems to reduce that to 1.3% of cases. So I'm intending to implement that (a sketch of the scheme follows below).
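A minimal sketch of that cross-validated expansion estimate (assuming, as above, a hypothetical `fit_bound` whose `distance(x) <= 1` means x lies inside the fitted bound):

```python
import numpy as np

def cv_expand(points, fit_bound, ngroups=5, rstate=None):
    # K-fold variant of the bootstrap expansion estimate: every point is
    # guaranteed to appear in exactly one held-out test set
    rstate = rstate or np.random.default_rng()
    groups = rstate.permutation(len(points)) % ngroups  # random balanced split
    expand = 1.0
    for k in range(ngroups):
        bound = fit_bound(points[groups != k])          # fit on all but group k
        # expansion needed so the held-out group is fully covered
        expand = max(expand, max(bound.distance(x) for x in points[groups == k]))
    return expand
```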
The systematic sweep over ndim/shapes/nlive values showed that the cv-like scheme produces somewhat fewer big outliers, but more small-scale volume underestimates, so for the moment the idea is abandoned. Otherwise #316 is ready to be reviewed/merged.
Thanks for looking into this. I'll look into merging #316 soon then.
As a test of the latest changes, I've run the test suite in a loop with different seeds, and, for example, for these tests there are ~15% failures like this, caused by deviations larger than 5 sigma.
Presumably there are more. It's also probably worth rerunning the test from https://github.com/joshspeagle/dynesty/issues/289. I haven't yet decided what the strategy is here; my preferred strategy is to update the defaults so that we still fall within 5 sigma, rather than bumping up the thresholds (in fact, there are already a few places in the tests where the thresholds are pretty large).