joshspeagle / dynesty

Dynamic Nested Sampling package for computing Bayesian posteriors and evidences
https://dynesty.readthedocs.io/
MIT License

Failures in tests #310

Closed: segasai closed this issue 3 years ago

segasai commented 3 years ago

As a test of the latest changes, I've run the test suite in a loop with different seeds.

For example, running these tests:

for a in `seq 0 35`; do env DYNESTY_TEST_RANDOMSEED=$a PYTHONPATH=py:tests:$PYTHONPATH pytest tests/test_dyn.py > /tmp/log.${a} & done

about 15% of the runs fail like the one below, because the logz deviation from the truth is larger than 5 sigma:

============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /home/skoposov/curwork/dynesty
plugins: cov-2.12.1, parallel-0.1.0
collected 1 item                                                               

tests/test_dyn.py F                                                      [100%]

=================================== FAILURES ===================================
___________________________________ test_dyn ___________________________________

    def test_dyn():
        # hard test of dynamic sampler with high dlogz_init and small number
        # of live points
        ndim = 2
        bound = 'multi'
        rstate = get_rstate()
        sampler = dynesty.DynamicNestedSampler(loglike_egg,
                                               prior_transform_egg,
                                               ndim,
                                               nlive=nlive,
                                               bound=bound,
                                               sample='unif',
                                               rstate=rstate)
        sampler.run_nested(dlogz_init=1, print_progress=printing)
        logz_truth = 235.856
>       assert (abs(logz_truth - sampler.results.logz[-1]) <
                5. * sampler.results.logzerr[-1])
E       assert 0.29720712094015767 < (5.0 * 0.04555387445144435)
E        +  where 0.29720712094015767 = abs((235.856 - 236.15320712094015))

tests/test_dyn.py:39: AssertionError
=========================== short test summary info ============================
FAILED tests/test_dyn.py::test_dyn - assert 0.29720712094015767 < (5.0 * 0.04...
========================= 1 failed in 75.96s (0:01:15) =========================

Presumably there are more. It's also probably worth rerunning the test from https://github.com/joshspeagle/dynesty/issues/289. I haven't yet decided on a strategy here, but my preferred one is to update the defaults so that we still fall within 5 sigma, rather than bumping up the thresholds (in fact, there are already a few places in the tests where the thresholds are pretty large).

joshspeagle commented 3 years ago

Agreed on both counts.

segasai commented 3 years ago

Most of these failures disappear if I enable bootstrap, so I think we should consider using bootstrap (only) when the sampler is 'unif', because only then is it essential that the boundary is correct. Alternatively, I was thinking of computing an expansion factor that depends on the dimension and the number of points.

After that I looked at the code and saw this https://github.com/joshspeagle/dynesty/blob/08b2b8d46a7b173bc8049e9d865701025631e142/py/dynesty/dynamicsampler.py#L442 where we allegedly already do this. But because of this https://github.com/joshspeagle/dynesty/blob/08b2b8d46a7b173bc8049e9d865701025631e142/py/dynesty/dynesty.py#L484 and the fact that the default is bootstrap=0, it was never activated... So I'll change the default bootstrap to None, which will allow us to set it automatically (i.e. None will mean automatic).
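To make the intent concrete, here is a minimal sketch of the "None means automatic" handling (not the actual dynesty code; the function name and the concrete count are purely illustrative):

def resolve_bootstrap(bootstrap, sample):
    # bootstrap=None means "decide automatically": enable bootstrapping only
    # for uniform sampling, where an accurate bounding distribution is
    # essential, and keep it off otherwise. The count 20 is illustrative.
    if bootstrap is None:
        return 20 if sample == 'unif' else 0
    return bootstrap  # an explicit user value always wins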

segasai commented 3 years ago

I don't think this should be closed yet. It is not a blocker anymore, but I've run the tests in a loop over seeds from 0 to 99, and here are the results:

      3 FAILED tests/test_dyn.py::test_dyn
      4 FAILED tests/test_ellipsoid.py::test_sample
      1 FAILED tests/test_ellipsoid.py::test_sample_q
      9 FAILED tests/test_gau.py::test_bounding_sample[none-rslice]
     11 FAILED tests/test_gau.py::test_bounding_sample[none-rstagger]
     11 FAILED tests/test_gau.py::test_bounding_sample[none-rwalk]
     10 FAILED tests/test_gau.py::test_bounding_sample[none-slice]
      5 FAILED tests/test_gau.py::test_dynamic
      2 FAILED tests/test_misc.py::test_exc
      7 FAILED tests/test_periodic.py::test_periodic

I don't think any of those are actual bugs, although it's weird that so many tests involving the 'none' bound, or test_periodic, fail.

Ideally we shouldn't get any failures here... Also, test_gau (which was the first test written) uses a logz tolerance of 1 rather than logz_err, so it's not really a full test. It'd be nice to fix these things eventually.

joshspeagle commented 3 years ago (his points are quoted as the bullets below)

segasai commented 3 years ago
  • The failures with none-[method] in test_gau.py are likely due to the fact that the Gaussian has large off-diagonal elements; since the proposals don't take advantage of the auto-correlation structure, they are probably strongly correlated and prone to over-estimate the evidence. Increasing the number of walks or slices should probably resolve those. Alternately, I don't think those combinations are all that useful, since 'none' should ideally never be used with those settings (except implicitly by setting ncdim); we could consider removing them.

I guess the reason I added those was to have at least some code coverage for all of these combinations, but we may consider moving these tests to some simpler problem...
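For reference, the kind of tweak meant by "increasing the number of walks or slices" would look roughly like the sketch below (illustrative values only; loglike_gau, prior_transform_gau, ndim and rstate stand in for the test's own likelihood, prior transform and setup):

import dynesty

# Raise `walks` above the default to decorrelate the random-walk proposals for
# the bound='none' cases; sample='slice' would analogously get a larger
# `slices` value. The numbers here are illustrative, not tuned.
sampler = dynesty.NestedSampler(loglike_gau, prior_transform_gau, ndim,
                                bound='none', sample='rwalk',
                                walks=50, rstate=rstate)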

  • I suspect the test_dyn and test_periodic tests would be resolved if the default number of bootstraps was increased. We're currently using 5 IIRC, whereas in Buchner (2014) the recommended number of bootstraps is substantially higher (of order 50). I think dynesty used to use 20-ish by default?

I kinda thought 20 was overkill, but maybe we do need to set it back to that.

  • I agree that test_dynamic in test_gau looks like it should use logz_err rather than just a tolerance of 1. It also calls simulate_run, which should be deprecated and removed. Also, the tests for jitter_run and resample_run should probably be changed to use those two functions to estimate logz_err, instead of just passing the random realization to check_results_gau.

I agree these need to be updated. I initially coded that in a bit of a rush, and I also wanted to speed up the tests, as initially they were doing multiple runs. Also, resample_run is partly exercised there for code coverage.
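A hedged sketch of that suggested change, using the public dynesty.utils functions (exact signatures and the helper name here may differ between versions):

import numpy as np
from dynesty import utils as dyfunc

def estimate_logz_scatter(results, nrepeat=100):
    # Estimate the logz uncertainty by repeatedly perturbing the run and
    # measuring the scatter of the final logz, instead of passing a single
    # random realization on to check_results_gau.
    logzs = []
    for _ in range(nrepeat):
        new_run = dyfunc.resample_run(results)  # bootstrap-resample the particle paths
        new_run = dyfunc.jitter_run(new_run)    # re-draw the prior volume compressions
        logzs.append(new_run.logz[-1])
    return np.std(logzs)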

  • What is happening with the failed test_exc and test_ellipsoid tests? Any ideas? test_ellipsoid is harmless I think.

It fails like this:

>           assert ((pval > 0.003) & (pval < 0.997))
E           assert (0.00016418660411632424 > 0.003 & 0.00016418660411632424 < 0.997)

and I think this is harmless. It should fail in ~0.6% of cases, which is not that far off the ~2.5% failure rate seen here, so I was planning to loosen the p-value thresholds.
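For reference, a quick back-of-the-envelope check (assuming the p-value is well calibrated) of how often this gate should trip by chance over the seed sweep:

# Two-sided gate (0.003, 0.997): a healthy run fails it with probability ~0.006.
alpha = 0.003 + (1 - 0.997)
n_seeds = 100
p_at_least_one = 1 - (1 - alpha) ** n_seeds
print(f"P(>=1 spurious failure over {n_seeds} seeds) = {p_at_least_one:.2f}")  # ~0.45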

On test_exc, it's just that this code

import numpy as np

# MyException is a custom exception class defined in the test module.
def loglike_exc(x):
    r2 = np.sum(x**2)
    if r2 < 0.01:
        raise MyException('ooops')
    return -0.5 * r2

didn't trigger the exception, so we need to bump up the 0.01 threshold (to make the exception-raising region easier to hit).

segasai commented 3 years ago

As a final push to close all the issues, I've made PR #316, which, on top of other fixes, includes my diagnostic code for the bounding-volume performance.

This code first shows that there have been drastic improvements in the performance of the multi-ellipsoidal approximation thanks to https://github.com/joshspeagle/dynesty/pull/286

Here is performance without bootstrap

[two figures: fraction of volume missed vs. nlive, in 2d and 10d]

(Again, this shows the fraction of points missed by our multi-ellipsoid representation, in 2d and 10d, as a function of nlive.)

This looks sensible-ish to me. But what's weird is that in low dimensions (2d), the bootstrap (even bootstrap=20) fails to work at ~100 live points.

See here: [figure: missed volume fraction in 2d with bootstrap enabled]

And this is an example of the problematic case. Blue are the 104 live points, and red are other samples from the shell: [figure: scatter of live points vs. shell samples]

And for some reason, even with bootstrap=20, we are missing ~40% of the volume here. This is the code:

import numpy as np  # test_volume is the diagnostic module from PR #316

rstate1 = np.random.default_rng(1)
rstate2 = np.random.default_rng(2)
A, volume_in, live_points, extra_samples, nell = test_volume.doit(
    104, 'shell', 2, 'multi', bootstrap=20, rstate_data=rstate1, rstate_dyn=rstate2)

In this case we are 'shredding' the 104 points into 15 ellipsoids. I think overall the new ellipsoidal splitting works very well, but I'm still puzzled why we are missing so much volume here and why bootstrapping doesn't fix it...

segasai commented 3 years ago

Further testing revealed that bumping bootstrap up to 100 does resolve the issue with the shell shown above (which also shows that the distribution of bootstrap expansion factors can be extremely asymmetric). That led to another solution that I tried, which reduces the failure rate by roughly a factor of a few: cross-validation instead of bootstrap. I.e. split the points into groups, fit using all but one group, and then test on the left-out group. I think that's better because all the points are guaranteed to end up in a test set at least once. In my case the default bootstrap=5 for the shell in 2d leads to 8% of cases missing more than 10% of the volume, while the CV approach seems to reduce that to 1.3% of cases. So I'm intending to implement that.
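A minimal sketch of the cross-validation flavour of the check (the fit_ellipsoids/contains helpers are hypothetical; the real implementation would turn the held-out misses into an expansion factor, as the bootstrap does):

import numpy as np

def cv_missed_fraction(points, fit_ellipsoids, k=5, rstate=None):
    # Split the points into k groups; for each group, fit the multi-ellipsoid
    # decomposition on the other k-1 groups and count how many held-out points
    # fall outside it. Every point appears in a test set exactly once.
    rstate = rstate if rstate is not None else np.random.default_rng()
    idx = rstate.permutation(len(points))
    fracs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        ells = fit_ellipsoids(points[train])          # hypothetical fitter
        missed = [not ells.contains(p) for p in points[fold]]
        fracs.append(np.mean(missed))
    return max(fracs)  # worst-case missed fraction over the folds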

segasai commented 3 years ago

The systematic sweep over ndim/shapes/nlives showed that the CV-like scheme produces somewhat fewer large outliers, but more small-scale volume underestimates, so for the moment the idea is abandoned. Otherwise #316 is ready to be reviewed/merged.

joshspeagle commented 3 years ago

Thanks for looking into this. I'll look into merging #316 in soon then.