joshspeagle / dynesty

Dynamic Nested Sampling package for computing Bayesian posteriors and evidences
https://dynesty.readthedocs.io/
MIT License

Run-to-run instability of logZ #473

Closed: asmasca closed this issue 4 months ago

asmasca commented 6 months ago

Dynesty version: 2.1.2, installed via pip

Your question: Hello. I am testing the consistency of dynesty's results for a case of "blind" Bayesian inference on radial velocity time series, and I am finding a run-to-run variability of up to 3 (RMS) in logZ when re-running the exact same model on the exact same data. However, the determined parameters are the same almost to numerical precision. I'm a bit at a loss as to what might be going on. I wonder if I am doing something in a way that is not expected, or getting into some edge case where the default configuration is not the most adequate. I'm testing up to 1000 live points for what I think is quite a simple low-dimensional case (5 dimensions). Might it be that I need more?

The whole exercise goes as follows: I have generated synthetic data that includes noise and one sinusoidal signal, easily detectable with a Lomb-Scargle periodogram:

plot_data

Then I try to model the data with a sinusoid, a zero point and one noise term, using a very wide prior for the frequency of the signal (U(0, 0.5)). The priors are created with scipy.stats, and the likelihood function is a very simple one:

import numpy as np
from scipy import stats

two_pi = 2 * np.pi

def loglike_blind(cube):
    # x (times), y (RVs), sy (RV uncertainties) and the number of planets
    # are taken from the enclosing scope.
    params = cube

    l_jit = params[0]   # log of the extra white-noise (jitter) term
    v0 = params[1]      # zero point

    planet_params = params[2:]

    kep = x * 0
    for pl in range(planets):
        kpl = planet_params[3*pl]          # semi-amplitude
        ppl = 1 / planet_params[3*pl+1]    # period (the prior is on the frequency)
        phpl = planet_params[3*pl+2]       # phase
        t0pl = x[-1] + ppl * (phpl - 1)
        kep += -kpl * np.sin(two_pi * (x - t0pl) / ppl)

    y1 = y - v0 - kep

    sy1 = np.sqrt(sy**2 + np.exp(2 * l_jit))

    return np.sum(stats.norm.logpdf(y1, loc=0, scale=sy1))
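
For reference, the prior_transform maps the unit cube through the inverse CDFs of the scipy.stats priors; a minimal sketch of such a transform (only the U(0, 0.5) frequency prior is the one stated above, the other distributions and bounds are illustrative placeholders):

import numpy as np
from scipy import stats

# Illustrative priors: only the U(0, 0.5) frequency prior is fixed above,
# the other distributions/bounds are placeholders.
priors = [
    stats.uniform(loc=-10, scale=20),    # l_jit: log of the jitter term
    stats.uniform(loc=-100, scale=200),  # v0: zero point
    stats.uniform(loc=0, scale=100),     # K: semi-amplitude
    stats.uniform(loc=0, scale=0.5),     # f: frequency, U(0, 0.5)
    stats.uniform(loc=0, scale=1),       # phase
]

def prior_transform(u):
    # Map a point of the unit hypercube to parameter space via the
    # inverse CDF (ppf) of each prior.
    return np.array([p.ppf(ui) for p, ui in zip(priors, u)])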

The sampler is configured simply as:

sampler = dynesty.NestedSampler(loglike_blind, prior_transform, ndim, nlive=nlive,
                                pool=pool, queue_size=queue, sample=sample_strat)
sampler.run_nested(dlogz=0.01, print_progress=False)

with "sample_strat" being rwalk, rslice or slice.

I tested with a number of live points from 10xNdim up to 200xNdim, with the rwalk, rslice and slice samplers. For each number of live points, I repeated the computation 5 times. With all of them I found good stability in the determination of the parameters of the system (except at very low numbers of live points), but not such good stability in the determination of the logZ. With the rwalk sampler, the standard deviation of the logZ over the 5 runs oscillated between 1 and 3, with the rslice sampler between 1 and 2, and with the slice sampler between 0.5 and 1. I would say a stdev of 1 is acceptable for this kind of test, but it was still much larger than the errors reported by the sampler.
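
For completeness, the repeat-run check is just a loop over independent runs, comparing the empirical scatter of logZ with the uncertainty the sampler itself reports. A minimal sketch, assuming the loglike_blind and prior_transform defined above (the run_once helper and the seed handling are illustrative, not the exact script):

import numpy as np
import dynesty

def run_once(nlive, sample_strat, seed):
    # One independent static run; returns the final logZ and its reported error.
    rstate = np.random.default_rng(seed)
    sampler = dynesty.NestedSampler(loglike_blind, prior_transform, ndim,
                                    nlive=nlive, sample=sample_strat,
                                    rstate=rstate)
    sampler.run_nested(dlogz=0.01, print_progress=False)
    res = sampler.results
    return res.logz[-1], res.logzerr[-1]

logzs, logzerrs = zip(*[run_once(700, 'rwalk', seed) for seed in range(5)])
print('empirical stdev:', np.std(logzs), 'mean reported error:', np.mean(logzerrs))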

At least within the range of live points I tested, I can't see any significant improvement in the consistency from adding more live points.

For comparison, I tested the same data and likelihood with emcee (differential evolution move) and computed the logZ using the Perrakis approximation. With emcee, the run-to-run variation was smaller than 0.1 as soon as I hit 5-10 walkers per dimension. The stability of the determination of the parameters, however, was worse than with the slice and rslice samplers for an equivalent number of live points.

results_1p

To show how stable the parameters are, here's another pass I did with the rwalk sampler, showing the evolution of the determination of all parameters in the model. In each case there are 5 points with error bars, one per run. The parameters are so similar that it is difficult to tell them apart. Even in cases with a peak-to-peak variation in logZ of 10, the determined parameters are exactly the same.

params_evol

In case it helps, here are the runplot and traceplot of two runs with very different logZ. The trace plots of the inconsistent results are fairly different, but I'm not completely sure how to interpret the differences (trace plots at the end). In one of them there seems to be a sort of jump halfway through the process, and the position of the importance-weighted PDF moves significantly between them.
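
(The runplot/traceplot figures are dynesty's built-in diagnostics; they can be reproduced from a results object along these lines:)

from dynesty import plotting as dyplot

res = sampler.results
fig, axes = dyplot.runplot(res)     # live points, likelihood, weights and evidence vs. iteration
fig, axes = dyplot.traceplot(res)   # traces and 1-D marginalized posteriors for each parameter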

Case 1: rwalk, 700 live points, run 3: logZ -244.36 +- 0.19

runplot_rwalk_700_3 traceplot_rwalk_700_3

Case 2: rwalk, 700 live points, run 4: logZ -253.34 +- 0.22

runplot_rwalk_700_4 traceplot_rwalk_700_4

segasai commented 6 months ago

Hi,

Thanks for the detailed question. There are multiple points/questions/possible issues here.

Do you know the true value of the evidence for this particular problem?

Thanks

asmasca commented 6 months ago

Hi, thanks for the suggestions. I tested a few of them, and the results improved.

First I tried going up to 1000x Ndim live points for the regular NS, and nothing really changed except providing smoother trace plots (which I guess is expected).

Run-to-run variation in logZ -- that is expected from nested sampling, as it relies on prior volume estimates which have irreducible noise in them.

I expected some run-to-run variation, but not that large. We usually rely on the logZ for model comparison, and having a peak-to-peak over 5 runs of up to 10 makes things difficult. A difference in logZ of 5 between models is usually considered strong evidence in favour of the model with the higher logZ.

If you want more accurate logz estimates, you should try using dynamic sampling, which will add points where needed to improve accuracy.

I tried dynamic sampling with different configurations. The difference can be quite large. The results are more in line with my expectations and needs. I could not test all configurations, as the time it takes grows a lot, but what I have is:

NS Rwalk logZ stdev: 2.21 +- 0.92
NS Rslice logZ stdev: 1.69 +- 0.56
NS Slice logZ stdev: 0.89 +- 0.32

dNS Rwalk 0/100 logZ stdev: 1.96 +- 0.75 (0% evidence, 100% posterior)
dNS Rwalk 35/65 logZ stdev: 2.26 +- 0.96
dNS Rwalk 65/35 logZ stdev: 1.53 +- 0.68
dNS Rwalk 80/20 logZ stdev: 1.12 +- 0.36 (beyond 80% evidence, it gets too slow to be practical)

dNS Slice 80/20 logZ stdev: 0.5 +- 0.12 (incomplete, will be running for 6 more hours).

With these results, I would say the regular NS works well for exploratory work (it's super fast), but for model comparison the dynamic NS with slice sampling and an 80% weight/stop on the evidence is necessary (for my case). This is not a problem on its own; one of the goals of this exercise was finding the sweet spot between efficiency and accuracy. Although I have to say, watching it go from a couple of seconds to 2-3 minutes per run made me sad.
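
For reference, the dynamic runs above were set up roughly as follows; the evidence/posterior split goes in through wt_kwargs, where pfrac is the fraction of weight placed on the posterior, so an 80/20 evidence/posterior split corresponds to pfrac=0.2 (nlive_init and the sample choice below are illustrative):

dsampler = dynesty.DynamicNestedSampler(loglike_blind, prior_transform, ndim,
                                        pool=pool, queue_size=queue,
                                        sample='slice')
# 80% of the batch weight on the evidence, 20% on the posterior
dsampler.run_nested(nlive_init=500, wt_kwargs={'pfrac': 0.2},
                    print_progress=False)
dres = dsampler.results
print(dres.logz[-1], dres.logzerr[-1])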

Currently, I set the number of initial live points (nlive_init), but I don't touch the number of live points per batch or the number of batches. Do you have any recommendation on how to configure these? The code spends most of its time in this part.

Can you please share your full testing code? This looks like a good test case.

Sure, no problem. I'll clean and comment the code, and I'll share it with you.

Do you know the true value of the evidence for this particular problem?

Unfortunately no. The data also has some white and correlated noise components (to simulate real data), which I guess would make it difficult to estimate.

I will compare the results with the output of Ultranest, Multinest and Polychord, because my collaborators use them. I'll update here once I have the results.

Cheers and thank you

ajdittmann commented 6 months ago
  • Regarding the logz uncertainties. I did some testing a year or two ago and was reasonably confident that the errors are somewhat realistic, but there are certainly a lot of approximations made when computing them (and we had a long discussion on their calculation in Calculation of H and uncertainties #306).

My experience has been that this is usually the case, but sometimes the uncertainties will be too small. I was testing this the other day with a variation of the log-gamma problem, and there were a few cases where the true errors in logZ were ~5 but the reported uncertainties were ~0.4. If this is of interest, I can try to investigate this more systematically and share the results. MultiNest would report smaller uncertainties while having actual logZ errors on the order of 100s or 1000s, which was my main interest at the time.

segasai commented 6 months ago

NS Rwalk logZ stdev: 2.21 +- 0.92
NS Rslice logZ stdev: 1.69 +- 0.56
NS Slice logZ stdev: 0.89 +- 0.32
dNS Rwalk 0/100 logZ stdev: 1.96 +- 0.75 (0% evidence, 100% posterior)
dNS Rwalk 35/65 logZ stdev: 2.26 +- 0.96
dNS Rwalk 65/35 logZ stdev: 1.53 +- 0.68
dNS Rwalk 80/20 logZ stdev: 1.12 +- 0.36 (beyond 80% evidence, it gets too slow to be practical)
dNS Slice 80/20 logZ stdev: 0.5 +- 0.12 (incomplete, will be running for 6 more hours)

I personally prefer to use rslice as it has better tuning properties (but I also rarely go after the evidence).

With these results, I would say the regular NS works well for exploratory work (it's super fast), but for model comparison the dynamic NS with slice sampling and an 80% weight/stop on the evidence is necessary (for my case). This is not a problem on its own; one of the goals of this exercise was finding the sweet spot between efficiency and accuracy. Although I have to say, watching it go from a couple of seconds to 2-3 minutes per run made me sad.

Hard to say if that's really too slow or not. I'd say that for a single CPU and a high-dimensional problem with an estimate of the integral, 2-3 minutes seems sensible.

Currently, I set the number of initial live points (nlive_init), but I don't touch the number of live points per batch or the number of batches. Do you have any recommendation on how to configure these? The code spends most of its time in this part.

The number of batches should be decided automatically by the stopping criterion. Regarding the number of live points for the initial run vs. the batches: I'd say the number of initial live points needs to be increased when you worry about missing posterior modes. If that's not an issue, I think the defaults would be fine.
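
For concreteness, a sketch of the knobs involved, reusing the dynamic sampler construction from the snippet above (the values are illustrative; with use_stop=True the number of batches is decided automatically by the stopping criterion):

dsampler.run_nested(nlive_init=1000,   # increase if missing posterior modes is a worry
                    nlive_batch=500,   # live points added per batch (default is usually fine)
                    use_stop=True,     # let the stopping criterion decide how many batches to run
                    wt_kwargs={'pfrac': 0.2},
                    print_progress=False)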

Can you please share your full testing code? This looks like a good test case.

Sure, no problem. I'll clean and comment the code, and I'll share it with you.

great, thanks

Do you know the true value of the evidence for this particular problem?

Unfortunately no. The data also has some white and correlated noise components (to simulate real data), which I guess would make it difficult to estimate.

I will compare the results with the output of Ultranest, Multinest and Polychord, because my collaborators use them. I'll update here once I have the results.

segasai commented 6 months ago
  • Regarding the logz uncertainties. I did some testing a year or two ago and was reasonably confident that the errors are somewhat realistic, but there are certainly a lot of approximations made when computing them (and we had a long discussion on their calculation in Calculation of H and uncertainties #306).

My experience has been that this is usually the case, but sometimes the uncertainties will be too small. I was testing this the other day with a variation of the log-gamma problem, and there were a few cases where the true errors in logZ were ~5 but the reported uncertainties were ~0.4. If this is of interest, I can try to investigate this more systematically and share the results. MultiNest would report smaller uncertainties while having actual logZ errors on the order of 100s or 1000s, which was my main interest at the time.

I would certainly be interested in seeing cases of a wildly incorrect logz error (assuming a not-too-pathological posterior). A couple of years ago I derived a different equation for the logz error uncertainty, but I didn't have a good test case to see whether it is an improvement over what we have now. The only case where a logz deviation from the true value much larger than the reported uncertainty is somewhat expected is when using MCMC (rwalk/slice) proposals with too few or too small steps -- that is guaranteed to bias the logz (and there is no easy fix for that other than doing more MCMC steps).
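
For reference, the number of MCMC steps per proposed point can be raised directly at construction time; a minimal sketch (the values are illustrative, not tuned recommendations):

# more random-walk steps per proposed point (sample='rwalk')
sampler = dynesty.NestedSampler(loglike_blind, prior_transform, ndim,
                                nlive=500, sample='rwalk', walks=50)

# more slice passes per proposed point (sample='rslice' or 'slice')
sampler = dynesty.NestedSampler(loglike_blind, prior_transform, ndim,
                                nlive=500, sample='rslice', slices=10)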

asmasca commented 6 months ago

Just a quick comment, after some tests finished last night. The final stability values I got for the rwalk, rslice and slice samplers are:

dNS Rwalk 80/20 logZ stdev: 1.12 +- 0.36
dNS Random slice 80/20 logZ stdev: 0.67 +- 0.28
dNS Slice 80/20 logZ stdev: 0.4 +- 0.15

With these results, all three samplers are valid for our usual criterion (Delta logZ of 5 ~ 3 sigma), with no obvious dependence on the number of live points (tested from 50 to 1000, for 5 dimensions).

Here's the same plot as in the original post, with the evolution of the logZ and the parameters as a function of the number of live points, this time for the dynamic NS with the slice sampler. The vertical axis of the logZ panel is the same as before, for an easier visual comparison.

With this, I think my original question can be considered solved. It was all a matter of using a sub-optimal configuration for my problem. Thank you for pointing me in the right direction.

params_evol

Hard to say if that's really too slow or not. I'd say that for a single CPU and a high-dimensional problem with an estimate of the integral, 2-3 minutes seems sensible.

It's certainly within reason, and still much faster than running a regular MCMC. Although the scaling to more "realistic" problems worries me a bit: this is a 5-dimensional model, and we usually work with up to 30-50 dimensions.

*This was on a 2x Epyc system with 112 cores, although with such a small problem most of the CPUs were idling.

A question regarding the time it takes in different configurations: in the regular NS, the time increases with the number of live points (which I understand). In the dynamic NS, the time decreases with an increased number of initial live points. Is this because the integral is better defined by the time the "dynamic" part starts, so it can then be solved with fewer batches?

segasai commented 6 months ago

With this, I think my original question can be considered solved. It was all a matter of using a sub-optimal configuration for my problem. Thank you for pointing me in the right direction.

That's great! (I saw earlier that you were testing version 2.1.2. I would suggest double-checking with 2.1.3; I am not expecting significant improvements/regressions, but it is better to test against the latest version of the code.)

I would still appreciate it if you could share your test problem. I would still like to understand what's happening with the static nested run, and it's good to have a collection of difficult problems for the tests.


Hard to say if that's really too slow or not. I'd say that for a single CPU and a high-dimensional problem with an estimate of the integral, 2-3 minutes seems sensible.

It's certainly within reason, and still much faster than running a regular MCMC. Although the scaling to more "realistic" problems worries me a bit: this is a 5-dimensional model, and we usually work with up to 30-50 dimensions.

My main worry when sampling really high-dimensional spaces is usually less the dimensionality itself and more whether there are multiple modes or very complex posterior shapes, as it is very easy to miss modes in such a huge volume; but if the posterior is reasonably benign, I think it is not too hard to sample.

*This was on a 2x Epyc system with 112 cores, although with such a small problem most of the CPUs were idling.

A question regarding the time it takes in different configurations: in the regular NS, the time increases with the number of live points (which I understand). In the dynamic NS, the time decreases with an increased number of initial live points. Is this because the integral is better defined by the time the "dynamic" part starts, so it can then be solved with fewer batches?

I don't have a formula at hand, but: 1) the initial run of the dynamic sampling will take the same time as a static run with the same number of live points; 2) the number of required batches will typically be lower the more live points you have in your initial run.

asmasca commented 6 months ago

Here's the code I'm using to run it: https://1drv.ms/u/s!AvRFptpPOuRiwrN40-Q3CRuFZAxCrw?e=VLiZ8z

The zip file includes the data (and a script to create the data) for the cases of 1, 2 and 3 signals, plus the scripts and functions needed to run it. All of this discussion was about the 1-signal case. I plan to test the performance with multiple signals (which I expect will be harder for the sampler).

Ps: For reference, and in case it is useful for you, I'm running the tests with Ultranest (reactive NS). The logZ consistency is similar to that of rslice in dynamic nested sampling. From my understanding of Ultranest, this is normal.
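
For reference, the Ultranest runs follow the usual ReactiveNestedSampler pattern; a minimal sketch (the parameter labels are illustrative):

import ultranest

param_names = ['log_jit', 'v0', 'K', 'f', 'phase']   # illustrative labels
usampler = ultranest.ReactiveNestedSampler(param_names, loglike_blind, prior_transform)
uresult = usampler.run(show_status=False)
print(uresult['logz'], uresult['logzerr'])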

Cheers and thanks for the help