joshspeagle / dynesty

Dynamic Nested Sampling package for computing Bayesian posteriors and evidences
https://dynesty.readthedocs.io/
MIT License

calculation of H and uncertainties. #306

Closed segasai closed 3 years ago

segasai commented 3 years ago

Several things emerged in the series of commits where I tried to separate the repeated calculations of logz/logwt/h into one function, compute_integrals(): https://github.com/joshspeagle/dynesty/pull/305/commits/6389e923872deb0c027735923243fc0608d1c8fd

1) Several places use a different multiplier for the logzvar calculation (either one or two). I still don't fully understand the derivation of that, so the question is which one it is. I'm assuming it needs to be the same everywhere?

2) Another thing is that I have the same code that I used for the jitter_run refactor, which is a vectorized version of the calculation.

import math

import numpy as np


def compute_integrals(logl=None, logvol=None):
    # Compute weights using quadratic estimator.                                
    loglstar_pad = np.concatenate([[-1.e300], logl])

    # we want log(exp(logvol_i)-exp(logvol_(i+1)))                              
    # assuming that logvol0 = 0                                                 
    # log(exp(LV_{i})-exp(LV_{i+1})) =                                          
    # = LV{i} + log(1-exp(LV_{i+1}-LV{i}))                                      
    # = LV_{i+1} - (LV_{i+1} -LV_i) + log(1-exp(LV_{i+1}-LV{i}))                
    dlogvol = np.diff(logvol, prepend=0)
    logdvol = logvol - dlogvol + np.log1p(-np.exp(dlogvol))

    # logdvol is log(delta(volumes)), i.e. log(X_i - X_{i+1})                   
    logdvol2 = logdvol + math.log(0.5)
    # These are log((X_i - X_{i+1})/2)                                          

    dlogvol = -np.diff(logvol, prepend=0)
    # these are the delta(log(volumes)) of the run                              

    # These are log((L_i + L_{i+1}) * (X_i - X_{i+1}) / 2)                      
    saved_logwt = np.logaddexp(loglstar_pad[1:], loglstar_pad[:-1]) + logdvol2
    saved_logz = np.logaddexp.accumulate(saved_logwt)
    # This implements eqn 16 of Speagle2020                                     

    logzmax = saved_logz[-1]
    # we'll need that to just normalize likelihoods to avoid overflows          

    # H is defined as                                                           
    # H = 1/z int( L * ln(L) dX,X=0..1) - ln(z)                                 
    # incomplete H can be defined as                                            
    # H = int( L/Z * ln(L) dX,X=0..x) - z_x/Z * ln(Z)                           
    h_part1 = np.cumsum(
        (np.exp(loglstar_pad[1:] - logzmax + logdvol2) * loglstar_pad[1:] +
         np.exp(loglstar_pad[:-1] - logzmax + logdvol2) * loglstar_pad[:-1]))
    # here we divide the likelihood by zmax to avoid overflow                   
    saved_h = h_part1 - logzmax * np.exp(saved_logz - logzmax)
    # changes in h in each step                                                 
    dh = np.diff(saved_h, prepend=0)

    saved_logzvar = np.cumsum(dh * dlogvol)
    return saved_logwt, saved_logz, saved_logzvar, saved_h
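
For reference (not in the original comments): the last two lines are the discrete form of Skilling's classic H/K error estimate. Writing dh_j and dlogvol_j from the code as \Delta H_j and |\Delta \ln X_j| and assuming a roughly constant spacing |\Delta \ln X_j| \approx 1/K for K live points, the cumulative sum telescopes to

\mathrm{Var}[\ln Z_i] \;\approx\; \sum_{j \le i} \Delta H_j \,\lvert \Delta \ln X_j \rvert \;\approx\; \frac{H_i}{K}.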

While this code gives exactly the same answers for wt and logz as the existing compute_integrals(), it differs for h and logzvar. And I think the reason is different assumptions about what H_i is. In my implementation (and that's how I interpreted A47 from your paper), the H_i are integrals (A42) over the volume where L < L_i (or X > X_i).

I've done a test on a single 2D Gaussian centred in a cube (which is now easy to do with the dedicated function):

import numpy as np

import dynesty

s0 = 0.01
N = 10000
x1 = 0.5
x2 = 1e-30
rs = np.exp(np.linspace(np.log(x1), np.log(x2), N))
logvol = np.log(np.pi * rs**2)
logl = np.log(1 / (2 * np.pi) / s0**2) - 0.5 * (rs / s0)**2
R = dynesty.utils.compute_integrals(logl=logl, logvol=logvol)

The current code https://github.com/segasai/dynesty/blob/fe65dbcf84eca22d701bf634dbc3e1272dc2aec0/py/dynesty/utils.py#L588 gives the following H values

array([2.23211807, 5.23347535, 5.24716771, ..., 6.37243212, 6.37243212,
       6.37243212])

While my implementation (shown above) gives this

array([-0.        , -0.        , -0.        , ...,  6.37243212,
        6.37243212,  6.37243212])

This certainly makes more sense to me, as the H integral should start from zero, while the current calculation does not. Both give the same final answer, though (which matches the analytic value for this 2D Gaussian, H = ln(1/(2*pi*s0^2)) - 1 ~= 6.3724 for s0 = 0.01). Obviously, because the H() values are different, the errors will be different too. So the question is whether the current code is correct or not (w.r.t. the H calculation).

3) The final point is that, now that the code that converts {L_i}, {Vol_i} into Z is separated out, one can potentially have separate validation/testing of it (especially the uncertainty part).

joshspeagle commented 3 years ago
  1) Several places use a different multiplier for the logzvar calculation (either one or two). I still don't fully understand the derivation of that, so the question is which one it is. I'm assuming it needs to be the same everywhere?

The extra factor of 2 should be used everywhere that the total (both sampling and statistical) uncertainties in logz need to be used for calculations. I believe this is probably everywhere, but could you point me to where the current code does not include the factor of 2?

  2) Another thing is that I have the same code that I used for the jitter_run refactor, which is a vectorized version of the calculation.

Glancing things over, your version should be a more stable estimator (in addition to being cleaner and vectorized), so I think we should use that instead.

  3) The final point is that, now that the code that converts {L_i}, {Vol_i} into Z is separated out, one can potentially have separate validation/testing of it (especially the uncertainty part).

Yes, this is quite a nice feature. Thanks for adding it in!

segasai commented 3 years ago
  1) Several places use a different multiplier for the logzvar calculation (either one or two). I still don't fully understand the derivation of that, so the question is which one it is. I'm assuming it needs to be the same everywhere?

The extra factor of 2 should be used everywhere that the total (both sampling and statistical) uncertainties in logz need to be used for calculations. I believe this is probably everywhere, but could you point me to where the current code does not include the factor of 2?

Here's one example (unravel_run): https://github.com/joshspeagle/dynesty/pull/305/commits/fe65dbcf84eca22d701bf634dbc3e1272dc2aec0 I'm happy to keep the factor of two everywhere.

  2) Another thing is that I have the same code that I used for the jitter_run refactor, which is a vectorized version of the calculation.

Glancing things over, your version should be a more stable estimator (in addition to being cleaner and vectorized), so I think we should use that instead.

Yes, but what concerns me still is the drastically different H(i) and different values for variances.

joshspeagle commented 3 years ago

Here's one example (unravel_run): fe65dbc I'm happy to keep the factor of two everywhere.

Okay, that's definitely a typo if I don't throw in the factor of two later. Thanks for catching those and adding them in.

Yes, but what concerns me still is the drastically different H(i) and different values for variances.

And I think the reason is different assumptions about what H_i is. In my implementation (and that's how I interpreted A47 from your paper), the H_i are integrals (A42) over the volume where L < L_i (or X > X_i).

I think this is exactly it. In the proposed version, H_i should increase the same way Z_i should increase (monotonically from 0). The arguments relating H_i to the variance in logZ_i are sound, so I think the change makes sense.

joshspeagle commented 3 years ago

I think the real test is to confirm the answers are the same for a dynamic sampling run where the distribution of volume spacings varies.

segasai commented 3 years ago

I have hacked together a simple test case that samples an n-d Gaussian (currently with a fixed number of points), where we can actually assess the variances: https://colab.research.google.com/drive/1r4LDG4I7CBTieWPrWpG8vOgrhINsZ4UC?usp=sharing At the very bottom of the notebook I randomly resample it and run the compute_integrals code. My vectorized version certainly gives a correct final variance (but not throughout the run). The compute_integrals that is based on the existing dynesty code is somewhat off.
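
For reference, a minimal sketch of that resampling check (assumptions: logl is the sorted array of dead-point log-likelihoods from a run with a fixed number of live points nlive, and compute_integrals is the function shown above):

import numpy as np

def resample_logvol(niter, nlive, rng):
    # Skilling shrinkage: t_i is the largest of nlive uniforms, so ln t_i = ln(U) / nlive,
    # and ln V_i is the cumulative sum of the ln t_i
    logt = np.log(rng.uniform(size=niter)) / nlive
    return np.cumsum(logt)

rng = np.random.default_rng(1)
logz_sims = []
for _ in range(500):
    logvol_sim = resample_logvol(len(logl), nlive, rng)
    _, logz_sim, _, _ = compute_integrals(logl=logl, logvol=logvol_sim)
    logz_sims.append(logz_sim[-1])
print(np.std(logz_sims))  # scatter of logZ driven purely by the volume uncertainty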

segasai commented 3 years ago

I think the fact that the variance of log_Z_i actually decreases with iterations shows that the formalism presented in appendix A* is not quite correct. You can't really think of logZ as some RV that keeps accumulating noise (partly because iteration i will in general correspond to quite different logL levels).

joshspeagle commented 3 years ago

I think the fact that the variance of log_Z_i actually decreases with iterations shows that the formalism presented in appendix A* is not quite correct. You can't really think of logZ as some RV that keeps accumulating noise.

Yes, exactly. I hope I mention that explicitly in the text since it's a strong (and incorrect) assumption that I only use to derive the simple approximation.

The compute_integrals that is based on the existing dynesty code is somewhat off.

It's off by the added factor of 2, which is expected here. Once you remove those the results agree.

segasai commented 3 years ago

Regarding this factor of two and section A6.2 of the paper: I'm not sure I fully follow it, but doesn't the test https://colab.research.google.com/drive/1r4LDG4I7CBTieWPrWpG8vOgrhINsZ4UC?usp=sharing invalidate it? I.e., the actual uncertainty doesn't seem to need a factor of two. And my understanding of Skilling's argument was that the accumulated volume error (A6.1) dominates.

joshspeagle commented 3 years ago

See this notebook, which runs through the breakdown in detail. (This was before I included the factor of two by default, IIRC.) The extra factor arises from either dealing with sampling variance (in possible paths), statistical uncertainties in the lnX, or both. The notebook validates the results against repeat runs.

segasai commented 3 years ago

Yes, I've seen that notebook, but I feel it has too many moving parts. Okay, I've put my notebook into a script; when calculating sqrt(<(logz-logztrue)^2/logzvar>) I get:

original (with 2): 1.65
new (without 2): 2.34

But when I look at Stddev((logz-mean(logz))/logzvar^0.5):

original (with 2): 0.7
new (without 2): 0.99

So the factor of two makes the errors better in comparison to the true value, but in terms of characterizing the spread across different full reruns the factor of two is incorrect, I think. And if I understand correctly, the error is supposed to measure the random aspect of the calculation, not the bias across multiple reruns (but maybe I'm missing something).

See code below

import numpy as np
import scipy.special
import matplotlib.pyplot as plt
import math

def logvol(r, ndim):
    logv = (ndim / 2.) * np.log(
        np.pi) - scipy.special.gammaln(ndim / 2. + 1) + ndim * np.log(r)
    return logv

def logl(r, ndim, s0):
    return -0.5 * ndim * np.log(
        2 * np.pi) - ndim * np.log(s0) - 0.5 * (r**2) / s0**2

def simmer(ndim=2, logvol0=0, npt=100, niter=10000, s0=1e-2):
    cubestart = True

    if cubestart:
        xs = np.random.uniform(size=(npt, ndim))
        rs = ((xs - .5)**2).sum(axis=1)**.5
    else:
        rs = np.random.uniform(size=npt)**(1. / ndim) * .5
    # sphere inside the cube
    logls = logl(rs, ndim, s0).tolist()
    logvols = logvol(rs, ndim).tolist()
    rs = rs.tolist()
    retlogls = []
    retlogvols = []
    retrs = []
    for i in range(niter):
        ind = np.argmin(logls)

        retlogls.append(logls[ind])
        retlogvols.append(logvols[ind])
        retrs.append(rs[ind])
        # print(logls[ind])
        rworst = rs[ind]
        # r^(ndim-1) distribution in for uniform
        rnew = rworst * np.random.uniform()**(1. / ndim)
        loglnew = logl(rnew, ndim, s0)
        logvolnew = logvol(rnew, ndim)
        del logls[ind]
        del rs[ind]
        del logvols[ind]

        rs.append(rnew)
        logvols.append(logvolnew)
        logls.append(loglnew)

    retrs, retlogls, retlogvols, = [
        np.array(_) for _ in [retrs, retlogls, retlogvols]
    ]
    return retrs, retlogls, retlogvols

def domany(ntries=100, niter=2000, ndim=10, npt=100):
    rs, ls, vs = [], [], []
    for i in range(ntries):
        r, logl, logv = simmer(niter=niter, ndim=ndim, npt=npt)
        rs.append(r)
        ls.append(logl)
        vs.append(logv)
    return [np.array(_) for _ in [rs, ls, vs]]

npt = 100
ndim = 10
nit = 10000
ntries = 1000

r, l, v = simmer(niter=nit, ndim=ndim, npt=npt)
rs, ls, vs = domany(ntries=ntries, niter=nit, ndim=ndim, npt=npt)

def newcompute_integrals(logl=None, logvol=None):
    # Compute weights using quadratic estimator.
    loglstar_pad = np.concatenate([[-1.e300], logl])

    # we want log(exp(logvol_i)-exp(logvol_(i+1)))
    # assuming that logvol0 = 0
    # log(exp(LV_{i})-exp(LV_{i+1})) =
    # = LV{i} + log(1-exp(LV_{i+1}-LV{i}))
    # = LV_{i+1} - (LV_{i+1} -LV_i) + log(1-exp(LV_{i+1}-LV{i}))
    dlogvol = np.diff(logvol, prepend=0)
    logdvol = logvol - dlogvol + np.log1p(-np.exp(dlogvol))

    # logdvol is log(delta(volumes)), i.e. log(X_i - X_{i+1})
    logdvol2 = logdvol + math.log(0.5)
    # These are log((X_i - X_{i+1})/2)

    dlogvol = -np.diff(logvol, prepend=0)
    # these are the delta(log(volumes)) of the run

    # These are log((L_i + L_{i+1}) * (X_i - X_{i+1}) / 2)
    saved_logwt = np.logaddexp(loglstar_pad[1:], loglstar_pad[:-1]) + logdvol2
    saved_logz = np.logaddexp.accumulate(saved_logwt)
    # This implements eqn 16 of Speagle2020

    logzmax = saved_logz[-1]
    # we'll need that to just normalize likelihoods to avoid overflows

    # H is defined as
    # H = 1/z int( L * ln(L) dX,X=0..1) - ln(z)
    # incomplete H can be defined as
    # H = int( L/Z * ln(L) dX,X=0..x) - z_x/Z * ln(Z)
    h_part1 = np.cumsum(
        (np.exp(loglstar_pad[1:] - logzmax + logdvol2) * loglstar_pad[1:] +
         np.exp(loglstar_pad[:-1] - logzmax + logdvol2) * loglstar_pad[:-1]))
    # here we divide the likelihood by zmax to avoid overflow
    saved_h = h_part1 - logzmax * np.exp(saved_logz - logzmax)
    # changes in h in each step
    dh = np.diff(saved_h, prepend=0)

    saved_logzvar = np.cumsum(dh * dlogvol)
    return saved_logwt, saved_logz, saved_logzvar, saved_h

import dynesty

logvolfake = -1. / npt * np.arange(nit)

logzs = np.zeros(ntries)
logzvo = np.zeros(ntries)
logzvn = np.zeros(ntries)

for i in range(ntries):
    _, logz, logzvarold, _ = dynesty.utils.compute_integrals(logl=ls[i],
                                                             logvol=logvolfake)
    _, logz, logzvarnew, _ = newcompute_integrals(logl=ls[i],
                                                  logvol=logvolfake)
    logzs[i] = logz[-1]
    logzvo[i] = logzvarold[-1]
    logzvn[i] = logzvarnew[-1]

print('original', np.mean((logzs**2 / np.sqrt(logzvo))**2)**.5)
print('new', np.mean((logzs**2 / np.sqrt(logzvn))**2)**.5)
print('stds', np.std(logzs))
print('old med err', np.median(logzvo)**.5)
print('new med err', np.median(logzvn)**.5)
print('original ', np.std((logzs-np.mean(logzs))/np.sqrt(logzvo)))
print('new', np.std((logzs-np.mean(logzs))/np.sqrt(logzvn)))

joshspeagle commented 3 years ago

And if I understand correctly the error is supposed to be measuring the random aspect of the calculation, not the bias across multiple reruns.

Yes, so this is essentially what things come down to. The procedure to generate an estimate of logZ involves:

1) generate a sequence of logLs
2) associate a set of logXs with them
3) compute the integral

(3) just involves standard numerical integration errors. (2) involves random statistical errors because we only have a random estimator for logX (although we know its distribution and by default use the geometric/arithmetic mean). (1) involves sampling variance, since you can generate different sequences of different lengths (and different spacings). Skilling's argument for the uncertainty follows from (1), but you can also derive similar estimates following (2). Hence the rough argument that the two sources of uncertainty are comparable, giving the factor of ~2.

Ed Higson also has a nice paper looking into the sources of these dual sources of uncertainty (which also includes a few references to other studies on the topic).

Ultimately, the extra factor of 2 is just meant to be an approximation to the "true" uncertainty (in a frequentist sense); the real estimates should be generated using many realizations of simulate_run (following the strategy outlined in Higson's paper, more or less).
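
For concreteness, a minimal sketch of the "many realizations of simulate_run" strategy (assuming res is the results object from a finished dynesty run):

import numpy as np
from dynesty import utils as dyfunc

logzs = []
for _ in range(100):
    res_sim = dyfunc.simulate_run(res)  # resamples the run and simulates the log-volumes
    logzs.append(res_sim.logz[-1])
print(np.mean(logzs), np.std(logzs))    # empirical logZ estimate and its scatter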

segasai commented 3 years ago

Okay, I think I look at the problem differently.

In my mind the only random process is the generation of the V_i -- these are the true volumes, and they are generated from the known Skilling random process. The likelihood values L_i are just functions of those, L_i = L(V_i). But we do not observe the V_i; we substitute them with X_i = exp(-i/M). So the main cause of uncertainty is this random unobserved vector {V_i} (it is only observed through the L_i).

So I don't really see the difference between 1 and 2.
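
For reference, the Skilling shrinkage process referred to here, for K live points (standard results):

t_i \equiv \frac{V_i}{V_{i-1}}, \qquad p(t_i) = K\, t_i^{K-1}, \qquad \mathrm{E}[\ln t_i] = -\frac{1}{K}, \qquad \mathrm{Var}[\ln t_i] = \frac{1}{K^2},

so E[ln V_i] = -i/K, which is the X_i = exp(-i/M) substitution above (with M = K live points).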

joshspeagle commented 3 years ago

So in the example notebook, you can actually see this when you compare between the six cases:

Approx.: -8.670 +/- 0.207 [original errors (no x2 variance); uses E[logX] values]
Sim.: -8.696 +/- 0.192 [samples logX values]
Resamp.: -8.676 +/- 0.180 [bootstrap resampling; uses E[logX] values]
Rep. (mean): -8.912 +/- 0.211 [250 repeated runs; uses E[logX] values]
Comb.: -8.699 +/- 0.262 [bootstrap resampling + samples logX values]
Rep. (sim.): -8.946 +/- 0.289 [250 repeated runs; samples logX values]

You're essentially comparing Approx. to Rep. (mean) and finding (as I did) that the estimate of the uncertainty is very consistent! So we're in perfect agreement on what you're finding.

The difference comes when you look a little deeper at the estimator. As described in A28 and A29 of the dynesty paper, what we're doing is substituting in the mean of E[logX] (or E[X]) when computing our logZ estimate. This means we're computing E[logZ], more or less. So to simulate the error in logZ, what we should do is generate a realization of the true underlying X's from the (known) true underlying distribution(s), then use those in the integral to compute a new estimate of logZ. Do this a bunch of times and we can in theory empirically determine Var[logZ] (and confirm the validity of our E[logZ] estimate). This is the Sim. result, which I find is consistent with Approx.. This I think tracks with how you're looking at the problem.

Of course, another way to check the validity of the estimate is just to do a bunch of repeat trials: run the estimator a bunch of times, and then empirically verify its variance. This is what is shown in Rep. (mean), and is what you've done by using logvolfake = -1. / npt * np.arange(nit) for all the simulated runs. This, however, isn't an estimate of Var[logZ] directly -- since our estimator for each run is E[logZ], it's instead measuring Var[E[logZ]].

To remedy this, we can again just replace our previous estimates of E[logX] for each run with some sampled logX values, and then compute a corresponding logZ value (instead of E[logZ]). Then we can combine those estimates to compute Var[logZ] from across the runs. This is the Rep. (sim.) result. As you might expect, given that this leads to variation in the logZ for each run (based on the Sim. results), that we've added together variance in the estimated means (from the Rep. (mean) results), and that the two sources of error are somewhat independent and similar in value, the final error is ~sqrt(2) times bigger and can be simulated by combining the two processes together (Comb.).

Does that make sense? Or is there still disagreement about these results and how we should be reporting errors? (There are arguments to leaving out the factor of 2, which is why I originally did not include it in the first releases IIRC.)

segasai commented 3 years ago

Yes, I think we are on the same page here more or less.

Assume that F(V, L(V)) is our estimator, which takes a vector of volumes and a vector of likelihoods (V is an RV). In practice we evaluate it as F(X, L(V)), since we don't have access to V directly (X are the approximate volumes; X is not even an RV, it's just a constant vector, X = E[V]). The uncertainty from Skilling's estimate provides a good estimate of Var[F(X, L(V))]. But what we care about is E[(F(X, L(V)) - F_true)^2], which is equal to Var[F(X, L(V))] + (E[F(X, L(V))] - F_true)^2, i.e. the sum of the variance and the squared bias. And that's why, when we just use Skilling's formula, we underestimate the uncertainty: we are missing the bias term.
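
Written out, this is just the usual mean-squared-error decomposition:

\mathrm{E}\big[(F(X, L(V)) - F_{\mathrm{true}})^2\big] \;=\; \mathrm{Var}\big[F(X, L(V))\big] \;+\; \big(\mathrm{E}[F(X, L(V))] - F_{\mathrm{true}}\big)^2 .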

I am not sure it quite makes sense to include the bias in the final error bar, but it is not unreasonable. It seems possible to me to just find this bias and correct for it (I need to think a bit about it).

joshspeagle commented 3 years ago

As my old adviser would say, it looks like we are in "violent agreement". 🙂

And yes -- you've essentially rehashed the original reason why I didn't include that term. The consensus I got was that people preferred to have the "total" error reported, so I ended up adding it in by default rather than just warning people about it. Maybe the best solution is to just add an arg that doubles the variance when grabbing the results; otherwise, we just ignore this internally. I think that is also conceptually (and internally) cleaner.

If the updated calculations also work just fine for the dynamic nested sampling case where the spacing between samples can be variable, then I'm happy to look into merging this in. Does that seem reasonable to you, or would you prefer to just remove some of these things altogether?

segasai commented 3 years ago

Great.

I don't think there are any outstanding issues. I've fixed a couple of loose ends: I've replaced compute_integrals with my vectorized calculation and used compute_integrals in a couple of other places to avoid duplicate code, and I've also moved a single step of integration into a separate function. To deal with the factor of one vs. two, I've added the option bias_var, which uses the factor of two if true and does not if false. At the moment I have not turned that option on anywhere (there are 8 locations where the calculation is done and where the decision needs to be made); we can just make the default value True... I think this is ready to be reviewed and merged (to avoid further accumulation of stuff in this branch).

segasai commented 3 years ago

I revoke my agreement on the bias term and factor of two.

It turns out that the previous bias I saw in the toy case
https://colab.research.google.com/drive/1RW5VwlizfdsnU7cA-TdiWHKxSLvto1z7?usp=sharing

was caused by imperfect sampling at the beginning of the run (when the sphere intersects the boundary of the cube). After addressing that, the bias essentially disappears. I.e., I'm making the claim that if I have a sample that perfectly satisfies the nested sampling conditions (i.e., independent/uniform samples subject to the L > Lx condition), then the estimator of the variance does not need the extra factor of two.

And in fact, when I sample a Gaussian with dynesty using either (uniform sampler + bootstrap=5) or (rslice with a large number of steps), the bias is indeed tiny/negligible (~1/6 of the stddev).

import dynesty
import numpy as np

nlive = 100
s0 = 0.01
ndim = 10


def loglike(x):
    ndim = len(x)
    return -0.5 * ndim * np.log(2 * np.pi) - ndim * np.log(s0) - 0.5 * np.sum(
        (x - 0.5)**2) / s0**2


def prior_transform(x):
    return x


rstate = dynesty.utils.get_random_generator(3)

nit = 100
logzs = np.zeros(nit)
for i in range(nit):
    sampler = dynesty.NestedSampler(loglike,
                                    prior_transform,
                                    ndim,
                                    nlive=nlive,
                                    rstate=rstate,
                                    bound='single',
                                    sample='unif',
                                    # sample='unif', bootstrap=5,
                                    # sample='rslice', slices=50,
                                    )
    sampler.run_nested(print_progress=False)
    print('x')  # progress marker
    logzs[i] = sampler.results.logz[-1]
print(logzs.mean(), logzs.std())

On the contrary, if I just sample with the default parameters (unif sampler without bootstrap, or rslice with the default params), I get a bias of ~1 stddev. But because the bias disappears when proper sampling is ensured, this shows that the problem is not really with our estimator and with substituting the volumes by their expectations. Basically, we have a bias because our sampling is biased (the samples are either not independent or not uniformly distributed within the L > Lx volume).

Therefore I think the factor of two in the error calculation is incorrect and is not needed. But what may be needed is a default bootstrap when using the 'uniform' sampler, or an enlarge factor that depends on the number of dimensions and of live points, or changed defaults (as done in https://github.com/joshspeagle/dynesty/issues/289).

joshspeagle commented 3 years ago

I can look into building off of your example code to compare, but just to make sure we're all coming at this from the same place:

My (possibly flawed) understanding is just based on the law of total variance:

Var[Y]= E[Var[Y | X]] + Var[E[Y | X]]

In this general notation, I think of X being some particular realization of the samples and Y as being our estimate of the marginal likelihood (or possibly any other function of the samples).

The default NS estimator for the likelihood computes E[Y | X]: conditioned on a set of samples X, you're computing the expectation value of Y (via the expectation values of the log-volume estimates). So running a bunch of repeat trials then gives the Var[E[Y | X]] term.

Var[Y | X] is based on the uncertainties in the log-volume estimates. These can be simulated directly for a single run and through repeat runs, we can compute the expectation E[Var[Y | X]] (which should be similar to that of a single run). These terms are of similar amplitude, hence the factor of 2.

I think the key difference is what exactly Var[Y] means. Using the same notation as above, this is the variance in the estimate Y marginalized over all possible "true" NS realizations X (i.e. all possible samples + log-volume realizations). However, dynesty returns E[Y | X] by default as an estimator. Therefore, Var[E[Y | X]] might be what you actually care about, since that's the variance in the estimator rather than in the underlying quantity (i.e. over all possible samples with expected log-volume values).
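
A toy numerical check of that decomposition (nothing dynesty-specific; the variables here are made up purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
n = 200000
x = rng.normal(size=n)                     # a "realization of the samples" X
y = x + rng.normal(scale=0.5, size=n)      # Y | X ~ N(X, 0.5**2)

var_total = y.var()                        # Var[Y]
e_var_cond = 0.5**2                        # E[Var[Y | X]] (constant here)
var_e_cond = x.var()                       # Var[E[Y | X]] = Var[X]
print(var_total, e_var_cond + var_e_cond)  # both should be ~1.25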

What are your thoughts?

segasai commented 3 years ago

I agree with the law of total variance but disagree with its application and terminology. For example I disagree that the NS estimator computes E[Y|X].

To answer, let me first define the terminology very strictly.

Now, what I claim:

1) A single run of nested sampling just computes the estimator Y; it doesn't compute any expectation whatsoever. Specifically, it is not E[Y|X], because we do not have access to X. You can call it Y|L, that is fair. And in fact E[Y|L] = Y|L, because it's just a deterministic function. If you want, you can apply the law of total variance here, Var[Y] = E[Var[Y|L]] + Var[E[Y|L]], but because Var[Y|L] is zero (it's deterministic), this is kind of moot.

2) The Y1 = F(L(X), X) estimator has almost no bias and no variance (this is an empirical observation: if your likelihoods match the volumes, the randomness is tiny and the bias is limited by the integration procedure).

3) Similarly, the Y2 = F(L(V), V) estimator has almost no bias and no variance.

Now we care about E[Y-Y0] (bias), E[(Y-Y0)^2] (mean squared deviation), and Var[Y].

And indeed the mean squared deviation E[(Y-Y0)^2] = Var[Y] + (E[Y-Y0])^2. The repeated runs tell us Var[Y], and E[(Y-Y0)^2] will be equal to Var[Y] only if there is no bias. What my test seems to show is that E[Y-Y0] is small (if the sampling is perfect). This also shows me that E[F(L,V)] = E[F(L,X)] (or that they are close enough).

So Var[Y] is a correct description of E[(Y-Y0)^2].

Finally, I think there is a lot of confusion around because of the usage of expected log-volumes; somehow this 'expectation' makes things hard to understand, but in reality it's just a constant vector that we plug into the estimator.

segasai commented 3 years ago

Just one more remark.

It's certainly possible there are cases where the estimator Y is biased (it just doesn't seem to be the case for a <=10-dim Gaussian and perfect sampling). In the case of bias, I'd say it may be useful to look at estimators like Y3 = F(L0, X), where L0 is the fixed likelihood vector from our first (and only) run and X are random resamples of the volumes from Skilling's distributions. And maybe E[Y3] - Y can be useful to correct the bias.

joshspeagle commented 3 years ago

That seems reasonable. So by this logic, I think the recommendation should be that users run either jitter_run or resample_run to derive uncertainties (which either simulate the volumes or simulate many repeat runs), but not both. And the second option should be the preferred one, since it also helps capture sampling variation when computing parameter estimates. (Both should give ~identical answers for logZ.)
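
A rough usage sketch of the two options being recommended here (assuming res is the results object from a finished dynesty run; the error bars come from the scatter over realizations):

import numpy as np
from dynesty import utils as dyfunc

logz_jitter = [dyfunc.jitter_run(res).logz[-1] for _ in range(100)]    # simulate the volumes
logz_resamp = [dyfunc.resample_run(res).logz[-1] for _ in range(100)]  # bootstrap over the run
print(np.std(logz_jitter), np.std(logz_resamp))  # should give comparable scatter for logZ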

In that case, I think the following changes can be made:

segasai commented 3 years ago

A few comments. First, the calculation that I considered only applies to the logz calculation, not to the posterior; I haven't yet thought about the latter. I don't quite know if there are any differences, because the posterior introduces a new set of RVs (the parameter values). TBH, I don't yet have a full understanding of what resample_run does, but it looks like it's more of a bootstrap, so it makes sense to use that to probe posterior/parameter inferences.

joshspeagle commented 3 years ago

I don't quite know if there are any differences. Because the posterior introduces a new set of RVs (the parameter values)

Higson et al. (2018) has a good discussion of the differences between the two estimators. The upshot is they should both be equivalent for logZ, but jitter will underestimate the uncertainties for functions of parameter values.

I don't yet have a full understanding what resample_run does, but it looks it's more like bootstrap

That's exactly what it is, with some small modifications to account for the dynamic nested sampling case (where it implements a stratified bootstrap, since runs that start from the prior with X_init=1 are qualitatively different than those that start interior to the prior with X_init < 1.)


Overall, sounds like a plan. In addition to the above changes, the only other thing I need to verify before merging everything in is that the integration estimate is also stable for dynamic nested sampling runs where the number of live points varies. To first order, this is the same as just letting K_i vary so the estimator becomes 1/K_i, although there are some slight nuances when K_{i+1} < K_i (mentioned in the dynesty appendix and dealt with in the code if approx=False).
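
Writing that first-order approximation out, with K_i live points at iteration i:

\mathrm{E}[\Delta \ln X_i] \approx -\frac{1}{K_i}, \qquad \ln X_i \approx -\sum_{j \le i} \frac{1}{K_j}.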

segasai commented 3 years ago

Overall, sounds like a plan. In addition to the above changes, the only other thing I need to verify before merging everything in is that the integration estimate is also stable for dynamic nested sampling runs where the number of live points varies. To first order, this is the same as just letting K_i vary so the estimator becomes 1/K_i, although there are some slight nuances when K_{i+1} < K_i (mentioned in the dynesty appendix and dealt with in the code if approx=False).

I think the current code in the branch is pretty much what was used in jitter_run before (https://github.com/joshspeagle/dynesty/blob/243025cbfc5f1941e9dc1b00348c54f8b877bd6c/py/dynesty/utils.py#L518), so there are no logical changes there, I think. (I didn't put my new derivation of the variance in there, so I'm not quite sure which nuances you are talking about.) (Although the pull request is big.)

And I agree the kld_error/resample/jitter stuff should be dealt with separately, ideally with some kind of validation/test.

joshspeagle commented 3 years ago

Perfect. I've just compared side-by-side and that looks good, although I'll do another cross-check as I go through the individual commits on the PR. Let me know when you're happy with the error estimation changes and I can look into starting the review.

And as always, thanks for all your effort and time on this! 🙂

segasai commented 3 years ago

Perfect. I've just compared side-by-side and that looks good, although I'll do another cross-check as I go through the individual commits on the PR. Let me know when you're happy with the error estimation changes and I can look into starting the review.

I've got rid of the factor-of-two option, so everything is ready for review.

And as always, thanks for all your effort and time on this!

Sure