Discrepancy between expected and observed coalescence time in beta coalescent for large values of alpha

Sendrowski commented 4 months ago

Hey!

There appears to be an upward bias in the simulated coalescence times of the beta coalescent. This is relative to my expectation based on the time scaling provided in the documentation. The bias starts to be noticeable from alpha=1.8 and increases substantially for values closer to 2.

Relative error between observed and expected coalescent times plotted against alpha:

Code to generate figure:

import multiprocessing as mp

import matplotlib.pyplot as plt
import msprime
import numpy as np
import scipy

def compute_beta_timescale(pop_size, alpha, ploidy):
    """
    Compute the generation time for the beta coalescent exactly as done in msprime when testing.
    https://github.com/tskit-dev/msprime/blob/804e0361c4f8b5f5051a9fbf411054ee8be3426a/verification.py#L3447
    """

    if ploidy > 1:
        N = pop_size / 2
        m = 2 + np.exp(
            alpha * np.log(2) + (1 - alpha) * np.log(3) - np.log(alpha - 1)
        )
    else:
        N = pop_size
        m = 1 + np.exp((1 - alpha) * np.log(2) - np.log(alpha - 1))
    ret = np.exp(
        alpha * np.log(m)
        + (alpha - 1) * np.log(N)
        - np.log(alpha)
        - scipy.special.betaln(2 - alpha, alpha)
    )
    return ret

def simulate_tree_height(alpha):
    """
    Simulate genealogies under the beta coalescent and compute the average tree height.
    Since we only have two lineages, the coalescence time should coincide with `compute_beta_timescale ()`.
    """

    # simulate genealogies under the beta coalescent
    g = msprime.sim_ancestry(
        samples=2,
        num_replicates=100000,
        model=msprime.BetaCoalescent(alpha=alpha),
        ploidy=1
    )

    return np.mean([ts.first().total_branch_length / 2 for ts in g])

if __name__ == "__main__":
    alphas = np.linspace(1.99, 1.999, 30)

    # simulate average tree heights in parallel
    with mp.Pool() as pool:
        heights_observed = np.array(pool.map(simulate_tree_height, alphas))

    # compute theoretical tree heights
    heights_theoretical = np.array([compute_beta_timescale(1, alpha, 1) for alpha in alphas])

    # compute relative difference between observed and theoretical
    diff_rel = (heights_observed - heights_theoretical) / heights_theoretical

    # plot relative difference against alpha
    plt.plot(alphas, diff_rel)
    plt.margins(0)
    plt.xlabel('alpha')
    plt.ylabel('diff_rel')
    plt.show()

Am I perhaps missing something?

jeromekelleher commented 4 months ago

Thanks for the report @Sendrowski - the figure seems to have gone missing, would you mind posting again please?

Sendrowski commented 4 months ago

Hm that’s interesting. Did it work this time?

jeromekelleher commented 4 months ago

No - maybe a problem with GitHub?

Sendrowski commented 4 months ago

Possibly — I managed to view it with an independent device. Here the link to the image if that is of any help.

jeromekelleher commented 4 months ago

Sorry, still not showing up for me (I got a 404 on that link)

Sendrowski commented 4 months ago

And here is an iCloud link. Hope it works :)

jeromekelleher commented 4 months ago

Screenshot from 2024-02-05 15-08-50

jeromekelleher commented 4 months ago

Any thoughts here @JereKoskela ?

JereKoskela commented 4 months ago

I ran a C++ script simulating Beta(2-a, a) random variables for a very close to 2, and relative errors of 5 or 6 per cent in an empirical mean from 10k samples are common because the true mean is so small. I think that explains what is going on below alpha = 1.995 or so.

As for the spike very close to 2, the timescale of the Schweinsberg Beta-coalescent model is unpleasant. It goes to infinity at alpha = 1 and zero at alpha = 2, so I wouldn't be surprised if numerical issues creep in near those boundaries. At the alpha = 2 boundary in particular, you're essentially sampling 0/0. The practical thing to do is just use the Hudson model, since probability of seeing a multiple merger in the the Beta-coalescent is effectively zero in that regime.

Sendrowski commented 4 months ago

I agree that a relative error of this magnitude might perhaps not seem so surprising (100k replicates in my code example), but it is also important to note that the simulated values are consistently larger than the theoretical ones (I am not taking absolute values). For alpha=1.9, I obtain a relative error of ~2.7% (1e6 replicates, very stable estimate), and for this value we have a (not so extreme) time scaling of about 0.14 relative to the Kingman coalescent. This plot is probably doing a better job (iCloud link in case you still can’t see my images):

JereKoskela commented 4 months ago

You're right @Sendrowski, I missed the lack of a modulus sign, and evidently also one zero on the number of replicates. I'll do some more digging. It looks like a very odd bug; the code just calculates the timescale and multiplies it by an Exp(1) random variable. And somehow the resulting mean does not equal the timescale.

Sendrowski commented 4 months ago

Yes, very odd indeed... The bias also appears to drop somewhat near alpha=1.99 before shooting up again. I hope it won't be very hard to find out. I didn't debug the C code, so perhaps the time scaling is simply different when calculated there. Or perhaps it is caused by including the incomplete beta function although I could not observe that when translating into native Python.

JereKoskela commented 4 months ago

Ok, this turned out to be an insufficiently sensitive polynomial approximation for evaluating (essentially) x^2 / x^2 for very small x. It was using a polynomial approximation for x < 1e-9, which I've now hiked to x < 1e-5. @Sendrowski, your Python script now yields unbiased estimators with errors on both sides of zero.

@jeromekelleher, any comments or are you happy to merge the linked PR? I've rerun the tests in verification.py for the Beta-coalescent and all still look good.

jeromekelleher commented 4 months ago

Thanks for sorting this out so quickly @JereKoskela! Change LGTM.

@Sendrowski, are you happy with the changes? Shall we push out a bugfix release?

Sendrowski commented 4 months ago

Also looks good to me! Thank you for the quick fix @JereKoskela, and I look forward to meeting you in Warwick in April. As for the release, it’s totally up to you how soon you would like to push it out.

Sendrowski commented 4 months ago

Hey again. The bias due to numerical imprecision for values of alpha very close to 2 persists, however. Maybe one should prohibit values greater than 1.99 or so to make sure this won't be causing any problems.

jeromekelleher commented 4 months ago

Sorry, still can't see the image @Sendrowski . I think maybe you're uploading a tiff instead of a png or something?

Sendrowski commented 4 months ago

Hm it should be a png, but the graph I enclosed actually looks identical to the one I originally posted.

msprime_bug

JereKoskela commented 4 months ago

Yeah, the timescale will always break close to alpha = 1 or 2. There is no way to avoid that. I suppose we could disallow values very close to the boundaries as a result. I'd like to make sure that 1.01 and 1.99 work even if they come from floating point calculations in some other script, so maybe 1.009 and 1.991?

Sendrowski commented 4 months ago

Sounds good! Of note perhaps that I could not observe the same kind of numerical instability for values near 1 — at least for similar distances to the boundary.

JereKoskela commented 4 months ago

Thanks, that's useful to know. It's possible that we can get much closer to 1 than 2, but eventually everything will be infinite for alpha too close to 1. I'll see if I can find a reasonable lower bound.

jeromekelleher commented 4 months ago

Shall we reopen this issue, or make a new one?

JereKoskela commented 4 months ago

I've reopened this one. I'm hoping to get to this within a few days.

tskit-dev / msprime

Discrepancy between expected and observed coalescence time in beta coalescent for large values of alpha #2256