revbayes / revbayes

Bayesian Phylogenetic Inference Using Graphical Models and an Interactive Model-Specification Language
http://revbayes.com
GNU General Public License v3.0

Probabilities greater than 1 for empirical taxon sampling with dnEpisodicBirthDeath #382

Open Jeremy-Andreoletti opened 1 year ago

Jeremy-Andreoletti commented 1 year ago

Describe the bug
The code for empirical taxon sampling (clade-specific rho) with dnEpisodicBirthDeath throws an unexpected exception: "Problem in computing the probability of missing species in BDP."

To Reproduce
From the tutorial Diversification Rate Estimation with Missing Taxa, download mcmc_EBD.Rev and primates_tree.nex.

Then, follow the tutorial instructions in the "Empirical Taxon Sampling" section to modify mcmc_EBD.Rev (by the way, there is a typo in the tutorial: "Varecia_variegata_variegata" should be "Varecia_variegata"), or directly use mcmc_EBD_CSrho.Rev.txt.

Run mcmc_EBD_CSrho.Rev.

Expected behavior
The MCMC runs.

Computer info
macOS Ventura 13.5.2 (Apple M1 Pro), RevBayes version 1.2.1; I also tried with the development branch.

Additional context
The error arises in the computeLnProbabilityTimes function (in BirthDeathProcess.cpp).

for (size_t i=0; i<incomplete_clades.size(); ++i)
{
    // ...

    double p_0_T = 1.0 - pSurvival(0,present_time,1.0) * exp( rateIntegral(0,present_time) );
    double p_0_t = 1.0 - pSurvival(last_event_time,present_time,1.0) * exp( rateIntegral(last_event_time,present_time) );
    double log_F_t = log(p_0_t) - log(p_0_T);

    if ( log_F_t > 0.0 )
    {
        throw RbException("Problem in computing the probability of missing species in BDP.");
    }

    // ...
}

I've observed that the pSurvival values can sometimes be very low (down to 1e-300) and exp(rateIntegral) very high (up to exp(600)), so I suppose this can lead to numerical instability when p_0_T and p_0_t are expected to be very close.
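
As a toy illustration of the suspected failure mode (made-up numbers, not values taken from RevBayes): when the true ratio p_0_t / p_0_T is within roughly 1e-13 of 1, rounding errors of that size in either quantity are enough to push log_F_t above 0 and trigger the exception.

#include <cmath>
#include <cstdio>

int main()
{
    // Hypothetical values: the true quantities should satisfy p_0_t <= p_0_T,
    // but rounding errors of order 1e-13 have nudged them apart.
    double p_0_t = 0.194420130674060;  // rounded slightly up
    double p_0_T = 0.194420130674030;  // rounded slightly down

    double log_F_t = std::log(p_0_t) - std::log(p_0_T);
    std::printf("log_F_t = %.17g\n", log_F_t);  // small positive number, so the exception is thrown
    return 0;
}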

The calculations are based on "Inferring Speciation and Extinction Rates under Different Sampling Schemes" and "Likelihood Inference of Non-Constant Diversification Rates with Incomplete Taxon Sampling" (@hoehna).

bredelings commented 1 year ago

Thanks for the analysis. I am not familiar with this code, but I can see how it could be unstable. You could get an even better idea of what is going on by printing the values of the variables when something goes wrong. For example:

    throw RbException()<<"Problem in computing the probability of missing species in BDP.    log_F_t = "<<log_F_t;

bredelings commented 1 year ago

For the issue that you mentioned, one improvement could be to replace:

    double p_0_T = 1.0 - pSurvival(0,present_time,1.0) * exp( rateIntegral(0,present_time) );

with

    double p_0_T = 1.0 - exp( lnProbSurvival(0,present_time,1.0) + rateIntegral(0,present_time) );

I think there is a bug in lnProbSurvival where it does return 0.0;, but it should do return log(0.0);

You would then make the equivalent change for p_0_t. I'm not sure if this will work or not -- ultimately lnProbSurvival( ) calls computeProbabilitySurvival(start, end) which is not on the log scale.

bredelings commented 1 year ago

You could then replace 1.0 - exp( thing ) with -expm1( thing ). That will help numerical stability when thing is very small.
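
As a standalone illustration (not RevBayes code), with a made-up value of thing very close to zero:

#include <cmath>
#include <cstdio>

int main()
{
    double thing = -1e-15;  // hypothetical value where exp(thing) is extremely close to 1

    double naive  = 1.0 - std::exp(thing);  // cancellation near 1.0 loses most significant digits
    double stable = -std::expm1(thing);     // expm1 stays accurate for arguments near 0

    std::printf("1.0 - exp(thing) = %.17g\n", naive);
    std::printf("-expm1(thing)    = %.17g\n", stable);
    return 0;
}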

Jeremy-Andreoletti commented 1 year ago

Thanks for the advice. I was optimistic that it might work, but it doesn't seem to be enough.

double p_0_t = -expm1(lnProbSurvival(last_event_time,present_time,1.0) + rateIntegral(last_event_time,present_time));
double p_0_T = -expm1(lnProbSurvival(0,present_time,1.0) + rateIntegral(0,present_time));
double log_F_t = log(p_0_t) - log(p_0_T);

if ( log_F_t > 0.0 )
{
        std::cout << std::setprecision (20) << "\npSurvival(last_event_time,present_time,1.0) = " << pSurvival(last_event_time,present_time,1.0) << "\nrateIntegral(last_event_time,present_time) = " << rateIntegral(last_event_time,present_time) << "\npSurvival(0,present_time,1.0) = " << pSurvival(0,present_time,1.0) << "\nrateIntegral(0,present_time) = " << rateIntegral(0,present_time);
        std::cout << "\np_0_t = " << p_0_t << "\np_0_T = " << p_0_T << "\nlog_F_t = " << log_F_t << "\npresent_time = " << present_time << "\nlast_event_time = " << last_event_time;
        throw RbException("Problem in computing the probability of missing species in BDP.");
}

I get the following values:

pSurvival(last_event_time,present_time,1.0) = 3.1140446871580990324e-30
rateIntegral(last_event_time,present_time) = 67.725437439794902161
pSurvival(0,present_time,1.0) = 6.4499208327301177728e-118
rateIntegral(0,present_time) = 269.62478019024678133
p_0_t = 0.19442013067406008209
p_0_T = 0.19442013067402572069
log_F_t = 1.7674750552032492124e-13
present_time = 65.091685983700003248
last_event_time = 48.750356713000002173
Error: Problem in computing the probability of missing species in BDP.

bredelings commented 1 year ago

Hmm... so p_0_t and p_0_T are really close. That's good to know. It looks like p_0_T < p_0_t, but this is not supposed to happen. The difference is at the 13th decimal place, which is close to the limit on precision for doubles.

I'm guessing that

lnProbSurvival(0,present_time,1.0) = lnProbSurvival(0,last_event_time,1.0) + lnProbSurvival(last_event_time,present_time,1.0)
rateIntegral(0,present_time) = rateIntegral(0,last_event_time) + rateIntegral(last_event_time,present_time)

If that is true, then we could write something like:

double A = lnProbSurvival(last_event_time,present_time,1.0) + rateIntegral(last_event_time,present_time);
double B = lnProbSurvival(0,last_event_time,1.0) + rateIntegral(0,last_event_time);
double p_0_t = -expm1(A);
double p_0_T = -expm1(A+B);
double log_F_t = log(p_0_t) - log(p_0_T);

It might then be possible to express log_F_t as a Taylor series in terms of B when B is really small.
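
For what it's worth, a first-order sketch (assuming the decomposition above holds, with A < 0 fixed and B small):

$$\log F_t = \log\bigl(-\operatorname{expm1}(A)\bigr) - \log\bigl(-\operatorname{expm1}(A+B)\bigr) = \frac{e^{A}}{1-e^{A}}\,B + O(B^2) = \frac{B}{\operatorname{expm1}(-A)} + O(B^2)$$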

Also, I suspect that we need to have A<0 and B<0.

What do you think? If that checks out, then I have an idea. If it does not check out, then I have a different idea.

Jeremy-Andreoletti commented 1 year ago

I tried the trick you suggested, but the decomposition only seems to work for rateIntegral, not for lnProbSurvival. So -expm1(A+B) ends up being quite different from p_0_T. Now I'm curious to know what idea you have in mind for this case.

bredelings commented 1 year ago

Drat. I guess that means that the probability of surviving from last_event_time to present_time is not independent of the event of surviving from time 0 to last_event_time.

OK, so since that didn't work, one possibility is to say that if the difference between p_0_t and p_0_T is really small, then it should be truncated to 0, since the difference could just be numerical error. So, something like:

    if ( log_F_t > 0.0 )
    {
       if ( log_F_t < 1.0e-11 )
       {
           p_0_T = p_0_t;
           log_F_t = 0;
       }
       else
           throw RbException("Problem in computing the probability of missing species in BDP.");
    }

That is obviously kind of a hack.

Really, someone like @hoehna should probably take a look at this.

Jeremy-Andreoletti commented 1 year ago

Yes, I've tried doing something like that. It runs a bit longer, but it always ends up with an error at some point. Sometimes p_0_t or p_0_T becomes so small that it is stored as 0, which disrupts the calculation for good, and you can get log_F_t > 0.1.
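
One way this can happen (a toy illustration with made-up values, not numbers from the actual run): once p_0_T underflows to exactly 0, log(p_0_T) is -inf and log_F_t becomes +inf, so no finite threshold on log_F_t can rescue it.

#include <cmath>
#include <cstdio>

int main()
{
    // Made-up values: p_0_T has underflowed to exactly 0 while p_0_t is still a tiny positive number.
    double p_0_t = 1e-310;  // subnormal, but still > 0
    double p_0_T = 0.0;     // underflowed to zero

    double log_F_t = std::log(p_0_t) - std::log(p_0_T);  // finite - (-inf) = +inf
    std::printf("log_F_t = %g\n", log_F_t);               // prints inf, which is > 0.1
    return 0;
}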

bjoelle commented 1 year ago

Looking at the code, it seems we add m*log_F_t = m*[log(p_0_t) - log(p_0_T)], but then 5 lines later we add m*log(p_0_T) again, so we don't in fact need p_0_T or log_F_t at all, do we? That may solve the issue if it's just a problem with the intermediate calculation.
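
Written out (assuming the later term really contributes m*log(p_0_T) for each clade, or equivalently that the m's sum to the total number of missing taxa), the two contributions cancel:

$$m\,\bigl[\log(p_{0,t}) - \log(p_{0,T})\bigr] + m\,\log(p_{0,T}) = m\,\log(p_{0,t})$$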

Jeremy-Andreoletti commented 1 year ago

Wow, that's unexpected but it seems to work, thanks Joëlle! I was able to remove log_F_t altogether and get the same likelihood with the following code:

// now iterate over the vector of missing species per interval
for (size_t i=0; i<incomplete_clades.size(); ++i)
{
    // We use equation (5) of Hoehna et al.
    // "Inferring Speciation and Extinction Rates under Different Sampling Schemes"
    double last_event_time = root_age - incomplete_clade_ages[i];

    double p_0_t = -expm1(lnProbSurvival(last_event_time,present_time,1.0) + rateIntegral(last_event_time,present_time));

    if ( p_0_t < 0.0 || p_0_t > 1.0 )
    {
        throw RbException("Problem in computing the probability of missing species in BDP.");
    }

    // get an estimate of the actual number of taxa
    int m = incomplete_clades[i].getNumberMissingTaxa();

    // multiply the probability for the missing species
    ln_prob_times += m * log(p_0_t);
}

return ln_prob_times;

bredelings commented 1 year ago

Looking at the code, it seems we add m*log_F_t = m*[log(p_0_t) - log(p_0_T)], but then 5 lines later we add m*log(p_0_T) again

I guess that is correct, although it took a while to see that. I guess (total_species - num_taxa) is equal to the sum of all the m's.

But I'm not sure that this is right:

    if ( p_0_t < 0.0 || p_0_t > 1.0 )
    {
        throw RbException("Problem in computing the probability of missing species in BDP.");
    }

The original criterion throws an error if p_0_t > p_0_T. While it seems like you don't need p_0_T to compute the likelihood, an error here may mean that there is a bug in the code that computes p_0_t and p_0_T.

        // We use equation (5) of Hoehna et al.
        // "Inferring Speciation and Extinction Rates under Different Sampling Schemes"
        double last_event_time = root_age - incomplete_clade_ages[i];

Probably someone should read this paper to figure out what is going on here? It would be helpful to know the meaning of p_0_t and p_0_T, among other things.

Jeremy-Andreoletti commented 1 year ago

The formula appears in this paper: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0084184&type=printable (eq. 6), and is briefly explained at the end of this tutorial: https://revbayes.github.io/tutorials/divrate/sampling.html.

The quantity $F_t = 1 - \frac{1 - P(N(T)>0 \mid N(t)=1)\exp(r(t,T))}{1 - P(N(T)>0 \mid N(t_1)=1)\exp(r(t_1,T))}$ (which corresponds to 1 - F_t in the code, if I've understood correctly) represents the "cumulative probability density of the time of a speciation event", which justifies checking log_F_t > 0.0.
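
If I read the code right, p_0_t and p_0_T are the numerator and denominator of that fraction (with t = last_event_time and t_1 = 0), i.e.

$$p_{0,t} = 1 - P(N(T)>0 \mid N(t)=1)\,e^{r(t,T)}, \qquad p_{0,T} = 1 - P(N(T)>0 \mid N(t_1)=1)\,e^{r(t_1,T)},$$

so a valid value requires $p_{0,t} \le p_{0,T}$, which is exactly what the log_F_t > 0.0 check enforces.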

Nevertheless, I agree that this solution risks masking an error/approximation in the calculation of lnProbSurvival or rateIntegral.