Jeremy-Andreoletti opened this issue 1 year ago
Thanks for the analysis. I am not familiar with this code, but I can see how it could be unstable. You could get an even better idea of what is going on by printing the values of the variables when something goes wrong. For example:
throw RbException()<<"Problem in computing the probability of missing species in BDP. log_F_t = "<<log_F_t;
For the issue that you mentioned, one improvement could be to replace:
double p_0_T = 1.0 - pSurvival(0,present_time,1.0) * exp( rateIntegral(0,present_time) );
with
double p_0_T = 1.0 - exp( lnProbSurvival(0,present_time,1.0) + rateIntegral(0,present_time) );
I think there is a bug in lnProbSurvival where it does return 0.0; but it should instead do return log(0.0);
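For illustration only (a hypothetical simplified function, not the actual RevBayes code): in a function that returns a log-probability, returning 0.0 means claiming probability 1, whereas the intended value here is probability 0, i.e. log(0.0) = -infinity.
double lnProbSurvivalSketch(double prob)
{
    if ( prob <= 0.0 )
    {
        // returning 0.0 here would mean probability 1 on the log scale;
        // the intended value is probability 0, whose log is -infinity
        return log(0.0);
    }
    return log(prob);
}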
You would then make the equivalent change for p_0_t. I'm not sure if this will work or not -- ultimately lnProbSurvival( ) calls computeProbabilitySurvival(start, end), which is not on the log scale.
You could then replace 1.0 - exp( thing ) with -expm1( thing ). That will help numerical stability when thing is very small.
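For example (a small standalone illustration, not RevBayes code), with a value of thing very close to zero:
#include <cmath>
#include <cstdio>

int main( void )
{
    double thing = -1e-18;   // e.g. a log survival probability extremely close to 0

    // exp(thing) rounds to exactly 1.0, so the subtraction loses the small value
    printf( "1.0 - exp(thing) = %g\n", 1.0 - exp(thing) );   // prints 0
    // expm1 computes exp(thing) - 1 directly and keeps it
    printf( "-expm1(thing)    = %g\n", -expm1(thing) );      // prints 1e-18

    return 0;
}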
Thanks for the advice. I was optimistic that it might work, but it doesn't seem to be enough.
double p_0_t = -expm1(lnProbSurvival(last_event_time,present_time,1.0) + rateIntegral(last_event_time,present_time));
double p_0_T = -expm1(lnProbSurvival(0,present_time,1.0) + rateIntegral(0,present_time));
double log_F_t = log(p_0_t) - log(p_0_T);
if ( log_F_t > 0.0 )
{
std::cout << std::setprecision (20) << "\npSurvival(last_event_time,present_time,1.0) = " << pSurvival(last_event_time,present_time,1.0) << "\nrateIntegral(last_event_time,present_time) = " << rateIntegral(last_event_time,present_time) << "\npSurvival(0,present_time,1.0) = " << pSurvival(0,present_time,1.0) << "\nrateIntegral(0,present_time) = " << rateIntegral(0,present_time);
std::cout << "\np_0_t = " << p_0_t << "\np_0_T = " << p_0_T << "\nlog_F_t = " << log_F_t << "\npresent_time = " << present_time << "\nlast_event_time = " << last_event_time;
throw RbException("Problem in computing the probability of missing species in BDP.");
}
I get the following values:
pSurvival(last_event_time,present_time,1.0) = 3.1140446871580990324e-30
rateIntegral(last_event_time,present_time) = 67.725437439794902161
pSurvival(0,present_time,1.0) = 6.4499208327301177728e-118
rateIntegral(0,present_time) = 269.62478019024678133
p_0_t = 0.19442013067406008209
p_0_T = 0.19442013067402572069
log_F_t = 1.7674750552032492124e-13
present_time = 65.091685983700003248
last_event_time = 48.750356713000002173
Error: Problem in computing the probability of missing species in BDP.
Hmm... so p_0_t and p_0_T are really close. That's good to know. It looks like p_0_T < p_0_t, but this is not supposed to happen. The differences are at the 13th decimal place, which is close to the limit on precision for doubles.
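(For a quick sense of how close that is to the double-precision limit:)
#include <cstdio>
#include <limits>

int main( void )
{
    // machine epsilon for double is about 2.22e-16
    printf( "epsilon       = %g\n", std::numeric_limits<double>::epsilon() );
    // the two values reported above differ by only ~3.4e-14
    printf( "p_0_t - p_0_T = %g\n", 0.19442013067406008209 - 0.19442013067402572069 );

    return 0;
}
The relative difference, about 1.8e-13, is only a few hundred times the machine epsilon, so it could plausibly be accumulated rounding error rather than a real difference.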
I'm guessing that
lnProbSurvival(0,present_time,1.0) = lnProbSurvival(0,last_event_time,1.0) + lnProbSurvival(last_event_time,present_time,1.0)
rateIntegral(0,present_time) = rateIntegral(0,last_event_time) + rateIntegral(last_event_time,present_time)
If that is true, then we could write something like:
double A = lnProbSurvival(last_event_time,present_time,1.0) + rateIntegral(last_event_time,present_time);
double B = lnProbSurvival(0,last_event_time,1.0) + rateIntegral(0,last_event_time);
double p_0_t = -expm1(A);
double p_0_T = -expm1(A+B);
double log_F_t = log(p_0_t) - log(p_0_T);
It might then be possible to express log_F_t as a Taylor series in terms of B when B is really small.
Also, I suspect that we need to have A<0 and B<0.
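For what it's worth, if the decomposition held, one way the expansion could go (a sketch only, writing p_0_t = -expm1(A) and p_0_T = -expm1(A+B), and assuming A < 0, B < 0 with |B| small) is

$$
\log F_t = \log\!\left(1 - e^{A}\right) - \log\!\left(1 - e^{A+B}\right)
         = -\log\!\left(1 + \frac{e^{A}\left(1 - e^{B}\right)}{1 - e^{A}}\right)
         \approx \frac{e^{A}}{1 - e^{A}}\,B ,
$$

which is $\le 0$ whenever $B \le 0$, as expected for a log-probability. The exact middle form could also be evaluated directly as -log1p( exp(A) * expm1(B) / expm1(A) ) without any series.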
What do you think? If that checks out, then I have an idea. If it does not check out, then I have a different idea.
I tried the trick you suggested, but the decomposition only seems to work for rateIntegral, not for lnProbSurvival. So -expm1(A+B) ends up being quite different from p_0_T.
Now, I'm curious to know what idea you have in mind as an alternative.
Drat. I guess that means that the probability of surviving from last_event_time to present_time is not independent of the event of surviving from time 0 to last_event_time.
OK, so if that didn't work, one possibility is that you could say that if the difference between p_0_t and p_0_T is really small then it should be truncated to 0, since the difference could be numerical error. So, something like
if ( log_F_t > 0.0 )
{
if ( log_F_t < 1.0e-11 )
{
p_0_T = p_0_t;
log_F_t = 0;
}
else
throw RbException("Problem in computing the probability of missing species in BDP.");
}
That is obviously kind of a hack.
Really, someone like @hoehna should probably take a look at this.
Yes, I've tried doing something like that; it runs a bit longer but it always ends up with an error at some point. Sometimes p_0_t or p_0_T becomes so small that it's only stored as 0, which disrupts the calculation for good, and you can get log_F_t > 0.1.
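(As a minimal illustration of that failure mode, assuming p_0_T underflows to exactly 0 while p_0_t does not:)
double p_0_t = 1e-300;                        // tiny but still representable
double p_0_T = 0.0;                           // underflowed to zero
double log_F_t = log(p_0_t) - log(p_0_T);     // about -690.8 - (-inf) = +inf
// log_F_t is now +inf, so any threshold like log_F_t > 0.1 is exceeded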
Looking at the code, it seems we add m*log_F_t = m*[log(p_0_t) - log(p_0_T)], but then 5 lines later we add again m*log(p_0_T), so we don't in fact need p_0_T or log_F_t at all, do we? That may solve the issue if it's just a problem with the intermediate calculation.
Wow, that's unexpected but it seems to work, thanks Joëlle! I was able to remove log_F_t altogether and get the same likelihood with the following code:
// now iterate over the vector of missing species per interval
for (size_t i=0; i<incomplete_clades.size(); ++i)
{
// We use equation (5) of Hoehna et al.
// "Inferring Speciation and Extinction Rates under Different Sampling Schemes"
double last_event_time = root_age - incomplete_clade_ages[i];
double p_0_t = -expm1(lnProbSurvival(last_event_time,present_time,1.0) + rateIntegral(last_event_time,present_time));
if ( p_0_t < 0.0 || p_0_t > 1.0 )
{
throw RbException("Problem in computing the probability of missing species in BDP.");
}
// get an estimate of the actual number of taxa
int m = incomplete_clades[i].getNumberMissingTaxa();
// multiply the probability for the missing species
ln_prob_times += m * log(p_0_t);
}
return ln_prob_times;
Looking at the code, it seems we add m*log_F_t = m*[log(p_0_t) - log(p_0_T)], but then 5 lines later we add again m*log(p_0_T)
I guess that is correct, although it took a while to see that. I guess (total_species - num_taxa) is equal to the sum of all the m's.
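Spelling out the cancellation, with $m_i$ the number of missing taxa in clade $i$, $p_{0,t_i}$ its clade-specific probability, and assuming $\sum_i m_i = \text{total\_species} - \text{num\_taxa}$:

$$
\sum_i m_i \left[\log p_{0,t_i} - \log p_{0,T}\right] + \left(\sum_i m_i\right) \log p_{0,T} = \sum_i m_i \log p_{0,t_i} ,
$$

so p_0_T (and with it log_F_t) drops out of the total.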
But I'm not sure that this is right:
if ( p_0_t < 0.0 || p_0_t > 1.0 )
{
throw RbException("Problem in computing the probability of missing species in BDP.");
}
The original criterion throws an error if p_0_t > p_0_T. While it seems like you don't need p_0_T to compute the likelihood, an error here may mean that there is a bug in the code to compute both p_0_t and p_0_T.
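If that sanity check is worth keeping, one option (just a sketch, reusing the tolerance idea from above) would be to compute p_0_T once as a purely diagnostic quantity and compare it against each clade's p_0_t:
double p_0_T = -expm1( lnProbSurvival(0,present_time,1.0) + rateIntegral(0,present_time) );
// ... then inside the loop, after p_0_t has been computed ...
if ( log(p_0_t) - log(p_0_T) > 1.0e-11 )
{
    // p_0_t should never exceed p_0_T by more than numerical noise
    throw RbException("Problem in computing the probability of missing species in BDP.");
}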
// We use equation (5) of Hoehna et al.
// "Inferring Speciation and Extinction Rates under Different Sampling Schemes"
double last_event_time = root_age - incomplete_clade_ages[i];
Probably someone should read this paper to figure out what is going on here? It would be helpful to know the meaning of p_0_t and p_0_T, among other things.
The formula appears in this paper: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0084184&type=printable (eq:6), and is briefly explained at the end of this tutorial: https://revbayes.github.io/tutorials/divrate/sampling.html.
The quantity $F_t = 1 - \frac{1 - P(N(T)>0 \mid N(t)=1)\,\exp(r(t,T))}{1 - P(N(T)>0 \mid N(t_1)=1)\,\exp(r(t_1,T))}$ (which corresponds to 1 - F_t in the code, if I've understood correctly) represents the "cumulative probability density of the time of a speciation event", which justifies checking log_F_t > 0.0.
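In the notation of the code, the numerator and denominator of that ratio are p_0_t and p_0_T, so (if I read it right)

$$
1 - F_t = \frac{p_{0,t}}{p_{0,T}} \in [0,1] \quad\Longrightarrow\quad \log p_{0,t} - \log p_{0,T} = \log\left(1 - F_t\right) \le 0 ,
$$

and a positive log_F_t can only come from numerical error or a bug.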
Nevertheless, I agree that this solution risks masking an error/approximation in the calculation of lnProbSurvival or rateIntegral.
Describe the bug
The code for empirical taxon sampling (clade-specific rho) with dnEpisodicBirthDeath throws an unexpected exception: "Problem in computing the probability of missing species in BDP."

To Reproduce
From the tutorial Diversification Rate Estimation with Missing Taxa, download mcmc_EBD.Rev and primates_tree.nex. Then follow the tutorial instructions in the "Empirical Taxon Sampling" section to modify mcmc_EBD.Rev (by the way, there is a typo: "Varecia_variegata_variegata" should be "Varecia_variegata"), or directly use mcmc_EBD_CSrho.Rev.txt. Run mcmc_EBD_CSrho.Rev.

Expected behavior
Running the MCMC.

Computer info
MacOS Ventura 13.5.2 (Apple M1 Pro), RevBayes version 1.2.1; I also tried with the developer branch.

Additional context
The error arises in the computeLnProbabilityTimes function (in BirthDeathProcess.cpp). I've observed that sometimes pSurvival values can be very low (down to e-300) and exp(rateIntegral) very high (up to exp(600)), so I suppose this can lead to numerical instabilities when p_0_T and p_0_t are expected to be very close.
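For example (a standalone sketch, with values only slightly more extreme than the ones reported above), the natural-scale product can blow up even when the log-scale form is perfectly well behaved:
#include <cmath>
#include <cstdio>

int main( void )
{
    double p_survival    = 1.0e-310;   // subnormal, close to the underflow limit
    double rate_integral = 713.0;      // exp() overflows to +inf above ~709.78

    // product route: exp(rate_integral) is inf, so the result is inf even though
    // the true product is of order 1
    printf( "pSurvival * exp(rateIntegral)      = %g\n", p_survival * exp(rate_integral) );
    // log-scale route: both terms are moderate doubles and the result is finite
    printf( "exp(log(pSurvival) + rateIntegral) = %g\n", exp(log(p_survival) + rate_integral) );

    return 0;
}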
The calculations are based on "Inferring Speciation and Extinction Rates under Different Sampling Schemes" and Likelihood Inference of Non-Constant Diversification Rates with Incomplete Taxon Sampling (@hoehna).