Error when estimating extinction times on synthetic data.

victorwctsang commented 1 year ago

rror message is:

Error in if (dfx == 0) { : missing value where TRUE/FALSE needed
3. pracma::newtonRaphson(fun = function(.th) estimating_eqn(.th,
q, K, u, m, n, eps.sigma), x0 = theta_q.hat)
2. estimate_quantile.minmi(K = K, W = ages, u = mc.samples[, 1:result$B[[i]]],
eps.sigma = sd, q = q[i])
1. rminmi::minmi(ages = W, sd = s, K = synthetic.data.config$K,
alpha = 0.05)

Reproducing the error

Code to reproduce the error:

# Load the synthetic data and config
load("synthetic-data.RData")

# Select the problematic dataset (this is only one of many)
idx <- 310
W = as.numeric(datasets[idx, ]$W[[1]])
error_factor = as.numeric(datasets[idx, ]$error_factor[[1]])
# NB: We use the same measurement error standard errors for all the datasets, but apply different error multipliers.
s = synthetic.data.config$fossil.sd * error_factor

# Run minmi on the offending dataset
rminmi::minmi(ages = W,
                        sd = s,
                        K = synthetic.data.config$K,
                        alpha = 0.05
                        # Don't give any additional tuning parameters for the Monte Carlo estimates
                        # B = NA, A = NA, .B_init = NA
)

Additional information:

The synthetic data and configuration can be downloaded from: (synthetic-data.zip)
There are two objects in synthetic-data.RData:
- synthetic.data.config: these are the constants used to generate the synthetic data.
- datasets: these are the synthetic datasets. For each dataset, there is a fossil record W with $n = $ n.samples and the error factor used. All datasets used the same set of $\sigma$'s, which are stored in synthetic.data.config as fossil.sd

cc @dwarton

victorwctsang commented 1 year ago

Potential causes:

p-value function is no longer stochastically increasing,
Estimating equation is wrong
Root-finding algorithm is bugged

dwarton commented 1 year ago

OK I've found the problem, it is none of the above 🤣 If you use options(error=recover) you can get into the function workspace at the point the error is produced. I did this, entering the estimate_quantile.minmi workspace. You can see in this workspace that at the time the error was produced q was 0.975. I then ran:

thetas = seq(10000,25000,length=100) eeqns = rep(NA,length(thetas)) for (iTheta in 1:length(thetas)) eeqns[iTheta] = estimating_eqn(thetas[iTheta], q, K, u, m, n, eps.sigma) plot(thetas,eeqns,type="l") abline(a=0,b=0,col="red") Note it is monotonically increasing towards zero. max(eeqns,na.rm=TRUE) But it never got to zero. The problem appears to be that we got a NaN error before we got to the solution (which is actually there if we could get to it). Getting inside estimating_eqn to see what is happening: theta = 2200 F.eps.m <- pnorm(m - theta, mean = 0, sd = eps.sigma) F.eps.K <- pnorm(K - theta, mean = 0, sd = eps.sigma) e <- uniform_to_tnorm(u, eps.sigma, a = -Inf, b = m - theta) psi.hat <- apply((m - e - theta) / (K - e - theta), MARGIN = 1, FUN=mean) and we can store the key vector in the estimating equation as ee = log(1 - F.eps.m / F.eps.K * psi.hat) print(ee) and look at its values. Notice that one of the values is NaN: whichNA=which(is.na(ee)) it is the 27th value. Notice also that for any reasonable value of theta, all values of ee are small except for one: whichBig=which(ee==min(ee)) is the 33rd value. What is different about these values... well looking at eps.sigma summary(eps.sigma) it has a very wide range of values, going from 194 up to 4783. The value with big ee is the one with the largest sd (almost double everything else) and value that gives an NA for ee is the one with smallest sd. So the problem seems to be that when given extreme ranges of sds (these ones look like they have an impractically wide range to me - check with the data we have but I think they tend to vary less than this), we encounter machine error computing ee contributions for each observation before we get to the solution.

Looking more deeply to see where the NA comes from, it looks like its source is the uniform_to_tnorm function when asked to truncate normal values a long way from the origin when sd is small. In these cases however psi.hat is one anyway so I would suggest replacing these values with one where they occur, along the lines of psi.hat[(m-theta)/eps.sigma<-XXX]=1

dwarton commented 1 year ago

we could go with return(sum(log(1 - F.eps.m / F.eps.K * psi.hat), na.rm=TRUE) - log(q)) but that is more dangerous because there might be other reasons for NaN other than truncating a normal at a ridiculously small value

dwarton commented 1 year ago

to escape error recovery in future work type options(error=NULL)

dwarton commented 1 year ago

@victorwctsang please check you agree and close if appropriate

victorwctsang / rminmi

Error when estimating extinction times on synthetic data. #1

Reproducing the error