mthulin / boot.pval

Bootstrap p-values, including convenience functions for regression models.

boot.pval throws an error for BCa confidence intervals #4

Closed. astaroph closed this issue 1 month ago.

astaroph commented 2 years ago

Just a small note to draw your attention to an error I got when calculating p-values for BCa confidence intervals, along with my temporary fix (though I still don't understand why it works). Hopefully you might!

When using the original boot.pval function, I get the following error when type = "bca" but not when type is left at its default:

Error in if (any(ints)) out[inds[ints]] <- tstar[k[inds[ints]]] : missing value where TRUE/FALSE needed
6. norm.inter(t, adj.alpha)
5. bca.ci(boot.out, conf, index[1L], L = L, t = t.o, t0 = t0.o, h = h, hdot = hdot, hinv = hinv, ...)
4. boot::boot.ci(boot_res, conf = 1 - alpha_seq, type = type, ...)
3. withCallingHandlers(expr, warning = function(w) if (inherits(w, classes)) tryInvokeRestart("muffleWarning"))
2. suppressWarnings(boot::boot.ci(boot_res, conf = 1 - alpha_seq, type = type, ...))
1. boot.pval_fixed(boot_out, type = "bca", theta_null = 0.5)

I apparently fixed it by changing a single value in the following line.

Original (doesn't work with type = "bca"):

alpha_seq <- seq(1e-16, 1-1e-16, pval_precision)

Modified (does work with type = "bca"):

alpha_seq <- seq(1e-15, 1-1e-15, pval_precision)

Strange!

mthulin commented 2 years ago

Thanks for posting this! I'll look into it as soon as I can.

astaroph commented 2 years ago

Sure thing!

I also noticed something else: when running this again on another dataset, alpha_seq <- seq(1e-15, 1-1e-15, pval_precision) also failed, but alpha_seq <- seq(pval_precision, 1-pval_precision, pval_precision) seems to work every time (so far), although this has the downside of limiting the minimum possible p-value by the number of bootstraps. Anyhow, just wanted to give you a heads-up, and thanks for putting out a truly useful solution for generating p-values from bootstrap results!

arielhasidim commented 1 year ago

+1

cluster_boot_out <- boot(pair_ids, cluster_boot_fun, R = 9999)
boot.pval(cluster_boot_out, 
          theta_null = 1,
          type = "bca")

# Error in if (any(ints)) out[inds[ints]] <- tstar[k[inds[ints]]] :
# missing value where TRUE/FALSE needed
# 6.
# norm.inter(t, adj.alpha)
# 5.
# bca.ci(boot.out, conf, index[1L], L = L, t = t.o, t0 = t0.o,
# h = h, hdot = hdot, hinv = hinv, ...)
# 4.
# boot::boot.ci(boot_res, conf = 1 - alpha_seq, type = type, ...)
# 3.
# withCallingHandlers(expr, warning = function(w) if (inherits(w,
# classes)) tryInvokeRestart("muffleWarning"))
# 2.
# suppressWarnings(boot::boot.ci(boot_res, conf = 1 - alpha_seq,
# type = type, ...))
# 1.
# boot.pval(cluster_boot_out, theta_null = 1, type = "bca")

Sciurus365 commented 1 month ago

Hi, I had the same problem and I figured out why it happens. In boot.pval(), the alpha sequence is set as alpha_seq <- seq(1e-16, 1-1e-16, pval_precision), and 1 - alpha_seq is eventually passed to boot:::bca.ci() as the confidence level. But 1e-16 is so small that, although R treats 1e-16 as different from 0, it treats (1 + 1 - 1e-16)/2 as exactly 1, and qnorm((1 + 1 - 1e-16)/2) is Inf (a step in boot:::bca.ci()). (If a value differs from 1 by less than 5e-17, R stores it as exactly 1; you can check this with 1 - 5e-17 == 1.) This Inf later leads to the missing value reported above. Making 1e-16 a little larger solves the problem, as suggested by @astaroph. I'll submit a pull request if that helps. @mthulin

(I didn't encounter any cases where changing it to 1e-15 still leads to problems, but I'm happy to look into it. @astaroph, if you are still interested in this issue and have a reproducible example, just let me know!)
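If it helps to see the rounding behaviour directly, here is a small sketch in plain R (not the package's code path itself, just the arithmetic described above):

# Doubles just below 1 are spaced roughly 1.1e-16 apart, so a difference
# of 5e-17 from 1 is lost to rounding:
1 - 5e-17 == 1                # TRUE

# The adjusted level computed inside boot:::bca.ci() collapses to exactly 1
# when the lower bound of alpha_seq is 1e-16...
alpha <- 1e-16
(1 + 1 - alpha) / 2 == 1      # TRUE
qnorm((1 + 1 - alpha) / 2)    # Inf; downstream this becomes the NA/NaN
                              # that trips the if() in norm.inter()

# ...but stays strictly below 1 with a slightly larger lower bound:
alpha <- 1e-15
qnorm((1 + 1 - alpha) / 2)    # large but finite (roughly 8)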

astaroph commented 1 month ago

@Sciurus365 It's been a little while since I revisited this issue, and my datasets have evolved a bit since then, so I couldn't find the specific example where setting it to 1e-15 failed (though I know it did). However, I think I may have traced the issue to the pval_precision parameter, which varies with the number of bootstrap replicates (it is defined as pval_precision = 1/boot_res$R if left blank). Observe: assuming you define alpha_seq the same way and change only pval_precision, then:

alpha_seq = seq(1e-15, 1-1e-15, 1e-05) yields a proper sequence of 100,000 elements ending with 0.99999;
alpha_seq = seq(1e-15, 1-1e-15, 1e-04) yields a sequence of 10,001 elements, the last of which is NA; yet
alpha_seq = seq(1e-15, 1-1e-15, 1e-06) yields a proper sequence of 1,000,000 elements ending with 0.999999.

This suggests that the step size of the sequence set by pval_precision (which depends on the number of bootstrap replicates supplied by the user) interacts with the specific minimum value set in the alpha sequence in a way that does not depend solely on how large that minimum value is. Pragmatically, I could not reproduce this error across a range of precision values of different orders of magnitude when I used 1e-14 instead of 1e-15, so perhaps this is just an edge case that only occurs when the step size and sequence boundaries are just small enough to trigger an NA at the end of the sequence? I will try setting mine to 1e-14, see if anything ever triggers it, and report back if it does.
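For reference, here is a quick way to inspect the tail of alpha_seq for different step sizes; the output may well differ across R versions, so treat it as a sketch rather than a definitive reproduction:

# Inspect the length, last element, and NA status of alpha_seq for
# several step sizes (pval_precision values).
for (prec in c(1e-04, 1e-05, 1e-06)) {
  a <- seq(1e-15, 1 - 1e-15, prec)
  cat(sprintf("step %g: length %d, last = %.17g, any NA: %s\n",
              prec, length(a), a[length(a)], anyNA(a)))
}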

arielhasidim commented 4 weeks ago

+1 to @astaroph. I also got the error when setting it to 1e-15.

This is my workaround (inside, there is a link to the patch code): https://stats.stackexchange.com/questions/635200/can-i-remove-a-single-bootstrap-sample-before-calculating-ci

Sciurus365 commented 3 weeks ago

@astaroph Hey, thanks for your reply! Do you mean that the vector from seq(1e-15, 1-1e-15, 1e-04) has an NA at the end, or that it leads to an NA later in the code? I ask because I just tried seq(1e-15, 1-1e-15, 1e-04) and the last element equals 1-1e-15 and is not NA. (If you get an NA here directly, which R version are you using?)

astaroph commented 2 weeks ago

@Sciurus365 Hmmmm. I think I need to do a bit more bug tracing. seq(1e-15, 1-1e-15, 1e-04) was definitely giving me a sequence of 10,001 elements with an NA at the end; however, it now gives me a sequence of 10,001 elements with a 1 at the end, on two different versions of R (4.0.3 and 4.1.0). It's still odd that this doesn't match what you are seeing, and that troubles me a bit.

Pragmatically, specifying alpha_seq with a range of minimum values from 1e-10 to 1e-16 all sporadically yielded errors. If it is helpful context, I'm running these calculations on a range of summary metrics that I calculate and then bootstrap from different columns in a PC dataset, so calculations on a small number of columns would still fail with: Error in out[inds[ints]] <- tstar[k[inds[ints]]] : replacement has length zero

The only fix I have found that seems to be robust to all my different datasets and metrics is to define alpha_seq as: alpha_seq <- seq(pval_precision, 1-pval_precision, pval_precision)

This does have the effect of making the theoretical minimum p-value only as small as pval_precision (i.e. limited by the number of bootstraps). However, if I understand the theory correctly, the minimum p-value is already limited by the number of bootstraps (or at least it should be with a simple percentile p-value calculation). Defining alpha_seq as seq(1e-15, 1-1e-15, pval_precision) means there will be a massive jump between the minimum possible p-value and the next-smallest p-value, while defining it as alpha_seq <- seq(pval_precision, 1-pval_precision, pval_precision) has the advantage of yielding a continuous distribution of possible p-values.
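Spelled out, the workaround above looks like this (a sketch, not the package's current code; it assumes boot_res is the underlying boot object, matching the default pval_precision = 1/boot_res$R mentioned earlier):

# Bound alpha_seq by pval_precision instead of a fixed 1e-16, so the
# smallest attainable p-value matches the bootstrap resolution 1/R.
pval_precision <- 1 / boot_res$R
alpha_seq <- seq(pval_precision, 1 - pval_precision, pval_precision)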

@mthulin, may I ask whether there is a theoretical rationale for having a very low fixed minimum p-value in alpha_seq that does not depend on the precision/number of bootstraps? I wasn't sure whether the set of confidence intervals generated for the range of confidence levels defined by alpha_seq has enough precision to allow p-values more precise than the number of bootstraps, or whether a small minimum p-value is just a convenient numerical stand-in for a p-value close to, but not exactly, zero. Many thanks to you both, and sorry for the convoluted discussion!

Sciurus365 commented 6 days ago

@astaroph There might be something confusing with the display of the numbers in R. If I print the final element it shows 1, but if I subtract 1 from that value it gives something close to 1e-15. (I'm on R 4.4.1.)

> seq(1e-15, 1-1e-15, 1e-04)[10001]
[1] 1
> seq(1e-15, 1-1e-15, 1e-04)[10001]-1
[1] -9.992007e-16

It could well be that in earlier versions of R the final element was NA, but the bug has since been fixed. Changing that small value to pval_precision (or maybe 0.5 * pval_precision or 0.1 * pval_precision) also makes sense to me. I can imagine this might change some results, though. Not sure what @mthulin would think about this proposal.
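For completeness, one way to see the stored value rather than the rounded display (default printing uses only 7 significant digits):

x <- seq(1e-15, 1 - 1e-15, 1e-04)[10001]
print(x)               # shows 1 because of the 7-digit default display
print(x, digits = 17)  # shows a value just below 1
x == 1                 # FALSE, consistent with the subtraction above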

mthulin commented 6 days ago

I like the idea of using pval_precision this way @astaroph and @Sciurus365. I'll be running some simulations over the next few days to see if it causes any unexpected or unwanted behaviour. If not, it seems like a good way forward here!