tidymodels / infer

An R package for tidyverse-friendly statistical inference
https://infer.tidymodels.org
Other
726 stars 80 forks source link

get_p_value(): p-values should never be zero #458

Closed tavareshugo closed 2 years ago

tavareshugo commented 2 years ago

I would like to suggest an update to the get_p_value() function so that a p-value of zero is never returned. As indicated in Phipson & Smyth, Permutation P-values Should Never Be Zero, the permutation-based exact p-value (i.e. without replacement) is:

$p_u = \frac{b + 1}{m + 1}$

For methods with replacement (i.e. the bootstrap) the calculation is different, and the function from their package statmod::permp() can be used instead.

I'm not sure it would be easy to implement this, without an additional argument to the function get_p_value() to specify which type of sampling was used ("bootstrap" or "permutation"). But I think the formula above would already be better than the current implementation, even for the bootstrap method (at least for large sample sizes). This could be done by changing the individual *_p_value() functions:

left_p_value <- function(vec, obs_stat) {
  (sum(vec <= obs_stat) + 1)/(length(vec) + 1)
}

right_p_value <- function(vec, obs_stat) {
  (sum(vec >= obs_stat) + 1)/(length(vec) + 1)
}
simonpcouch commented 2 years ago

Thanks for the issue!

Duplicate of #257, #273, and friends. See the discussion there for thoughts from different folks on the infer team—our current approach of warning + documenting here isn't a silver bullet, but feels good to us. :)

library(infer)

df <- data.frame(
  y = c(rnorm(100, mean = 0), rnorm(100, mean = 10)),
  x = rep(c("a", "b"), each = 100)
)

dist <-
  df %>%
  specify(y ~ x) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 200, type = "permute") %>%
  calculate("diff in means", order = c("a", "b"))

obs <- 
  df %>%
  specify(y ~ x) %>%
  calculate("diff in means", order = c("a", "b"))

get_p_value(dist, obs, "left")
#> Warning: Please be cautious in reporting a p-value of 0. This result is an
#> approximation based on the number of `reps` chosen in the `generate()` step. See
#> `?get_p_value()` for more information.
#> # A tibble: 1 × 1
#>   p_value
#>     <dbl>
#> 1       0

Created on 2022-09-08 by the reprex package (v2.0.1)

tavareshugo commented 2 years ago

Oh, sorry I'd missed those closed issues. Thanks for considering it anyway!

github-actions[bot] commented 2 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.