Change computation of two-sided p-value

echasnovski commented 5 years ago

This issue is a result of discussion started in #203 (after this comment). I think it deserves a separate thread.

The suggestion is to change method of two-sided p-value computation in get_p_value() to "Chihara and Hesterberg" (CaH) approach: "For a two-sided test, we calculate both one-sided P-values, multiply the smaller by 2, and finally (if necessary) round down to 1.0 (because probabilities can never be larger than 1.0)."

There is also a desire for get_p_value() and shade_p_value() (now available only in develop branch) to use the same approach of computing two-sided p-value.

echasnovski commented 5 years ago

This might be a tricky task and here are the reasons:

Using CaH approach (as the only option or as default version for a extra argument suggested here) will break existing package behavior in a very hard-to-notice way, as the outputs will be different but similar. Here is a quick exploration of differences:

Simulation of two two-sided p-value approaches

library(tidyverse)

set.seed(20181019)

theme_set(theme_bw())

# Functions to compute p-value --------------------------------------------
get_percentile <- function(vector, observation) {
  stats::ecdf(vector)(observation)
}

two_sided_p_value_current <- function(x, obs_stat){

  if(stats::median(x$stat) >= obs_stat){
    basic_p_value <- get_percentile(x$stat, obs_stat) +
      (1 - get_percentile(x$stat, stats::median(x$stat) +
                            stats::median(x$stat) - obs_stat))
  } else {
    basic_p_value <- 1 - get_percentile(x$stat, obs_stat) +
      (get_percentile(x$stat, stats::median(x$stat) +
                        stats::median(x$stat) - obs_stat))
  }

  if(basic_p_value >= 1)
    # Catch all if adding both sides produces a number
    # larger than 1. Should update with test in that
    # scenario instead of using >=
    return(tibble::tibble(p_value = 1))
  else
    return(tibble::tibble(p_value = basic_p_value))
}

two_sided_p_value_CaH <- function(x, obs_stat) {
  left_pval <- mean(x[["stat"]] <= obs_stat)
  right_pval <- mean(x[["stat"]] >= obs_stat)
  raw_res <- 2 * min(left_pval, right_pval)

  tibble::tibble(p_value = min(raw_res, 1))
}

# Simulation function -----------------------------------------------------
simulate_p_values <- function(n = 100) {
  x <- tibble(stat = rnorm(n))
  obs_stat <- runif(1, min = -3, max = 3)

  tibble(
    pval_current = two_sided_p_value_current(x, obs_stat)[["p_value"]],
    pval_CaH = two_sided_p_value_CaH(x, obs_stat)[["p_value"]]
  )
}

# Simulation and visualization --------------------------------------------
map_dfr(seq_len(1000), function(i) {
  simulate_p_values()
}) %>%
  ggplot(aes(pval_current, pval_CaH)) +
    geom_point() +
    geom_abline(slope = 1, intercept = 0, colour = "red") +
    coord_equal() +
    labs(
      title = "Current and CaH approaches for two-sided p-value",
      subtitle = paste0(
        "They seem different but not drastically, which may introduce\n",
        "hard-to-notice breaking change."
      )
    )

^{Created on 2018-10-19 by the reprex package (v0.2.1)}

Visual representation of CaH approach in shade_p_value() is also tricky. The best I can think of right now is to shade respective (that gives smaller p-value) left or right tail with extra intensity to reflect multiplication by 2. But this should be a very unintuitive visualization if one doesn't know about the method.

For now I think there are three reasonable possible choices of how to handle this issue:

Don't do anything.
Add new argument two_sided_method to get_p_value() which will change the computation method. The default value should indicate the current approach. This and previous one don't break anything.
Change default computation method in both get_p_value() (maybe with extra argument) and shade_p_value(). This breaks current behavior in a hard-to-notice ways, so some extra notification policy should be thought through. Also the visualization of CaH approach seems to be unintuitive right now.

My personal preference here is either one that don't break anything, as there are a lot of sources that depend on this package. One of their most important goals is to describe logic of p-value computation, which will need to be updated after suggested breaking changes. And it seems like a hard thing to manage.

andrewpbray commented 5 years ago

p-value computation

I see three potential ways to compute these two-tailed p-values:

Reflect the observed stats around the center (mean/median) of the distribution and measure tail area (what we're currently doing).
Take the smaller of the two one-tailed values and multiply it by two (the CaH method).
Draw a horizontal line across the sampling distribution where it hits the vertical line of the observed statistic. The two-tailed p-val would then be the area corresponding to relative frequencies less than that line.

One way we could decide on which to use is to use the one with the more intuitive interpretation. Each approach leads to a slightly different answer to the question of what we mean when we say a p-value is the probability of getting a statistic "more extreme than the observed" under H_0.

1) interprets "more extreme" in terms of magnitude of the statistic; 3) interprets it "more extreme" as being less likely. 2)... i'm actually not quite sure what to think of this interpretation. I do know that 1) becomes problematic when we're dealing with strongly skewed distributions. Although 3) seems more intuitive to me, I'm inclined to go with 2) as it's become the standard approach.

In fact, I'd love to tag Tim Hesterberg here, but I'm not sure he's active on GitHub. I'll send him an email and direct him to this issue to see if we can get his take.

Another way to decide between 1 - 3 is to pick the one with more desirable statistical properties. As far as I know, that means selecting the approach that leads to the maximum power while ensuring that the type I error rate always stays <= alpha. I'm not sure if/how this could be evaluated analytically for all statistics/sampling dists for any alpha, but maybe Tim will know about those results. We could also do simulation spot checks if this is something we're really worried about.

p-value shading

I agree that whatever we decide in terms of the computation needs to be reflected in the visualization. Although 1) is the easiest to shade, 2) and 3) are both possible. I have a side/back burner project to build masking functionality into ggplot2 layers, which would make this easier, but in the meantime, Mine pointed out we could use factor-filled histograms:

library(tidyverse)
set.seed(1023)

df <- tibble(
  x = rnorm(1000), 
  y = ifelse(abs(x) > 0.8, TRUE, FALSE)
)

ggplot(df, aes(x, fill = y)) +
  geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2018-10-19 by the reprex package (v0.2.0).

In this case, we could get the appropriate values for $y$ by inspecting what comes out of ggplot_build() after forming the first unshaded histogram.

@echasnovski do you know if this change in p-value computation will break (and necessitate repairing) anything beyond the shading?

echasnovski commented 5 years ago

Functions get_p_value() and shade_p_value() have independent implementations of p-value computations. get_p_value() has this function and shade_p_value() this one (which acts here as input to custom geom_tail()). So change in one wouldn't break the other. They will just have different implementations which is undesirable.

However, changing method will break existing approach and previous users' code. This is not something that should be repaired, but something that should be notified about (preferably even in advance, I guess). But I am not sure how to do that right now.

About other things:

I think approach 2) (CaH) can also be interpreted as '"more extreme" in terms of magnitude of the statistic' but calculated with different approach. Just like there are different distance method for measuring how far point is from the zero on a plane.
Approach 3) is somewhat new to me. As it does make sense, I think there is an issue about what to consider to be points where "horizontal line... hits the vertical line of the observed statistic". Basically, it means we need to compute the frequency (density) at point of observed statistic. This is a tricky thing to do as it depends on bin width of the histogram, which shouldn't be a good design. Or do you have different approach in mind?
Difference between get_p_value() and shade_p_value() is that the latter should deal not only with visualizing generate()d data, but also with "theoretical" and "both". That is why I really like the current "shading" approach and find factor-filled histograms unnecessary constricting.
Visualizing approach 2) is as much doable, as I described earlier. No better suggestions for now. As for approach 3), then this totally can be done, as bin width is set there. And "theoretical" approach won't even change (assuming theoretical distribution is centered at 0, which is done now).

echasnovski commented 5 years ago

Following #206 (this comment in particular), even the current implementation of two_sided_p_value() should be updated to something like this:

two_sided_p_value <- function(x, obs_stat){
  median_reflection <- 2 * stats::median(x$stat) - obs_stat

  left_border <- min(obs_stat, median_reflection)
  right_border <- max(obs_stat, median_reflection)

  left_tail <- mean(x[["stat"]] <= left_border)
  right_tail <- mean(x[["stat"]] >= right_border)

  p_val <- min(left_tail + right_tail, 1)

  tibble::tibble(p_value = p_val)
}

And, probably, also the visualization logic (to make them use identical approach):

mirror_obs_stat <- function(vector, observation) {
  2 * stats::median(vector) - observation
}

echasnovski commented 5 years ago

@andrewpbray, is there any settled decision about which method of two-sided p-value should be used here?

andrewpbray commented 5 years ago

@echasnovski you bring up some great points.

I think approach 2) (CaH) can also be interpreted as '"more extreme" in terms of magnitude of the statistic' but calculated with different approach.

Yes, this interpretation makes sense to me.

Approach 3 . . . depends on bin width of the histogram.

That's an excellent point. I don't imagine that there'd be a clearly optimal way to calculate this sort of p-value. I think we'd just have to make it subject to the same generally-decent algorithm that the histogram function uses.

shade_p_value() should deal not only with visualizing generate()d data, but also with "theoretical" and "both".

I agree, that's a good design goal and I guess the shading does help there. What is lost, at least to my eye, is the notion of "areas under the curve" since the shading isn't just of the curve. The solution I'd like to build out (but it'd take far more time than geom_rect() shading and factor filled histograms) is the idea of a mask layer in ggplot 2. Basically, you'd put in your first histogram layer in gray, as the actual sampling distribution, then a geom mask layer that would mask off the middle of the plot from future layers, then a second histogram layer in red, which would only paint over the tails. It'd be very similar to the shading approach in terms of overall design, I'd think.

What about:

1) switching over to approach 2) for get_p_value() (and using the <= noted in #206 ) 2) adapting shade_p_value() to get the appropriate left and right sides of that geom_rect()

then if/when I get around to a ggplot2 mask layer, we can swap out 2?

Oh, btw, no word back from Tim Hesterberg yet.

echasnovski commented 5 years ago

Some overnight insight:

Currently get_p_value() and visualize() have different logic of computing/representing two-sided p-values. They both essentially act by algorithm "compute complementary tail border and return sum of two tail areas" but compute second border differently. As get_p_value() mirrors observed statistic around median of sample distribution (see here), visualize() chooses it so that the complementary tail would have equal area not greater than 0.5 (see here and here).
Current visualize() approach in fact demonstrates the "Chihara and Hesterberg" approach. As it implies that the complementary tail has the same area as the current one (what doubling in "CaH" does) and both of them are not greater than 0.5. This seems to be a fairly logical way to illustrate "CaH" approach. They differ only slightly when "CaH" method returns 0 as in that case complementary tail can't be 0 due to quantile() behavior when there is 0 or 1 in probs argument. See exploration.

Exploration of "CaH" and `visualize()` two two-sided p-value approaches

``` r library(tidyverse) # P-value functions ------------------------------------------------------- # The CaH approach two_sided_p_value_cah <- function(x, obs_stat) { left_pval <- mean(x[["stat"]] <= obs_stat) right_pval <- mean(x[["stat"]] >= obs_stat) raw_res <- 2 * min(left_pval, right_pval) tibble::tibble(p_value = min(raw_res, 1)) } # The current approach implied in `visualize()` two_sided_p_value_viz <- function(x, obs_stat){ obs_stat_reflection <- mirror_obs_stat(x[["stat"]], obs_stat) left_border <- min(obs_stat, obs_stat_reflection) right_border <- max(obs_stat, obs_stat_reflection) left_tail <- mean(x[["stat"]] <= left_border) right_tail <- mean(x[["stat"]] >= right_border) p_val <- min(left_tail + right_tail, 1) tibble::tibble(p_value = p_val) } mirror_obs_stat <- function(vector, observation) { obs_percentile <- stats::ecdf(vector)(observation) stats::quantile(vector, probs = 1 - obs_percentile) } # Helper functions -------------------------------------------------------- # Simulates a random sample from Beta distribution (chosed for variative # skewness) simulate_x <- function(n_obs, shape1, shape2) { tibble( replication = seq_len(n_obs), stat = rbeta(n_obs, shape1, shape2) ) } # Simulates observed statistic simulate_obs_stat <- function(n) { runif(n) } # Exploration ------------------------------------------------------------- set.seed(20181027) simulation_tbl <- tibble(id = seq_len(1000)) %>% mutate( n_obs = sample(10:1000, n(), replace = TRUE), # Parameters of Beta distribution shape1 = runif(n(), min = 1, max = 100), shape2 = runif(n(), min = 1, max = 100), # Inputs for `two_sided_p_value_*()` functions x = pmap(list(n_obs, shape1, shape2), simulate_x), obs_stat = simulate_obs_stat(n()), # Compute different p-values pval_cah = map2_dbl( x, obs_stat, ~ two_sided_p_value_cah(.x, .y)[["p_value"]] ), pval_viz = map2_dbl( x, obs_stat, ~ two_sided_p_value_viz(.x, .y)[["p_value"]] ) ) simulation_tbl %>% mutate( pval_diff = pval_viz - pval_cah, is_pvalCaH_zero = if_else(pval_cah == 0, "zero", "not zero") ) %>% ggplot(aes(n_obs, pval_diff, colour = is_pvalCaH_zero)) + geom_point() + scale_colour_manual(values = c(zero = "#30B030", `not zero` = "#B03030")) + labs( x = "Number of observations", y = '"Viz" p-value minus "CaH" p-value', colour = "Is CaH p-value zero?", title = "CaH and current `visualize()` approaches are fairly equal", subtitle = paste0( "They only differ if CaH approach returns 0. ", "In that case `visualize()` never can return 0 because both\n", "`quantile(x, probs = 0)` and `quantile(x, probs = 1)` ", "return minimum and maximum value.\n", "Together with not strict >= or <= sign it returns non-zero but small ", "p-value which\ndecreases as sample increases ." ) ) + theme_bw() ``` ![](https://i.imgur.com/2VYBLGd.png) ^{Created on 2018-10-27 by the [reprex package](https://reprex.tidyverse.org) (v0.2.1)}

So my conclusions are:

We actually need to replace computation of two-sided p-value with "CaH" approach to match current visualize() output. Note that this breaks previous code.
We don't need to change logic of visualize().

@andrewpbray, @ismayc, if that sounds good, I can make a PR.

ismayc commented 5 years ago

That makes sense to me. I’ll let @andrewpbray confirm.

andrewpbray commented 5 years ago

Oh, you're right! Hey, that simplifies things somewhat. That's an elegant approach you have there in visualize() using ecdf() and quantile().

I agree with both of your conclusions - PR away!

echasnovski commented 5 years ago

Well, that is not mine approach :) It was there before I got here.

ismayc commented 5 years ago

I’ll take the credit for that 😄

TimHesterberg commented 5 years ago

Andrew, thanks for calling this to my attention. Sorry for the delay - traveling.

I'll argue in favor of CaH.

(1) Methods 1 and 3 are not transformation invariant - you'll get different result if you test using sigma vs sigma^2, or if you test using hazard ratio vs log-hazard-ratio.

As a result, those methods are subject to P-value hacking, where you choose a transformation to get the answer you want.

(2) When you're computing P-values, you're measuring how consistent the data are with the null hypothesis on the probability scale.

The CaH method is consistent with that, the other methods are not.

If the null distribution is not symmetrical, then values that are equally far from \theta_0 on the horizontal-distance scale can correspond to very different values on the probability scale. And I suspect that the density approach is even worse in that regard.

If \hat\theta_1 < \theta_0 < \hat\theta_2, then \hat\theta_1 and \hat\theta_2 give the same two-sided CaH P-value if they are equally far from \theta_0 on the probability scale.

(3) CaH is consistent with how people interpret two-sided P-values. Given a two-sided P-value of 0.04, people interpret that as meaning that the one-sided P-value would be 0.02, not that it would be anywhere between 0.00 and 0.04.

One additional note - you should add 1 to numerator and denominator when computing one-sided P-values: (1 + # resamples with statistic as or more extreme) / (r + 1). In effect, you include the original sample as one of the samples. This prevents a P-value of 0.

echasnovski commented 5 years ago

Wow, thanks for a lot of new insights.

About adding 1. Although it seems to be practically usefull to always not have p-value equal to 0, I personally don't find the justification very compelling:

One can't just "use" original sample as one of the samples from null distribution, as this is what p-value tries to justify.
Consequence of adding 1 is that in order to achieve p-value less than 0.05 one needs at least 20 observations from null distribution. Of course, having less than 20 is a bad idea, but having this "hard-coded" threshold is somewhat hacky.
I also don't think that this approach will be good for teaching.

TimHesterberg commented 5 years ago

(1) In hypothesis testing, the presumption is that the null hypothesis is true unless the data provide strong evidence otherwise. It should be the data that provide that evidence, not simulation error due to using very few samples. And if it is computationally infeasible to use a good number of samples, then you should be somewhat conservative - and adding 1 to numerator and denominator does that.

(2) A P-value of zero is literally impossible. You shouldn't report an impossible P-value.

(3) I agree, this is a detail that you don't want to emphasize in teaching, at least at first or in stat 101. Better to let the software handle it, and just call it a technical detail, to prevent an impossible P-value of 0.0.

One approach to teach it: Suppose you've observed 99 random values, with no ties. Where will the next one be? It could be anywhere - less than the smallest, between any adjacent pair, or above the largest. There's 1/100 chance of it falling in each of those gaps. So depending where the observed statistical falls among 99 random values, the P-value is 1/100, 2/00, etc.

echasnovski commented 5 years ago

I belive that "A P-value of zero is literally impossible" is not true in general case. If, for some reason, there is an assumption of a finite support null distribution (say, uniform on [0, 1]) then p-value 0 can as well be true (if data gives 1.5, for example).

Even with infinite support there is a question of numerical precision: zero p-value can be a more accurate estimate of "truth" than 1 / (# of samples + 1).

luebby commented 5 years ago

I maybe mistaken, but in teaching I find it reasonable to argue in the case none of the sims exceeds the given statistics to say the simulated p-value<1/sims.

andrewpbray commented 5 years ago

Tim, thanks a lot for the feedback! Some thoughts in response:

(1) Transformation invariance: this is a nice property and I hadn't thought of it.

(2) It's consistent with the probability scale: I agree that if we're defining a p-value in terms of "probability of that value or more extreme (on the statistic scale), then we're talking about the cdf. Under that definition, you're right that, if the observed (right tail) stat is \hat{\theta}_R, then left statistic that satisfies this definition would be:

$\hat{\theta}_L = F^{-1}(1- F(\hat{\theta}_R))$

and that

$|F(\hat{\theta}_L) - .5| = | F(\hat{\theta}_R) - .5|$

I suppose my issue is more general: I don't find this to be a very satisfying scale to measure consistency with a particular reference distribution. As an extreme example, consider a distribution that looked like this:

To my eye, the values of the statistic that I'd think are inconsistent with this model are ones near 0. How could we design a test to recognize this is we're working with the probability scale / cdf?

(3) It coheres with intuition that the one-tailed p-val should be half that of the two-tailed. I agree, but I'm sure that this is because 99.9% of people pushing out these two-tailed p-values are working with symmetric distributions, in which case the density approach is identical.

Regarding the adding of one when computing the p-value, thanks for bringing that up too, I had forgotten about that detail from your book.

(1) In hypothesis testing, the presumption is that the null hypothesis is true unless the data provide strong evidence otherwise.

But using the data in the calculation of the p-value (when reps = 0) doesn't allow it to provide evidence against the null does it? I admit to having a hard time understanding what we mean by evidence without working in a Bayesian framework.

I find that when I think about the null hypothesis, I hear a voice in my head like a movie trailer: "In a world where .... the mean is 6!", and the null distribution shows what that world would look like. We're left to assess whether this data that we collected from our world is consistent with that world. This has to be done with consideration that the data can be very vague, in which case consistency is difficult to establish, and a strong statement either way is unwise.

andrewpbray commented 5 years ago

In the interest of resolving the concrete issue of which p-value definition to use, I think going with (2) makes the most sense because it is consistent with the overwhelmingly standard approach to calculating a p-value and for some of the other issues that Tim brings up.

echasnovski commented 5 years ago

To finally close this off: @andrewpbray, @ismayc, what is your opinion about adding 1 to numerator and denominator during computation of one-sided p-value? Should this be done in code? If eventually we're going to do this, it seems wise to do it now, as get_pvalue() already has "broken" behavior.

echasnovski commented 5 years ago

I personally don't like the idea of always not having zero p-value. However, I found one area when it might be a useful feature: multiple hypothesis testing. If p-value is exactly zero then all p-value adjustment methods (as in p.adjust.methods for p.adjust() function) don't change that. In case of really large number of hypothesis (at least around number of resamples during computation of one p-value) always having non-zero p-value might be a good thing to do.

ismayc commented 5 years ago

I think if we change it, we should add a message explaining the calculation or we are going to get lots of people requesting clarification as issues in the repo. I don’t think it is as commonly used, even though the reasons do make sense as Tim has laid out.

ismayc commented 5 years ago

The important point I think is that when we use simulation-based methods, we are approximating a process of permutating all samples (or bootstrapping or simulating a coin flip). So when we say "the probability of observing a sample statistic as extreme or more extreme than what was observed in the sample, assuming the null hypothesis is true", the term "probability" isn't exact. It's dependent on how many samples we generated. This definition of a p-value matches up nicely with normal distribution theory and the non-zero p-value since there will always be a bit of area more extreme even if it is extremely small.

andrewpbray commented 5 years ago

I lean towards not implementing the +1 thing, primarily for simplicity.

echasnovski commented 5 years ago

So I guess the current get_p_value() implementation in develop branch is acceptable (for now).

TimHesterberg commented 5 years ago

Another reason to use +1 is consistency with confidence intervals. When doing bootstrap percentile or bootstrap t confidence intervals, you should use the analog of R's quantile(..., type = 6), in which the i'th largest observation correspond to probability i/(n+1), with interpolation between observations. This corresponds to equal probability 1/(n+1) between each pair of observations, and above and below the extreme observations. Other quantile definitions result in confidence intervals that undercover even more badly.

andrewpbray commented 5 years ago

It looks like currently we use the default in quantile() which is type = 7.

Hmm, a very simple fix would be to allow the passage of alternative type arguments to both the CI and p-val functions. This is easy enough for the CI functions. The p-value functions, though, would need to be switched over to the inverse of the quantile() function, which is ecdf(), but it looks there aren't alternative types for that. We could write our own inv_quantile() function to do that and start by implementing just types 6 and 7.

Thoughts?

echasnovski commented 5 years ago

Computation of p-value (with get_p_value()) in develop branch doesn't depend on quantile() at all. And I believe it shouldn't as current implementation seems to be the closest to p-value estimation.

About CI functions, there are several problems:

I don't really understand why get_confidence_interval() "should" use the analog of quantile(*, type = 6). Is there a common situation where the difference (with default type = 7) is crucial?
Confidence intervals can be computed not only for bootstrap samples, but also for results of generate(*, type = "permute") and generate(*, type = "simulate"). We, of course, can use different types of quantile() for different generate() type but is it really necessary?

echasnovski commented 5 years ago

And maybe it is better to create new issue for any kind of update suggestion? As this was originally intended to discuss two-sided p-value.

andrewpbray commented 5 years ago

I think that makes sense: close out this issue then open a new one with variable type as a feature request. We'd likely set the defaults at their current behavior, so a change down the line won't change the existing methods. If enough users are expecting this behavior, it could be handy to build it in at some point.

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

tidymodels / infer

Change computation of two-sided p-value #205

p-value computation

p-value shading