kassambara / rstatix

Pipe-friendly Framework for Basic Statistical Tests in R
https://rpkgs.datanovia.com/rstatix/
444 stars 50 forks source link

Some issues with p.adjust.method and combining p.values from multiple tests #28

Closed barrel0luck closed 4 years ago

barrel0luck commented 4 years ago

Hi again, I'm trying to use your package for stuff I normally do in R. And so as I face difficulties, I'm posting here. If I'm doing something obviously wrong apologies again. I've three issues this time:

  1. In t_test although F1 says that p.adjust.method = "holm" by default, it doesn't seem to be correcting p.values by default. In fact the p.adjust.method seems to be broken for this function and could potentially be a bug.
  2. In pairwise_t_test, setting p.adjust.method = "none" still returns p.adj and p.adj.signif columns (eventhough these don't have adjusted values)
  3. I'm trying to combine two different tests. In the first test I perform t_test with mu = 100. In the second test I perform pairwise tests using either a list of comparisons or pairwise_t_test. So my requirement here is that first perform the two tests without p value adjustment. Then combine the tables and adjust the p.values (so that p values are adjusted across all tests rather than just within a single test). And then finally plot these in a plot. Here's the code I used in the previous issue as an example code here again with the issues commented. I use t_test / wilcox_test. I think the issues are reproducible in the both of them (and maybe other tests that I've not checked). Again, if I'm doing something obviously wrong as last time, apologies in advance!
    
    ####### Problems with output of t_test --------

Required package --------------------

library(ggpubr) library(rstatix) # rstatix version: 0.4.0

I'm on Fedora 31 with R updated and all packages updated, if that detail is relevant

Generate dataset ------------------------

set.seed(24) labels <- c("A", "B", "C", "D", "E", "F", "G")

test_dataset <- data.frame(label = NULL, value = NULL) means_vector <- c(75, 125, 75, 55, 45, 35, 25) for (i in 1:length(means_vector)) { ran_nums <- rnorm(n = 20, mean = means_vector[i], sd = 50) temp_data <- data.frame(label = labels[i], value = ran_nums) test_dataset <- rbind(test_dataset, temp_data) }

Perform the test ------------------------

PROBLEM 1

These two give the same results, eventhough by default p.adjust.method = "holm"

(When I press F1 on wilcox_test this is what is shown regarding p.adjust.method:

wilcox_test(data, formula, comparisons = NULL, ref.group = NULL,

p.adjust.method = "holm", paired = FALSE, exact = NULL,

alternative = "two.sided", mu = 0, conf.level = 0.95,

detailed = FALSE)

)

test_dataset_pvalues_1 <- test_dataset %>% group_by(label) %>% wilcox_test(value ~ 1, mu = 100)

test_dataset_pvalues_2 <- test_dataset %>% group_by(label) %>% wilcox_test(value ~ 1, mu = 100, p.adjust.method = "none")

test_dataset_pvalues_1 == test_dataset_pvalues_2 # shows both are identical

These two give same results

test_dataset_pvalues_1 <- test_dataset %>% group_by(label) %>% t_test(value ~ 1, mu = 100)

test_dataset_pvalues_2 <- test_dataset %>% group_by(label) %>% t_test(value ~ 1, mu = 100, p.adjust.method = "none")

test_dataset_pvalues_1 == test_dataset_pvalues_2 # shows both are identical

These two give same results

test_dataset_pvalues_1 <- test_dataset %>% group_by(label) %>% t_test(value ~ 1, mu = 100, p.adjust.method = "holm")

test_dataset_pvalues_2 <- test_dataset %>% group_by(label) %>% t_test(value ~ 1, mu = 100, p.adjust.method = "none")

test_dataset_pvalues_1 == test_dataset_pvalues_2 # shows both are identical

These two give same results

test_dataset_pvalues_1 <- test_dataset %>% group_by(label) %>% t_test(value ~ 1, mu = 100, p.adjust.method = "holm")

test_dataset_pvalues_2 <- test_dataset %>% group_by(label) %>% t_test(value ~ 1, mu = 100, p.adjust.method = "BH")

test_dataset_pvalues_1 == test_dataset_pvalues_2 # shows both are identical

So basically the p.adjust.method doesn't seem to be functioning

PROBLEM 2

In pairwise_t_test when asked for not adjusting p values, this setting is respected, but p.adj columns are returned anyway (with non adjusted pvalues)

If this is intentional behaviour - that's okay (although it could be misleading) - but this behaviour seems to be different from what you have going for the t_test function where these extra p.adj columns are not given out

test_dataset_pvalues_1 <- test_dataset %>%

group_by(label) %>% # Why does uncommenting this line give an error?

pairwise_t_test(value~label)

test_dataset_pvalues_2 <- test_dataset %>%

group_by(label) %>% # Why does uncommenting this line give an error?

pairwise_t_test(value~label, p.adjust.method = "holm")

test_dataset_pvalues_1 == test_dataset_pvalues_2 # shows both are identical AS IT SHOULD IN THIS CASE

However, when setting p.adjust.method = "none" although it works, why are the p.adj and p.adj.signif columns returned?

These columns have the same p values as the non adjusted one - so the correction has not taken place as asked, but the columns are returned anyway when they should not be right?

test_dataset_pvalues_3 <- test_dataset %>%

group_by(label) %>% # Why does uncommenting this line give an error?

pairwise_t_test(value~label, p.adjust.method = "none")

PROBLEM 3

How do I combine results of tests for performing multiple testing correction and plotting in same plot?

For example, in my random dataset generated here, I need to perform first comparisons with a mu = 100

test_dataset_pvalues_1 <- test_dataset %>% group_by(label) %>% wilcox_test(value ~ 1, mu = 100)

Then I need to make some pairwise comparisons

comparisons_list <- list(c("A", "B"), c("A", "D"), c("A", "F")) test_dataset_pvalues_2 <- test_dataset %>%

group_by(label) %>%

wilcox_test(value ~ label, comparisons = comparisons_list, p.adjust.method = "none")

Or I need to make all pairwise comparisons

test_dataset_pvalues_2 <- test_dataset %>%

group_by(label) %>%

pairwise_wilcox_test(value ~ label, p.adjust.method = "none")

So now I need to report both test_dataset_pvalues_1 and 2 in my plot

But first I need to combine them and perform multiple testing correction across all tests

Currently I can only perform multiple testing correction within the set of tests performed, but often you need to perform multiple testing correction across all tests performed right?

There's no easy way to combine test_dataset_pvalues_1 and 2 because of two reasons:

The extra p.adj columns even with multiple testing is not performed

And the n1 n2 columns issue which would be absent in the tests performed with mu = 100 (table 1)

So how would you recommend this complex testing paradigm be achieved in a pipe?

Having one common table will also aid their addition to the plot

Plot the data -------------------------------

myplot <- ggplot(test_dataset, aes(x = label, y = value)) + geom_boxplot() + geom_hline(yintercept = 100, linetype = 2, colour = "red")

Add p-values on the plot -----------------------

Right now I've just added to two tables without pvalue adjustment for this example.

I'd like to combine the two tables, adjust pvalues and then plot them together.

test_dataset_pvalues_1 <- test_dataset_pvalues_1 %>% add_xy_position(x = "label")

myplot + stat_pvalue_manual(test_dataset_pvalues_1, label = "p", remove.bracket = TRUE, hjust = 1) + stat_pvalue_manual(test_dataset_pvalues_2, label = "p", remove.bracket = FALSE, y.position = 200) # How to auto get y.position here?

kassambara commented 4 years ago

PROBLEM 1

This is not a real problem. The obtained output is expected by design. As explained, in the details section of the function, for a grouped data, the p-value adjustment is performed for each group level independently.

In your example, there is only one comparaison by label. So, the p-value remains unchanged no matter whether you specify or not the option p.adjust.method.

In the following example, we have 3 comparisons per supp levels, so the multiple testing correction is applied.

library(rstatix)
# Data
df <- ToothGrowth
df$dose <- factor(Tdf$dose)
#> Error in factor(Tdf$dose): objet 'Tdf' introuvable
# Test 1
df %>%
  group_by(supp) %>%
  wilcox_test(len ~ dose)
#> # A tibble: 6 x 10
#>   supp  .y.   group1 group2    n1    n2 statistic       p   p.adj
#> * <fct> <chr> <chr>  <chr>  <int> <int>     <dbl>   <dbl>   <dbl>
#> 1 OJ    len   0.5    1         10    10       7.5 1.00e-3 3.00e-3
#> 2 OJ    len   0.5    2         10    10       0   1.81e-4 5.43e-4
#> 3 OJ    len   1      2         10    10      26.5 8.20e-2 8.20e-2
#> 4 VC    len   0.5    1         10    10       0   1.80e-4 5.40e-4
#> 5 VC    len   0.5    2         10    10       0   1.82e-4 5.40e-4
#> 6 VC    len   1      2         10    10       3   4.35e-4 5.40e-4
#> # … with 1 more variable: p.adj.signif <chr>

# Test 2
df %>%
  group_by(supp) %>%
  wilcox_test(len ~ dose, p.adjust.method = "none")
#> # A tibble: 6 x 10
#>   supp  .y.   group1 group2    n1    n2 statistic       p   p.adj
#> * <fct> <chr> <chr>  <chr>  <int> <int>     <dbl>   <dbl>   <dbl>
#> 1 OJ    len   0.5    1         10    10       7.5 1.00e-3 1.00e-3
#> 2 OJ    len   0.5    2         10    10       0   1.81e-4 1.81e-4
#> 3 OJ    len   1      2         10    10      26.5 8.20e-2 8.20e-2
#> 4 VC    len   0.5    1         10    10       0   1.80e-4 1.80e-4
#> 5 VC    len   0.5    2         10    10       0   1.82e-4 1.82e-4
#> 6 VC    len   1      2         10    10       3   4.35e-4 4.35e-4
#> # … with 1 more variable: p.adj.signif <chr>

Created on 2020-02-17 by the reprex package (v0.3.0)

PROBLEM 2

This was also by design to have a fixed output format for pairwise comparisons, where p-values are supposed to be always adjusted.

PROBLEM 3

From my point of view, it makes no sens to combine the p-values of two completely different tests and to finally adjust them together.

So, I would go as follow:


# Required package --------------------
library(ggpubr)
#> Le chargement a nécessité le package : ggplot2
#> Le chargement a nécessité le package : magrittr
library(rstatix) # rstatix version: 0.4.0
#> 
#> Attachement du package : 'rstatix'
#> The following object is masked from 'package:stats':
#> 
#>     filter
# I'm on Fedora 31 with R updated and all packages updated, if that detail is relevant

# Generate dataset ------------------------
set.seed(24)
labels <- c("A", "B", "C", "D", "E", "F", "G")

test_dataset <- data.frame(label = NULL, value = NULL)
means_vector <- c(75, 125, 75, 55, 45, 35, 25)
for (i in 1:length(means_vector)) {
  ran_nums <- rnorm(n = 20, mean = means_vector[i], sd = 50)
  temp_data <- data.frame(label = labels[i], value = ran_nums)
  test_dataset <- rbind(test_dataset, temp_data)
}

# Tests -------------------
test_dataset_pvalues_1 <- test_dataset %>% 
  group_by(label) %>% 
  wilcox_test(value ~ 1, mu = 100) %>%
  adjust_pvalue(method = "BH") 

# Then I need to make some pairwise comparisons
comparisons_list <- list(c("A", "B"), c("A", "D"), c("A", "F"))
test_dataset_pvalues_2 <- test_dataset %>% 
  wilcox_test(value ~ label, comparisons = comparisons_list, p.adjust.method = "BH")

# Plot the data -------------------------------
myplot <- ggplot(test_dataset, aes(x = label, y = value)) +
  geom_boxplot() +
  geom_hline(yintercept = 100, linetype = 2, colour = "red") 

# 1) Plot  + One-sample test result
test_dataset_pvalues_1  <- test_dataset_pvalues_1 %>%
  add_xy_position(x = "label") %>%
  p_round(p.adj, digits = 2)
myplot + 
  stat_pvalue_manual(test_dataset_pvalues_1, label = "p.adj", 
                     remove.bracket = TRUE, hjust = 1) 


# 2) Plot  + Two-sample test result
test_dataset_pvalues_2  <- test_dataset_pvalues_2 %>%
  add_xy_position(x = "label")
myplot + 
  stat_pvalue_manual(test_dataset_pvalues_2, label = "p.adj", tip.length = 0) 

Created on 2020-02-18 by the reprex package (v0.3.0)

barrel0luck commented 4 years ago

Ok, this is the murkier part of multiple testing correction for me. I always multiple testing correct within an experiment and use only one kind of test within an experiment, i.e. a t_test (or wilcox_text). As I understand multiple testing correction, if you have a result that is barely significant at alpha = 0.05 then you expect 5% of all your tests performed to be falsely significant. So to overcome this issue, by performing multiple testing correction you're reducing the alpha threshold in effect (in some manner that depends on the numbers of tests performed) by adjusting the p value to make it higher. So, in this box plot if I'm plotting values that are generated to a reference label (not in the data as it's always at 100) then I need to:

  1. Perform the t_test relative to the reference (mu = 100) to see if a difference exists relative to reference, and then
  2. Perform pairwise tests. As I see it from the number of tests perspective, both tests are the same and need in toto multiple testing correction. This way is more conservative so it shouldn't be an issue per se. I think I need to understand p value adjustment a bit better. Can you point to any sources that are not obscure statistics books? Thanks anyway.
barrel0luck commented 4 years ago

The x positions calculated seem to be incorrect when making custom comparisons and not starting with the first label.

# Required package --------------------
library(ggpubr)
library(rstatix)

# Generate dataset ------------------------
set.seed(24)
labels <- c("A", "B", "C", "D", "E", "F", "G")

test_dataset <- data.frame(label = NULL, value = NULL)
means_vector <- c(75, 125, 75, 55, 45, 35, 25)
for (i in 1:length(means_vector)) {
  ran_nums <- rnorm(n = 20, mean = means_vector[i], sd = 50)
  temp_data <- data.frame(label = labels[i], value = ran_nums)
  test_dataset <- rbind(test_dataset, temp_data)
}

comparisons_list <- list(c("D", "B"), c("C", "D"), c("B", "C"))
test_dataset_pvalues <- test_dataset %>% 
  wilcox_test(value ~ label, comparisons = comparisons_list) %>% 
  add_xy_position(x = "label")

ggplot(test_dataset, aes(x = label, y = value)) +
  geom_boxplot() +
  geom_hline(yintercept = 100, linetype = 2, colour = "red") +
  stat_pvalue_manual(test_dataset_pvalues, label = "p.adj")

image

So here, even if the comparisons are D vs B, C vs D and B vs C the brackets are drawn from A onwards. Of note, stat_compare_means perfectly manages to add the brackets to the correct position and I could possibly use that, but I don't think it corrects for multiple testing. Is that correct?

kassambara commented 4 years ago

Thank you for reporting this issue, I will fix it as soon as possible

kassambara commented 4 years ago

fixed now, thanks!

barrel0luck commented 4 years ago

Thanks for the fix! Also, can stat_compare_means perform multiple testing?

kassambara commented 4 years ago

stat_compare_means if from the ggpubr package. It doesn't handle multiple testing correction. I'll work on this; Related issue: https://github.com/kassambara/ggpubr/issues/119