kassambara / rstatix

Pipe-friendly Framework for Basic Statistical Tests in R
https://rpkgs.datanovia.com/rstatix/

Management of missing values in a paired t_test or pairwise_t_test: effect on sample size and beyond #175

Open fpantin opened 1 year ago

fpantin commented 1 year ago

Hello,

1) When NA values are passed to t_test() or pairwise_t_test(), they are still counted in the reported sample size instead of being excluded. An example:

# Load rstatix (provides t_test() and re-exports the %>% pipe)
library(rstatix)

# Import data
data("ToothGrowth")
df <- ToothGrowth

# Perform unpaired and paired t-tests on the complete data
df %>% t_test(len ~ supp)
df %>% t_test(len ~ supp, paired = TRUE)

# Replace one observation by NA
df$len[1] <- NA

# Redo both t-tests and compare the reported n1 and n2
df %>% t_test(len ~ supp)
df %>% t_test(len ~ supp, paired = TRUE)

The reported sample sizes (n1, n2) are unchanged even though one observation is now missing. Could that be corrected? I think this point was also raised in issue #147, and in the resolved issue #104 for wilcox_test(). I guess, though, that the correct sample size is used to compute the t statistic, because the results are the same as with the base R function:

x <- df$len[df$supp == "OJ"]
y <- df$len[df$supp == "VC"]
t.test(x, y)
t.test(x, y, paired = TRUE)
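
To make the discrepancy concrete, the number of observations that actually enter each base R test can be counted by hand. The following is a minimal sketch using base R only; the names n_unpaired, n_pairs and df_paired are illustrative, and it shows what stats::t.test() does, not what rstatix does internally.

```r
# Sketch: count the observations that actually enter each base R t-test.
data("ToothGrowth")
df <- ToothGrowth
df$len[1] <- NA                      # row 1 belongs to supp == "VC"

x <- df$len[df$supp == "OJ"]
y <- df$len[df$supp == "VC"]

# Unpaired test: NAs are dropped independently in each group
n_unpaired <- c(OJ = sum(!is.na(x)), VC = sum(!is.na(y)))

# Paired test: a pair is kept only if both of its members are observed
n_pairs <- sum(complete.cases(x, y))

# Degrees of freedom reported by the paired test (number of pairs minus one)
df_paired <- unname(t.test(x, y, paired = TRUE)$parameter)

print(n_unpaired)
print(n_pairs)
print(df_paired)
```

With one NA in the VC group, the unpaired test uses 30 and 29 observations, while the paired test keeps 29 complete pairs, so these are the sample sizes one would expect to see reported.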

2) For paired tests, if one observation is NA in group 1, does the function:

3) As far as I understand, observations are paired according to their order in the dataset within each group. It would be great to add an argument that lets the user supply the column identifying the pairs, just like the wid argument for mixed ANOVA in anova_test().
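
Until such an argument exists, pairing on an explicit identifier can be done by hand before the test. Below is a minimal base R sketch. Note that ToothGrowth has no subject identifier, so the id column here is hypothetical and only mirrors the row order within each group; res_by_id and res_by_order are illustrative names.

```r
# Sketch: pair observations through an explicit ID column instead of row order.
data("ToothGrowth")
df <- ToothGrowth

# ASSUMPTION: `id` is a hypothetical identifier; it simply numbers the rows
# within each supp group, because ToothGrowth itself has no such column.
df$id <- ave(seq_len(nrow(df)), df$supp, FUN = seq_along)

# Shuffle the rows so that row order no longer encodes the pairing
set.seed(1)
shuffled <- df[sample(nrow(df)), ]

# Align the two groups on `id` (merge() keeps only ids present in both)
oj <- shuffled[shuffled$supp == "OJ", c("id", "len")]
vc <- shuffled[shuffled$supp == "VC", c("id", "len")]
wide <- merge(oj, vc, by = "id", suffixes = c("_OJ", "_VC"))

# Paired test on the id-aligned data, and on the original row order
res_by_id <- t.test(wide$len_OJ, wide$len_VC, paired = TRUE)
res_by_order <- t.test(df$len[df$supp == "OJ"],
                       df$len[df$supp == "VC"], paired = TRUE)
```

Because the pairs are matched on id, the result no longer depends on how the rows happen to be sorted; here it agrees with the order-based test on the unshuffled data.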

4) Regarding mixed ANOVA with anova_test(), I have the same kind of question as in point 2. Let's take the example you developed here:

library(tidyr)  # provides gather()
data("anxiety", package = "datarium")
anxiety <- anxiety %>%
  gather(key = "time", value = "score", t1, t2, t3) %>%
  convert_as_factor(id, time)
res.aov <- anova_test(
  data = anxiety, dv = score, wid = id,
  between = group, within = time)
get_anova_table(res.aov)

And let's replace the first observation at time = t1 by NA:

anxiety$score[1] <- NA
res.aov <- anova_test(
  data = anxiety, dv = score, wid = id,
  between = group, within = time)
get_anova_table(res.aov)
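
Independently of how anova_test() treats the missing value, the affected subjects can be identified beforehand. The sketch below uses base R on a small made-up data frame (toy is a hypothetical stand-in for the long-format anxiety data, not the datarium dataset) and applies listwise deletion, which is only one possible strategy and not necessarily what the package does.

```r
# Sketch: find subjects with incomplete repeated measures in long-format data.
# ASSUMPTION: `toy` is invented for illustration (columns id, group, time, score).
toy <- data.frame(
  id    = factor(rep(1:4, times = 3)),
  group = factor(rep(c("grp1", "grp1", "grp2", "grp2"), times = 3)),
  time  = factor(rep(c("t1", "t2", "t3"), each = 4)),
  score = c(14.1, 14.5, 15.7, 14.6,
            14.4, 14.6, 15.2, 14.0,
            14.1, 14.3, 14.6, 13.5)
)
toy$score[1] <- NA  # subject 1 is missing its t1 score

# Number of missing scores per subject
n_missing <- tapply(is.na(toy$score), toy$id, sum)
incomplete_ids <- names(n_missing)[n_missing > 0]

# Listwise deletion: drop every row of a subject with any missing score
toy_complete <- droplevels(toy[!toy$id %in% incomplete_ids, ])

print(incomplete_ids)
print(table(toy_complete$id))
```

Comparing the ANOVA table obtained on the complete-case data with the one obtained on the data containing NA is one way to check which of the two behaviours the function follows.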

Is there any imputation to estimate the effect of the within-subject factor time? Would you still do a paired pairwise t-test in this case? If yes:

Thank you for developing this package, this is really appreciated.

Cheers,

Florent