「R」管道统计分析——rstatix使用指南

「R」管道统计分析——rstatix使用指南 by 优雅R

这篇文章是 rstatix 包 README 的介绍，这个包它可以解决我们在使用 tidyverse 进行统计分析和绘图时一些痛点问题。包网址：https://github.com/kassambara/rstatix （点击原文也可以跳转）。如果你对使用的统计分析流程不熟悉，今天的第二篇文章里面的截图可以作为平时使用的参考。

rstatix 包提供了一个与「tidyverse」设计哲学一致的简单且直观的管道友好框架用于执行基本的统计检验，包括 t 检验、Wilcoxon 检验、ANOVA、Kruskal-Wallis 以及相关分析。

每个检验的输出都会自动转换为干净的数据框以便于可视化。

另外也提供了一些用于重塑、重排、操作以及可视化相关矩阵的函数。也包含一些方便因子实验分析的函数，包括 ‘within-Ss’ 设计 (repeated measures), purely ‘between-Ss’ 设计以及 mixed ‘within-and-between-Ss’ 设计。

该包也可以用于计算一些效应值度量，包括 “eta squared” for ANOVA, “Cohen’s d” for t-test and “Cramer’s V” for the association between categorical variables。该包还包含一些用于识别单变量和多变量离群点、评估变异正态性和异质性的帮助函数。

核心函数

描述统计量

get_summary_stats(): Compute summary statistics for one or multiple numeric variables. Can handle grouped data.
freq_table(): Compute frequency table of categorical variables.
get_mode(): Compute the mode of a vector, that is the most frequent values.
identify_outliers(): Detect univariate outliers using boxplot methods.
mahalanobis_distance(): Compute Mahalanobis Distance and Flag Multivariate Outliers.
shapiro_test() and mshapiro_test(): Univariate and multivariate Shapiro-Wilk normality test.

比较均值

t_test(): perform one-sample, two-sample and pairwise t-tests
wilcox_test(): perform one-sample, two-sample and pairwise Wilcoxon tests
sign_test(): perform sign test to determine whether there is a median difference between paired or matched observations.
anova_test(): an easy-to-use wrapper around car::Anova() to perform different types of ANOVA tests, including independent measures ANOVA, repeated measures ANOVA and mixed ANOVA.
kruskal_test(): perform kruskal-wallis rank sum test
tukey_hsd(): performs tukey post-hoc tests. Can handle different inputs formats: aov, lm, formula.
dunn_test(): compute multiple pairwise comparisons following Kruskal-Wallis test.
emmeans_test(): pipe-friendly wrapper arround emmeans function to perform pairwise comparisons of estimated marginal means. Useful for post-hoc analyses following up ANOVA/ANCOVA tests.
get_comparisons(): Create a list of possible pairwise comparisons between groups.
get_pvalue_position: autocompute p-value positions for plotting significance using ggplot2.

促进R的ANOVA计算

factorial_design(): build factorial design for easily computing ANOVA using the car::Anova() function. This might be very useful for repeated measures ANOVA, which is hard to set up with the car package.
anova_summary(): Create beautiful summary tables of ANOVA test results obtained from either car::Anova() or stats::aov(). The results include ANOVA table, generalized effect size and some assumption checks, such as Mauchly’s test for sphericity in the case of repeated measures ANOVA.

比较方差

levene_test(): Pipe-friendly framework to easily compute Levene’s test for homogeneity of variance across groups. Handles grouped data.
box_m(): Box’s M-test for homogeneity of covariance matrices

效应值

cohens_d(): Compute cohen’s d measure of effect size for t-tests.
eta_squared() and partial_eta_squared(): Compute effect size for ANOVA.
cramer_v(): Compute Cramer’s V, which measures the strength of the association between categorical variables.

矫正p值和添加显著性标记

adjust_pvalue(): add an adjusted p-values column to a data frame containing statistical test p-values
add_significance(): add a column containing the p-value significance level

其他

doo(): alternative to dplyr::do for doing anything. Technically it uses nest() + mutate() + map() to apply arbitrary computation to a grouped data frame.
sample_n_by(): sample n rows by group from a table
convert_as_factor(), set_ref_level(), reorder_levels(): Provides pipe-friendly functions to convert simultaneously multiple variables into a factor variable.
make_clean_names(): Pipe-friendly function to make syntactically valid column names (for input data frame) or names (for input vector).

安装和加载

从GitHub上安装开发版本：

if(!require(devtools)) install.packages("devtools")devtools::install_github("kassambara/rstatix")

或者从CRAN安装稳定版本：

install.packages("rstatix")

加载包

library(rstatix)  #> #> 载入程辑包：'rstatix'#> The following object is masked from 'package:stats':#> #>     filterlibrary(ggpubr)  # For easy data-visualization#> 载入需要的程辑包：ggplot2#> 载入需要的程辑包：magrittr

描述统计量

# Summary statistics of some selected variables#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::iris %>%   get_summary_stats(Sepal.Length, Sepal.Width, type = "common")#> # A tibble: 2 x 10#>   variable         n   min   max median   iqr  mean    sd    se    ci#>   <chr>        <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>#> 1 Sepal.Length   150   4.3   7.9    5.8   1.3  5.84 0.828 0.068 0.134#> 2 Sepal.Width    150   2     4.4    3     0.5  3.06 0.436 0.036 0.07
# Whole data frame#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::iris %>% get_summary_stats(type = "common")#> # A tibble: 4 x 10#>   variable         n   min   max median   iqr  mean    sd    se    ci#>   <chr>        <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>#> 1 Petal.Length   150   1     6.9   4.35   3.5  3.76 1.76  0.144 0.285#> 2 Petal.Width    150   0.1   2.5   1.3    1.5  1.20 0.762 0.062 0.123#> 3 Sepal.Length   150   4.3   7.9   5.8    1.3  5.84 0.828 0.068 0.134#> 4 Sepal.Width    150   2     4.4   3      0.5  3.06 0.436 0.036 0.07

# Grouped data#::::::::::::::::::::::::::::::::::::::::::::::::::::::::::iris %>%  group_by(Species) %>%   get_summary_stats(Sepal.Length, type = "mean_sd")#> # A tibble: 3 x 5#>   Species    variable         n  mean    sd#>   <fct>      <chr>        <dbl> <dbl> <dbl>#> 1 setosa     Sepal.Length    50  5.01 0.352#> 2 versicolor Sepal.Length    50  5.94 0.516#> 3 virginica  Sepal.Length    50  6.59 0.636

比较均值

你可以使用 t_test() (parametric) or wilcox_test() (non-parametric，实际比较的是中位数) 比较均值差异。下面使用 t 检验进行示范。

数据

导入样例数据集：

df <- ToothGrowthdf$dose <- as.factor(df$dose)head(df)#>    len supp dose#> 1  4.2   VC  0.5#> 2 11.5   VC  0.5#> 3  7.3   VC  0.5#> 4  5.8   VC  0.5#> 5  6.4   VC  0.5#> 6 10.0   VC  0.5

比较2个独立组别

创建一个带p值的箱线图

# T-teststat.test <- df %>%   t_test(len ~ supp, paired = FALSE) stat.test#> # A tibble: 1 x 8#>   .y.   group1 group2    n1    n2 statistic    df      p#> * <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl>  <dbl>#> 1 len   OJ     VC        30    30      1.92  55.3 0.0606
# Create a box plotp <- ggboxplot(  df, x = "supp", y = "len",   color = "supp", palette = "jco", ylim = c(0,40)  )# Add the p-value manuallyp + stat_pvalue_manual(stat.test, label = "p", y.position = 35)

使用 glue expression 自定义标签：

p +stat_pvalue_manual(stat.test, label = "T-test, p = {p}",                       y.position = 36)

分组数据：在按照「dose」分组后比较 supp 水平：

# Statistical teststat.test <- df %>%  group_by(dose) %>%  t_test(len ~ supp) %>%  adjust_pvalue() %>%  add_significance("p.adj")stat.test#> # A tibble: 3 x 11#>   dose  .y.   group1 group2    n1    n2 statistic    df       p   p.adj#>   <fct> <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl>   <dbl>   <dbl>#> 1 0.5   len   OJ     VC        10    10    3.17    15.0 0.00636 0.0127 #> 2 1     len   OJ     VC        10    10    4.03    15.4 0.00104 0.00312#> 3 2     len   OJ     VC        10    10   -0.0461  14.0 0.964   0.964  #> # … with 1 more variable: p.adj.signif <chr>
# Visualizationggboxplot(  df, x = "supp", y = "len",  color = "supp", palette = "jco", facet.by = "dose",  ylim = c(0, 40)  ) +  stat_pvalue_manual(stat.test, label = "p.adj", y.position = 35)

比较配对样本

# T-teststat.test <- df %>%   t_test(len ~ supp, paired = TRUE) stat.test#> # A tibble: 1 x 8#>   .y.   group1 group2    n1    n2 statistic    df       p#> * <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl>   <dbl>#> 1 len   OJ     VC        30    30      3.30    29 0.00255
# Box plotp <- ggpaired(  df, x = "supp", y = "len", color = "supp", palette = "jco",   line.color = "gray", line.size = 0.4, ylim = c(0, 40)  )p + stat_pvalue_manual(stat.test, label = "p", y.position = 36)

成对比较

成对比较：如果分组变量包含多于2个分类，会自动执行成对比较

# Pairwise t-testpairwise.test <- df %>% t_test(len ~ dose)pairwise.test#> # A tibble: 3 x 10#>   .y.   group1 group2    n1    n2 statistic    df        p    p.adj#> * <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl>    <dbl>    <dbl>#> 1 len   0.5    1         20    20     -6.48  38.0 1.27e- 7 2.54e- 7#> 2 len   0.5    2         20    20    -11.8   36.9 4.40e-14 1.32e-13#> 3 len   1      2         20    20     -4.90  37.1 1.91e- 5 1.91e- 5#> # … with 1 more variable: p.adj.signif <chr># Box plotggboxplot(df, x = "dose", y = "len")+  stat_pvalue_manual(    pairwise.test, label = "p.adj",     y.position = c(29, 35, 39)    )

基于参考组的成对比较

# Comparison against reference group#::::::::::::::::::::::::::::::::::::::::# T-test: each level is compared to the ref groupstat.test <- df %>% t_test(len ~ dose, ref.group = "0.5")stat.test#> # A tibble: 2 x 10#>   .y.   group1 group2    n1    n2 statistic    df        p    p.adj#> * <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl>    <dbl>    <dbl>#> 1 len   0.5    1         20    20     -6.48  38.0 1.27e- 7 1.27e- 7#> 2 len   0.5    2         20    20    -11.8   36.9 4.40e-14 8.80e-14#> # … with 1 more variable: p.adj.signif <chr># Box plotggboxplot(df, x = "dose", y = "len", ylim = c(0, 40)) +  stat_pvalue_manual(    stat.test, label = "p.adj.signif",     y.position = c(29, 35)    )

# Remove bracketggboxplot(df, x = "dose", y = "len", ylim = c(0, 40)) +  stat_pvalue_manual(    stat.test, label = "p.adj.signif",     y.position = c(29, 35),    remove.bracket = TRUE    )

基于总体水平的成对比较

# T-teststat.test <- df %>% t_test(len ~ dose, ref.group = "all")stat.test#> # A tibble: 3 x 10#>   .y.   group1 group2    n1    n2 statistic    df       p   p.adj#> * <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl>   <dbl>   <dbl>#> 1 len   all    0.5       60    20     5.82   56.4 2.90e-7 8.70e-7#> 2 len   all    1         60    20    -0.660  57.5 5.12e-1 5.12e-1#> 3 len   all    2         60    20    -5.61   66.5 4.25e-7 8.70e-7#> # … with 1 more variable: p.adj.signif <chr># Box plot with horizontal mean lineggboxplot(df, x = "dose", y = "len") +  stat_pvalue_manual(    stat.test, label = "p.adj.signif",     y.position = 35,    remove.bracket = TRUE    ) +  geom_hline(yintercept = mean(df$len), linetype = 2)

ANOVA 检验

# One-way ANOVA test#:::::::::::::::::::::::::::::::::::::::::df %>% anova_test(len ~ dose)#> Coefficient covariances computed by hccm()#> ANOVA Table (type II tests)#> #>   Effect DFn DFd    F        p p<.05   ges#> 1   dose   2  57 67.4 9.53e-16     * 0.703
# Two-way ANOVA test#:::::::::::::::::::::::::::::::::::::::::df %>% anova_test(len ~ supp*dose)#> Coefficient covariances computed by hccm()#> ANOVA Table (type II tests)#> #>      Effect DFn DFd     F        p p<.05   ges#> 1      supp   1  54 15.57 2.31e-04     * 0.224#> 2      dose   2  54 92.00 4.05e-18     * 0.773#> 3 supp:dose   2  54  4.11 2.20e-02     * 0.132
# Two-way repeated measures ANOVA#:::::::::::::::::::::::::::::::::::::::::df$id <- rep(1:10, 6) # Add individuals id# Use formula# df %>% anova_test(len ~ supp*dose + Error(id/(supp*dose)))# or use character vectordf %>% anova_test(dv = len, wid = id, within = c(supp, dose))#> ANOVA Table (type III tests)#> #> $ANOVA#>      Effect DFn DFd      F        p p<.05   ges#> 1      supp   1   9  34.87 2.28e-04     * 0.224#> 2      dose   2  18 106.47 1.06e-10     * 0.773#> 3 supp:dose   2  18   2.53 1.07e-01       0.132#> #> $`Mauchly's Test for Sphericity`#>      Effect     W     p p<.05#> 1      dose 0.807 0.425      #> 2 supp:dose 0.934 0.761      #> #> $`Sphericity Corrections`#>      Effect   GGe      DF[GG]    p[GG] p[GG]<.05  HFe      DF[HF]    p[HF]#> 1      dose 0.838 1.68, 15.09 2.79e-09         * 1.01 2.02, 18.15 1.06e-10#> 2 supp:dose 0.938 1.88, 16.88 1.12e-01           1.18 2.35, 21.17 1.07e-01#>   p[HF]<.05#> 1         *#> 2
# Use model as arguments#:::::::::::::::::::::::::::::::::::::::::.my.model <- lm(yield ~ block + N*P*K, npk)anova_test(.my.model)#> Coefficient covariances computed by hccm()#> Note: model has aliased coefficients#>       sums of squares computed by model comparison#> ANOVA Table (type II tests)#> #>   Effect DFn DFd      F     p p<.05   ges#> 1  block   4  12  4.959 0.014     * 0.623#> 2      N   1  12 12.259 0.004     * 0.505#> 3      P   1  12  0.544 0.475       0.043#> 4      K   1  12  6.166 0.029     * 0.339#> 5    N:P   1  12  1.378 0.263       0.103#> 6    N:K   1  12  2.146 0.169       0.152#> 7    P:K   1  12  0.031 0.863       0.003#> 8  N:P:K   0  12     NA    NA  <NA>    NA

ixxmu / mp_duty