In the last post we discussed why some p-values, especially adjusted p-values, come out as NA in DESeq2 results. For multiple-testing correction we usually rely on the FDR (Benjamini–Hochberg, BH) method, but the correction methods we routinely use make demands on the data distribution and the sample size, while the expression matrices we obtain from high-throughput sequencing typically have few samples and many features, so it is easy to lose statistical rigor when applying these tools. Clipper, developed by Jingyi Jessica Li's group in the Department of Statistics at UCLA, lets users control the false discovery rate in high-throughput data analysis directly, without computing p-values or assuming a parametric model, making the results more reliable. In this post we start from FDR in differential expression analysis and work our way to this tool.
Many of our previous posts have covered methods for multiple-testing correction, including how to compute the FDR:
Understanding the differences between P-values, multiple comparisons, FDR, and Q-values (a previous post)
First, we need to understand the basic assumptions behind differential expression analysis:
### Here we use a dataset from a Harvard statistics course
library(devtools)
install_github("genomicsclass/GSE5859Subset")
## load the packages and the data
library(qvalue)
library(GSE5859Subset)
data(GSE5859Subset)
### inspect the data
dim(geneExpression)
Let's look at a single gene:
g<- sampleInfo$group
e<- geneExpression[25,]
e
# the t-test assumes expression values are normally distributed
# check the normality assumption with Q-Q plots
qqnorm(e[g==1])
qqline(e[g==1])
qqnorm(e[g==0])
qqline(e[g==0])
Run a t-test on this gene:
t.test(e[g==1], e[g==0])
Welch Two Sample t-test
data: e[g == 1] and e[g == 0]
t = 0.28382, df = 21.217, p-value = 0.7793
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.1431452 0.1884244
sample estimates:
mean of x mean of y
10.52505 10.50241
Run a t-test on every gene:
mytest<- function(x) t.test(x[g==1], x[g==0], var.equal=T)$p.value
pvals<- apply(geneExpression, 1, mytest)
sum(pvals< 0.05) # count genes with p < 0.05
[1] 1383
Look at the p-value distribution:
hist(pvals)
Simulating multiple comparisons with random data
m<- nrow(geneExpression)
n<- ncol(geneExpression)
# generate random numbers
randomData<- matrix(rnorm(n*m), m, n)
nullpvalues<- apply(randomData, 1, mytest)
hist(nullpvalues)
We generated the data at random, so there are no truly differentially expressed genes, yet we see p-values spread almost uniformly between 0 and 1. This is exactly what the null hypothesis predicts, which means roughly 5% of the tests will fall below 0.05 purely by chance.
Controlling false positives in multiple comparisons is therefore essential.
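To make the scale of the problem concrete, we can count how many null p-values pass the naive cutoff (a quick check of our own):
sum(nullpvalues < 0.05)  # expect roughly 0.05 * 8793 ≈ 440 false positives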
Common approaches:
Bonferroni correction: divide the significance threshold by the number of comparisons m (equivalently, multiply each p-value by m to get the adjusted p-value). This is very conservative and is typically used in genome-wide association studies (GWAS).
False discovery rate, e.g. FDR = 0.05: among all discoveries, 5% are expected to be false positives.
One procedure that controls the FDR (Benjamini–Hochberg step-up): let k be the largest i such that p(i) <= (i/m) * alpha (m is the number of comparisons), then reject H(i) for i = 1, 2, …, k. Both approaches are sketched in code right below.
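A minimal sketch of both rules in R (our own illustration; pvals is the vector computed above):
# Bonferroni, in its two equivalent forms
sum(pvals < 0.05 / length(pvals))                  # shrink the threshold: alpha / m
sum(p.adjust(pvals, method="bonferroni") < 0.05)   # or inflate the p-values: p * m
# BH step-up: reject H(1)..H(k), where k is the largest i with p(i) <= (i/m) * alpha
bh_reject <- function(p, alpha = 0.05) {
  m <- length(p)
  o <- order(p)                                 # sort p-values ascending
  passed <- p[o] <= seq_len(m) / m * alpha      # compare each p(i) to its own threshold
  k <- if (any(passed)) max(which(passed)) else 0
  rejected <- logical(m)
  if (k > 0) rejected[o[seq_len(k)]] <- TRUE    # reject the k smallest
  rejected
}
sum(bh_reject(pvals))  # should match sum(p.adjust(pvals, "BH") < 0.05) computed below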
The plots below make this easier to understand:
## order the pvals computed above and plot them
alpha = 0.05
m = length(pvals)
# m is the number of comparisons: 8793
plot(x=seq(1,100), y=pvals[order(pvals)][1:100])
abline(a=0, b=alpha/m) # y = (alpha/m) * x (x = 1,2,3...)
title("slope is alpha/m")
# zoom in on the 15 smallest p-values
plot(x=seq(1,100), y=pvals[order(pvals)][1:100], xlim=c(1,15))
abline(a=0, b=alpha/m)
title("slope is alpha/m")
We can see that the 14th smallest p-value exceeds its own threshold, computed as (0.05/m) * 14 = 7.960878e-05:
# current threshold
(0.05/m)*14
[1] 7.960878e-05
# current p-value
pvals[order(pvals)][14] # larger than the threshold
202556_s_at
8.503827e-05
We now use the p.adjust function with method "fdr" (alias "BH") to adjust the p-values. Rearranging the BH condition:
p(i) <= (i/m) * alpha
p(i) * (m/i) <= alpha
p_adjusted(i) = p(i) * (m/i), where p(i) is the i-th smallest raw p-value and m is the number of tests.
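Strictly speaking, p.adjust also takes a cumulative minimum from the largest p-value downward so that adjusted p-values stay monotone; for the smallest p-values here the products p(i) * (m/i) happen to be increasing already, so the simple formula matches exactly, as the table below shows. A sketch of ours reproducing p.adjust's BH output in full:
# manual BH adjustment, including the monotonicity (cumulative minimum) step
m <- length(pvals)
o <- order(pvals)
p_bh <- numeric(m)
p_bh[o] <- rev(cummin(rev(pvals[o] * m / seq_len(m))))
all.equal(p_bh, unname(p.adjust(pvals, method = "BH")))  # should be TRUE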
sum( p.adjust(pvals, method="fdr") < 0.05 ) # 13
sum( p.adjust(pvals, method="BH") < 0.05 ) # 13
consistent with what we saw in the plot:
adj.pvals <- p.adjust(pvals, method="fdr")
compare_list <- data.frame(org.pvalue = pvals[order(pvals)][1:13],
adj.pvalue = adj.pvals[order(adj.pvals)][1:13])
# p_adjusted(i) = p(i) * (m/i)
i <- seq(1:13)
man.adj.pvals <- pvals[order(pvals)][1:13] * (m/i)
man.adj.pvals
compare_list$man.adj.pvalue <- man.adj.pvals
compare_list
org.pvalue    adj.pvalue    man.adj.pvalue
201909_at 8.453679e-23 7.433320e-19 7.433320e-19
214218_s_at 5.429425e-20 2.387047e-16 2.387047e-16
206700_s_at 1.044354e-19 3.061001e-16 3.061001e-16
205000_at 3.691745e-19 8.115378e-16 8.115378e-16
204409_s_at 1.224535e-16 2.153468e-13 2.153468e-13
206624_at 5.053589e-12 7.406034e-09 7.406034e-09
203974_at 9.187018e-09 1.154021e-05 1.154021e-05
200933_x_at 1.385484e-08 1.522820e-05 1.522820e-05
201018_at 6.702439e-07 6.548283e-04 6.548283e-04
203992_s_at 3.209897e-06 2.822463e-03 2.822463e-03
204061_at 1.293600e-05 1.034056e-02 1.034056e-02
211149_at 3.493280e-05 2.559701e-02 2.559701e-02
201029_s_at 3.838790e-05 2.596499e-02 2.596499e-02
p.adjust.methods
[1] "holm" "hochberg" "hommel" "bonferroni"
[5] "BH" "BY" "fdr" "none"
sum( p.adjust(pvals, method="bonferroni") < 0.05 ) # 10 很保守
References:
Clipper: p-value-free FDR control on high-throughput data from two conditions
Sloan Research Fellow Jingyi Jessica Li: AI+X is not always effective; biological data are small and noisy, and interpretability is key (in Chinese)
Jingyi Jessica Li's group proposes Clipper, a statistical method for enrichment and differential analysis of high-throughput data that controls the false discovery rate without computing p-values (in Chinese)
Genome Biology | The Jingyi Jessica Li and Wei Li groups report that popular differential expression analysis software has extremely high false discovery rates on population-level data (in Chinese)
Introduction to Clipper: p-value-free FDR control on high-throughput data from two conditions
https://github.com/JSB-UCLA/Clipper/blob/master/vignettes/
Although the statistical methods above are widely used, a common problem is that biological datasets often have very small sample sizes, and many computational methods developed for new data types cannot, or can only with difficulty, produce correct p-values. Jingyi Jessica Li's group found that the nonparametric rank-sum test performed best in their evaluation, and that one reason methods built on parametric model assumptions, such as DESeq2 and edgeR, fail to control the false discovery rate well is that real population-level data do not follow their assumed distributions. They therefore proposed a new computational method that lets users control false positives in high-throughput data analysis directly, without computing p-values.
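For reference, this is what the rank-based alternative looks like on the expression matrix from the first section (a sketch of our own; geneExpression and g come from the GSE5859Subset example above, and ties will trigger warnings about exact p-values):
# Wilcoxon rank-sum test per gene: no normality assumption
wilcox_pvals <- apply(geneExpression, 1, function(x)
  wilcox.test(x[g == 1], x[g == 0])$p.value)
sum(p.adjust(wilcox_pvals, method = "BH") < 0.05)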
Clipper's advantage is that it makes no parametric assumptions about the data distribution, so it suits small sample sizes, sidesteps the difficulties of computing p-values, and saves the time that p-value calculation would take.
According to the paper, Clipper applies to several high-throughput data analysis scenarios.
Here we pick the transcriptomics staple, "DEG analysis from RNA-seq data", for our walkthrough:
https://github.com/JSB-UCLA/Clipper/blob/master/vignettes/Clipper.pdf
library(devtools)
install_github("JSB-UCLA/Clipper")
library(Clipper)
The code below uses the package's built-in datasets, so nothing extra needs to be downloaded.
Test dataset: there are 10,000 features and the first 1,000 features are interesting.
exp_e[c(1:3, 1001:1003), ]
back_e[c(1:3, 1001:1003), ]
[,1] [,2] [,3]
[1,] 4.9914685 5.26705194 3.945790
[2,] 4.8837421 4.65166005 6.889625
[3,] 5.0160066 5.00388853 4.830051
[4,] -0.3823919 1.70146747 -0.335157
[5,] -0.9822577 0.75032253 1.722063
[6,] -0.3423613 -0.06262634 1.129383
[,1] [,2] [,3]
[1,] -0.6264538 0.18364332 -0.8356286
[2,] 1.5952808 0.32950777 -0.8204684
[3,] 0.4874291 0.73832471 0.5757814
[4,] 0.7391149 0.38660873 1.2963972
[5,] -0.8035584 -1.60262567 0.9332510
[6,] 1.8060893 -0.05650363 1.8859113
re1 <- Clipper(score.exp = exp_e, score.back = back_e,
analysis = "enrichment",
FDR = c(0.01, 0.05, 0.1))
names(re1)
[1] "contrast.score" "contrast.score.value"
[3] "FDR" "contrast.score.thre"
[5] "discoveries" "q"
Results: Clipper returns a list whose component discoveries is a list of index vectors of the identified interesting features, one per FDR threshold:
#indices of identified interesting genes with the second input FDR threshold (0.05)
re1$discoveries[[2]][1:5]
[1] 1 2 3 4 5
lapply(re1$discoveries, length)
[[1]]
[1] 971
[[2]]
[1] 1041
[[3]]
[1] 1104
re1$FDR
[1] 0.01 0.05 0.10
We can then calculate the resulting false discovery proportion (FDP) and power:
#FDP (the first 1000 genes are true positives)
trueid <- 1:1000
sum(!re1$discoveries[[1]] %in% trueid)
[1] 5
length(re1$discoveries[[1]])
[1] 971
sum(!re1$discoveries[[1]] %in% trueid)/length(re1$discoveries[[1]])
[1] 0.005149331
#power
sum(re1$discoveries[[1]] %in% trueid)/length(trueid)
[1] 0.966
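Since we will repeat this FDP/power calculation for the differential analysis below, here is a small helper of our own (not part of Clipper) that computes both quantities for every FDR threshold at once:
# FDP = false discoveries / all discoveries; power = true discoveries / all true features
fdp_power <- function(discoveries, trueid) {
  t(sapply(discoveries, function(d) c(
    FDP   = sum(!d %in% trueid) / length(d),
    power = sum(d %in% trueid) / length(trueid)
  )))
}
fdp_power(re1$discoveries, trueid = 1:1000)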
Test dataset: there are 10,000 features and the first 2,000 features are interesting.
exp_d[c(1:3, 1001:1003), ]
back_d[c(1:3, 1001:1003), ]
replicates1 replicates2 replicates3
gene1 40 43 29
gene2 27 46 39
gene3 33 27 35
gene1001 6 9 4
gene1002 2 1 2
gene1003 1 3 5
replicates1 replicates2 replicates3
gene1 19 28 27
gene2 29 18 29
gene3 17 17 13
gene1001 24 28 23
gene1002 15 16 18
gene1003 8 12 6
re2 <- Clipper(score.exp = exp_d, score.back = back_d, analysis = "differential",FDR = c(0.01, 0.05, 0.1))
names(re2)
[1] "contrast.score" "contrast.score.value"
[3] "FDR" "contrast.score.thre"
[5] "discoveries" "q"
Results: Clipper again returns a list whose component discoveries is a list of index vectors of the identified interesting features, one per FDR threshold:
#indices of identified interesting genes with the second input FDR threshold (0.05)
re2$discoveries[[2]][1:5]
[1] 1 2 3 5 6
We can then calculate the resulting false discovery proportion (FDP) and power:
#FDP (the first 2000 genes are true positives)
trueid <- 1:2000
sum(!re2$discoveries[[1]] %in% trueid)/length(re2$discoveries[[1]])
[1] 0.004234298
#power
sum(re2$discoveries[[1]] %in% trueid)/length(trueid)
[1] 0.7055
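The same fdp_power() helper defined above gives both numbers in one call:
fdp_power(re2$discoveries, trueid = 1:2000)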
Users are advised to normalize and log-transform read count matrices before inputting them to Clipper. The following example uses edgeR to preprocess the raw read count matrices; users can substitute their preferred normalization method.
library(edgeR)
dataurl = 'http://www.stat.ucla.edu/~jingyi.li/data/Clipper/'
count1 = readRDS(url(paste0(dataurl, 'simcount1.rds')))
count2 = readRDS(url(paste0(dataurl, 'simcount2.rds')))
############ use edgeR to normalize ############
r1 = ncol(count1) # 3
r2 = ncol(count2) # 3
cond_idx = rep(2, r1 + r2)
cond_idx[1:r1] = 1
cond_idx
dat = cbind(count1, count2)
cond_idx = factor(cond_idx)
y <- DGEList(counts=dat,group=cond_idx)
y <- calcNormFactors(y)
count_norm = cpm(y)
Next we apply Clipper to the preprocessed gene expression matrices. Log-transformation is recommended because empirical evidence indicates that log-transformed read counts are more likely to satisfy Clipper's assumption for FDR control than untransformed counts.
##### apply clipper to log2-transformed matrices with target FDR = 0.05
re_clipper = Clipper(
score.exp = log(base = 2, count_norm[,1:r1] + 1),
score.back = log(base = 2, count_norm[,-(1:r1)] + 1),
FDR = 0.05,
analysis = "differential")
head(re_clipper$discoveries[[1]]) ###### identified DEG indices at FDR = 0.05
[1] 676 2173 3103 3105 3484 3490
Manually inspect the contrast-score threshold and the scores of the discoveries that passed it:
re_clipper$contrast.score.thre
re_clipper$contrast.score.value[re_clipper$discoveries[[1]]]
[1] 6.122079
[1] 6.226567 6.754237 7.691055 7.557135 7.713961 6.427339
[7] 6.146831 6.644286 6.122079 6.131032 8.234947 7.585539
[13] 6.404031 6.796154 7.091714 6.388610 10.743190 7.561171
[19] 6.695440 6.452022 9.600434
Comparing against an edgeR analysis
# edgeR
#1 expression matrix
#2 experimental design
design <- model.matrix(~0+rev(cond_idx))
design
rownames(design) <- colnames(dat)
colnames(design) <- levels(factor(cond_idx))
design
#3 estimate dispersions for the GLM
DEG <- estimateGLMCommonDisp(y,design)
DEG <- estimateGLMTrendedDisp(DEG, design)
DEG <- estimateGLMTagwiseDisp(DEG, design)
#4 fit the GLM
fit <- glmFit(DEG, design)
lrt <- glmLRT(fit, contrast=c(1,-1))
#5 extract and filter the differential expression results
DEG_edgeR <- as.data.frame(topTags(lrt, n=nrow(DEG)))
head(DEG_edgeR)
# set thresholds and flag significantly up-/down-regulated genes
fc <- 2.0
p <- 0.05
DEG_edgeR$regulated <- ifelse(DEG_edgeR$logFC>log2(fc)&DEG_edgeR$FDR<p,"up",
ifelse(DEG_edgeR$logFC<(-log2(fc))&DEG_edgeR$FDR<p,"down","normal"))
Comparing the results:
Venn diagram
# edgeR
table(DEG_edgeR$regulated) # 121 DEGs
down normal up
65 52255 56
# clipper
length(re_clipper$discoveries[[1]])
[1] 21
edgeR_DEGs <- rownames(DEG_edgeR[DEG_edgeR$regulated != "normal",])
clipper_DEGs <- rownames(dat[re_clipper$discoveries[[1]],])
## draw the Venn diagram
library(VennDiagram)
venn.diagram(list( edgeR=edgeR_DEGs,
clipper=clipper_DEGs),
fill=c("red","blue"),imagetype = "png",
main.cex = 2,pattern = 'Euler_3set_scaled',
filename="./Venn.png")
Scatter plots
# scatter plots
# rank correlations
# overlapping genes: per the Venn diagram, these are exactly Clipper's discoveries
library(tidyverse)
overlap_genes_df <- data.frame(genes = clipper_DEGs)
overlap_genes_df <- overlap_genes_df %>%
mutate(logfc_rank = rank(-abs(DEG_edgeR$logFC[match(.$genes,rownames(DEG_edgeR))])), # |logFC|, largest first
edgeR_Pvalue_rank = rank(DEG_edgeR$PValue[match(.$genes,rownames(DEG_edgeR))]), # smallest first
clipper_contrast_score_rank = rank(-re_clipper$contrast.score.value[re_clipper$discoveries[[1]]])) # largest first
library(ggplot2)
# basic scatterplot
scatter_plot1 <- ggplot(overlap_genes_df, aes(x=edgeR_Pvalue_rank, y=logfc_rank)) +
geom_point()
scatter_plot1+labs(caption = str_c("Spearman Cor",cor(overlap_genes_df$logfc_rank,overlap_genes_df$edgeR_Pvalue_rank),sep = "="))
library(ggplot2)
# basic scatterplot
scatter_plot1 <- ggplot(overlap_genes_df, aes(x=clipper_contrast_score_rank, y=logfc_rank)) +
geom_point()
scatter_plot1+labs(caption = str_c("Spearman Cor",cor(overlap_genes_df$logfc_rank,overlap_genes_df$clipper_contrast_score_rank),sep = "="))
We can still see that the |logFC| ranks correlate a bit more strongly with Clipper's contrast scores than with edgeR's p-values, though with only 21 overlapping genes the difference is small.
Below are the corresponding results from the authors' paper:
The authors compared Clipper with commonly used methods along several dimensions. Here we have mainly introduced the tool with its built-in datasets and run a simple sanity check of the results; if you are interested, take your own data and follow the original paper to benchmark it in your application area.
Sometimes, as above, normalization is needed, or batch effects must be handled. In that case you can still run the usual DESeq2 or edgeR analysis first and then use Clipper to control the FDR.
The project repository also contains an example document for this workflow:
https://github.com/JSB-UCLA/Clipper/blob/master/vignettes/DEG_extra_covariate.pdf
Unfortunately, the example file the author provided there is no longer accessible:
http://jsb1data.stat.ucla.edu/repository/xinzhou/example_data.rds
but we can still walk through the example code (shown as screenshots in the original post) to see how it works.
First, prepare your own dataset, along with the grouping of interest and the batch variable. Because the example data were generated with scDesign3, the true DEGs are known, which lets us compute FDR and power later.
Several wrapper functions: run differential analysis with DESeq2 and edgeR (the batch effect is not removed beforehand as discussed earlier, but included as part of the design), select DEGs, and compute FDR and power.
Apply these functions to the dataset to obtain the results.
Next, the vignette introduces two different ways to create a null dataset, on top of which Clipper is brought in to combine the original and null datasets and control the FDR. Before using Clipper, you also need to check whether its key assumption holds: the input scores (here, -log-transformed p-values) from the two datasets (original and null) should follow the same distribution for all non-DEGs. If this assumption is violated, an additional normalization step is required.
a null dataset is a dataset where the expressions of the genes are from the same distribution as the original dataset, and the covariate of interest (cell type in this example) does not have any effect on the expression.
The first approach reuses the null dataset already generated with scDesign3; the second builds a null dataset yourself via permutation, as sketched below.
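Since the example file is unavailable, here is a hedged reconstruction of the permutation route based only on the description above, not on the vignette's actual code. It reuses dat and cond_idx from the edgeR section; the helper run_edgeR_p is our own, and the choice of analysis = "enrichment" (true DEGs should have larger -log10 p-values in the original data than in the null) is our assumption:
# run a standard edgeR pipeline and return per-gene p-values
run_edgeR_p <- function(counts, group) {
  y <- DGEList(counts = counts, group = group)
  y <- calcNormFactors(y)
  design <- model.matrix(~ group)
  y <- estimateDisp(y, design)
  fit <- glmFit(y, design)
  glmLRT(fit, coef = 2)$table$PValue
}
set.seed(1)
p_orig <- run_edgeR_p(dat, cond_idx)          # original labels
p_null <- run_edgeR_p(dat, sample(cond_idx))  # permuted labels -> null dataset
# feed -log10 p-values from the original and null datasets to Clipper;
# remember to check first that the two score sets look alike for non-DEGs
score_o <- -log10(pmax(p_orig, .Machine$double.xmin))  # guard against p = 0
score_n <- -log10(pmax(p_null, .Machine$double.xmin))
re_cov <- Clipper(score.exp = score_o, score.back = score_n,
                  analysis = "enrichment", FDR = 0.05)
head(re_cov$discoveries[[1]])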
Besides this R package, the authors also built an online web app to make the tool easier to use:
https://app.superbio.ai/apps/108
https://mp.weixin.qq.com/s/Hp6OIv7WJsn28QczXh7mHQ