富集分析的p值是怎么算出来的？

富集分析的p值是怎么算出来的？ by YuLabSMU

很久之前写过的文章，被挖出来，那么我也就顺道搬运过来了。

经常看到一些饼图，描述某些事物的组成，比如说有钱人的学历分布，然后我们可以看到高学历所占比例并不高，根据这个比例下结论通常是错的，这些比例说明不了问题，如果把各种学历在总体人口中的分布做为背景进行考虑的话，你就会发现学历还是有点用的。当我们用组学测定了一大堆分子之后，我们希望站在更高的角度去看这些分子和那些生物学过程相关。那么通常各种注释，对这些基因/蛋白进行分类，那么从分类的比例上，是不能草率下结论，正如上面有钱人学历分布的例子一样。我们需要把总体的分布考虑进去。和某个注释/分类是否有相关性，把基因分成属于这一类，和不属于这一类两种，这就好比经典统计学中的白球和黑球的抽样问题。也可以列一个２x２的表，进行独立性分析。以文章GeneExpression in Ovarian Cancer Reflects Both Morphology and BiologicalBehavior, Distinguishing Clear Cell from Other Poor-Prognosis OvarianCarcinomas^[1]所鉴定的差异基因为例。

73个差异基因的Symbol，我把它转为 entrezgeneID得到57个（漏掉的不管它，只是做为一个例子）:

> eg [1] "7980"   "3081"   "3162"   "3059"   "1545"   "1917"   "6696"   "5797"   [9] "6648"   "10397"  "6781"   "5817"   "1282"   "1284"   "6948"   "7077"  [17] "5744"   "8566"   "1368"   "1474"   "11015"  "3306"   "728441" "2678"  [25] "4286"   "3929"   "5095"   "2064"   "1428"   "6590"   "3569"   "2745"  [33] "3912"   "978"    "5950"   "6539"   "9445"   "5004"   "9971"   "7453"  [41] "2719"   "1410"   "1311"   "4653"   "4162"   "5358"   "3484"   "3486"  [49] "2261"   "307"    "1672"   "4837"   "22795"  "486"    "4118"   "3915"  [57] "10140"

下面测试一下这些基因和化学刺激响应的相关性。

goid <- “GO:0042221” # response to chemical stimulus

那么做为背景，总体基因为N，属于“化学刺激响应”这个分类的基因有M个。

allgeneInCategory <- unique(get(goid, org.Hs.egGO2ALLEGS))M <- length(allgeneInCategory)N <- length(mappedkeys(org.Hs.egGO))

样本的大小是n，属于“化学刺激响应”这个分类的基因有k个。

n <- length(eg)k <- sum(eg %in% allgeneInCategory)

白球黑球问题，最简单的莫过于用二项式分布，从总体上看，要拿到一个基因属于“化学刺激响应”这个分类的概率是M/N。那么现在抽了n个基因，里面有k个基于这个分类，p值为：

> 1-sum(sapply(0:k-1, function(i) choose(n,i) * (M/N)^i * (1-M/N)^(n-i)))[1] 8.561432e-10

二项式分布，是有放回的抽样，你可以多次抽到同一基因，这是不符合的。所以这个计算只能说是做为近似的估计值，无放回的抽样，符合超几何分布，通过超几何分布的计算，p值为：

> phyper(k-1,M, N-M, n, lower.tail=FALSE)[1] 7.879194e-10

如果用2x2表做独立性分析，表如下：

> d <- data.frame(gene.not.interest=c(M-k, N-M-n+k), gene.in.interest=c(k, n-k))> row.names(d) <- c("In_category", "not_in_category")> d                gene.not.interest gene.in.interestIn_category                  2613               28not_in_category             15310               29

这个也有很多方法可以做检验，经典的有卡方检验和fisher's exact test。如果用卡方检验来做。

> chisq.test(d, )
    Pearson's Chi-squared test with Yates' continuity correction
data:  d X-squared = 51.3849, df = 1, p-value = 7.592e-13

对于２x２表来说，卡方检验通常也只能做为近似估计值，特别是当samplesize或expected all count比较小的时候，计算并不准确。fisher's exacttest，名副其实，真的就比较exact，因为它使用的是超几何分布来计算p值。这也是为什么fisher'sexact test和超几何模式计算的p-值是一样的，

> fisher.test(d)
    Fisher's Exact Test for Count Data
data:  d p-value = 7.879e-10alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 0.1013210 0.3089718 sample estimates:odds ratio  0.1767937

通常各种软件做GO富集性分析，都是使用超几何分布进行计算。IPA软件则是使用fisher's exact test来检验基因在某个网络中是否富集。从这个例子上看，chi-squaredtest的表现最差，binomial做为p值的近似估计还是不错的，因为计算较为简单。富集性分析应用范围非常广，从Disease Ontology^[2],Gene Ontology, KEGG^[3],到Reactome Pathway^[4]等等。

补充说明一下，超几何分布是偏态的，所以fisher.test默认的双侧检验，其实是使用单侧来计算p值，我们可以对上面的数据进行测试，使用双侧和单侧的p值是一样的，不过会影响对置信区间的估计。

> fisher.test(d, alternative="two.sided")

    Fisher's Exact Test for Count Data

data:  d
p-value = 7.879e-10
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.1013210 0.3089718
sample estimates:
odds ratio 
 0.1767937 

> fisher.test(d, alternative="less")

    Fisher's Exact Test for Count Data

data:  d
p-value = 7.879e-10
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
 0.000000 0.283761
sample estimates:
odds ratio 
 0.1767937

再继续补充一点，clusterProfiler不会欺骗你！当我写下“很多软件会干错误的事情，给你更低的p值，但clusterProfiler不会”，自豪感满满！

References

[1] Gene Expression in Ovarian Cancer Reflects Both Morphology and Biological Behavior, Distinguishing Clear Cell from Other Poor-Prognosis Ovarian Carcinomas: http://cancerres.aacrjournals.org/content/62/16/4722
[2] Disease Ontology: http://bioconductor.org/packages/devel/bioc/html/DOSE.html
[3] Gene Ontology, KEGG: http://bioconductor.org/packages/devel/bioc/html/clusterProfiler.html
[4] Reactome Pathway: http://bioconductor.org/packages/devel/bioc/html/ReactomePA.html

ixxmu / mp_duty

富集分析的p值是怎么算出来的？ #1285

富集分析的p值是怎么算出来的？ by YuLabSMU