Closed ixxmu closed 1 year ago
2021年NC发文《Confronting false discoveries in single-cell differential expression》,评测了当前单细胞转录组数据差异分析的14种方法,例如pseudobulks,Wilcox,DESeq2和MAST等。文章指出pseudobulks方法要优于其他single-cell分析方法,并指出现在的很多发表的差异分析方法是错误的,会有太多的假阳性。
根据我的理解来说的话,由于单细胞数量远大于传统Bulk RNA-seq的样本量,因此Limma和DESeq2是非常不适合单细胞数据的,极易造成假阳性问题,具体原理可参考熊写的:熊读文献|别再用DEseq2和edgeR进行大样本差异表达基因分析了。所以我是推荐使用FindAllMarkers()函数的Wilcox.test算法。至于MAST算法,本身细胞量多的话FindAllMarkers()函数的Wilcox.test算法已经很花资源和时间了,MAST算法固然是会提升差异基因的准确性,但是会进一步消耗极大的时间和资源。关于pseudobulks的话,非常合适当前单细胞大样本量的数量情况,一是能够加速运算,二是大大减少假阳性问题。难点在于代码层面实现比较困难,一是要得到指定分组和/或指定细胞类型的pseudobulks表达矩阵,二是进行类似Bulk RNA-seq的Limma或DESeq2差异分析。具体来说,pseudobulk差异分析需要足够的两两比较的样本数量,否则结果会出现较强的假阴性现象。如果用户的样本量不大时,一是建议使用传统的FindAllMarkers()函数的Wilcox.test算法;二是使用下文的MetaCell结合DESeq2差异分析。
在此,本文详细介绍如何实现pseudobulks,DESeq2及MetaCell,包装了3个函数,以方便用户使用。本文目录包括:
值得介绍的是MetaCell算法,张泽民老师的文章有提到:To gain a concise description of the observed phenotypic diversity, we used our SEACells algorithm (31) (see methods) to calculate metacells separately within each sample. The metacell approach (24) aggregates similar single cells into a set of distinct, highly granular cell states such that any differences between cells within a metacell are likely technical rather than biological. Aggregating counts within each metacell generates a robust , full transcriptomic quantification for each cell state, thus mitigating noise and sparsity in scRNA-seq data. An added benefit of metacells is the enumeration of observed cell states that are more amenable to comparison.
这个MetaCell的观念就是,把相似的细胞看做一类,以“浓缩”后的矩阵进行下游分析。MetaCell结合DESeq2差异分析找Markers或Signature的方法也非常新颖。此外,MetaCell之后的表达矩阵可以接harmony,CCA等整合分析,张老师的文章证明这种策略可以更好的整合单细胞大样本数据。
拓展阅读:
library(Seurat)
library(readr)
library(dplyr)
library(stringr)
#### 1.读入数据
seurat.data = read_rds("./pseudoBulk_test.data.rds")
Idents(seurat.data) = "celltype"
DimPlot(seurat.data)
table(seurat.data$celltype)
# T_cell NK Myeloid B_cell Platelet Hpc
# 1500 1500 1500 1500 244 20
table(seurat.data$orig.ident)
# A1 CT3 HC2 P2 SC4
# 1918 949 1328 1048 1021
seurat.data包5个样本,5种细胞类型。
这里我针对pseudobulks分析写了2个自编函数,保存在custom_seurat_functions_for_pseudobulks.R
文件中:
测试数据、自编函数和custom_seurat_functions_for_pseudobulks.R见文末。
source("custom_seurat_functions_for_pseudobulks.R")
# By samples
res.pseudo = pseudoBulk(Seurat.data = seurat.data,
group = "orig.ident",
celltype = NA )
head(res.pseudo)
meta.data = data.frame(sampleID = colnames(res.pseudo),
row.names = colnames(res.pseudo))
列名是样本名称。
res.pseudo2 = pseudoBulk(Seurat.data = seurat.data,
group = "orig.ident",
celltype = "celltype" )
head(res.pseudo2)
meta.data2 = data.frame(sampleID = str_split(names(res.pseudo2),"\\.",simplify = T)[,1],
celltype = str_split(names(res.pseudo2),"\\.",simplify = T)[,2],
row.names = colnames(res.pseudo2))
列名是样本名称+细胞类型
head(meta.data2)
# sampleID celltype
# HC2.NK HC2 NK
# HC2.Myeloid HC2 Myeloid
# HC2.B_cell HC2 B_cell
# HC2.T_cell HC2 T_cell
# HC2.Platelet HC2 Platelet
# HC2.Hpc HC2 Hpc
根据单细胞得到的pseudobulks表达矩阵,本质上类似Bulk RNA-seq的表达矩阵,因此推荐使用DESeq2进行差异分析,这里我写了一个DESeq2差异分析函数,便于进行差异分析:
get.DESeq2 = function(input.data, #输入目标表达矩阵
group_list, #设置分组,因子数据
contrast = NA,
NA.remove = TRUE, #DESeq2会产生NA值, NA.remove = TRUE移除NA
...){
if(is.na(contrast)){
contrast = rev(levels(group_list))
}
# 第一步,构建DESeq2的DESeq对象
colData <- data.frame(row.names=colnames(input.data),group_list=group_list)
dds <- DESeqDataSetFromMatrix(countData = input.data,colData = colData,design = ~ group_list)
# 第二步,进行差异表达分析
dds2 <- DESeq(dds)
# 提取差异分析结果,trt组对untrt组的差异分析结果
table(group_list)
tmp <- results(dds2,contrast=c("group_list",contrast))
# tmp <- results(dds2)
DEG_DESeq2 <- as.data.frame(tmp[order(tmp$padj),])
head(DEG_DESeq2)
if(NA.remove){
# 去除差异分析结果中包含NA值的行
DEG_DESeq2 = na.omit(DEG_DESeq2)
}
DEG_DESeq2$Gene = row.names(DEG_DESeq2)
return(DEG_DESeq2)
}
当然,上面的DESeq2函数也适用于Bulk RNA-seq数据。
> head(res.pseudo)
HC2 P2 A1 SC4 CT3
AL627309.1 3 3 2 0 0
AL669831.5 41 27 56 0 0
LINC00115 32 18 35 1 2
FAM41C 15 2 19 7 4
NOC2L 242 139 556 63 53
KLHL17 17 7 19 2 1
假设样本HC2和P2是一组,需要与A1、SC4、CT3进行比较:
input.data = res.pseudo
group_list = ifelse(colnames(input.data) %in% c("HC2","P2"),"case","ctrl") %>%
factor(levels = c("ctrl","case"))
一行代码完成差异分析:
example_1 = get.DESeq2(input.data = input.data,
group_list = group_list,
NA.remove = T)
head(example_1)
# baseMean log2FoldChange lfcSE stat pvalue padj Gene
# COX1 11338.188 -30.00000 4.407417 -6.806708 9.985702e-12 8.540771e-08 COX1
# COX3 7380.949 -29.49563 4.407420 -6.692265 2.197419e-11 9.397263e-08 COX3
# ND4 4886.638 -28.79662 4.407426 -6.533659 6.418179e-11 1.028435e-07 ND4
# ND2 4271.731 -28.74874 4.407428 -6.522793 6.901006e-11 1.028435e-07 ND2
# ND1 4188.761 -28.71937 4.407429 -6.516128 7.214560e-11 1.028435e-07 ND1
# COX2 8580.481 -28.84833 4.407419 -6.545401 5.933587e-11 1.028435e-07 COX2
head(res.pseudo2)
head(meta.data2)
# sampleID celltype
# HC2.NK HC2 NK
# HC2.Myeloid HC2 Myeloid
# HC2.B_cell HC2 B_cell
# HC2.T_cell HC2 T_cell
# HC2.Platelet HC2 Platelet
# HC2.Hpc HC2 Hpc
例如计算每个样本中Myeloid的markers:
实际上我并不建议这样做,因为1 vs. 多 得到的差异基因不可靠。
input.data.list = split(row.names(meta.data2),
meta.data2$sampleID)
Myeloid_DEG_list = lapply(input.data.list, function(sampleID){
input.data = res.pseudo2[,sampleID]
#髓系 vs. Others
group_list = ifelse(grepl(colnames(input.data),pattern = "Myeloid"),"case","ctrl") %>%
factor(levels = c("ctrl","case"))
if (length(unique(group_list))==2) {
tem.DEGs = get.DESeq2(input.data = input.data,
group_list = group_list,
NA.remove = T)
# head(tem.DEGs)
return(tem.DEGs)
}
})
str(Myeloid_DEG_list)
names(Myeloid_DEG_list)
# [1] "A1" "CT3" "HC2" "P2" "SC4"
设置Cutoff找差异基因:
Myeloid.filter = lapply(Myeloid_DEG_list, function(DEG){
DEG$Gene[DEG$log2FoldChange>1&DEG$pvalue<0.05]
})
str(Myeloid.filter)
# List of 5
# $ A1 : chr [1:139] "DUSP1" "TSPO" "NUP214" "SDCBP" ...
# $ CT3: chr [1:230] "FCN1" "DUSP6" "PSAP" "RETN" ...
# $ HC2: chr [1:145] "TYMP" "TNFSF13B" "TSPO" "S100A6" ...
# $ P2 : chr [1:341] "S100A11" "AP1S2" "S100A12" "CPVL" ...
# $ SC4: chr [1:43] "PPT1" "CTSS" "FCGR2A" "NAGK" ...
可视化交集:
#交集
library(UpSetR)
upset(fromList(Myeloid.filter))
可以看到只有仅仅15个基因是所有样本都有的差异基因,我们看一下是哪15个差异基因:
tem.Myeloid = Myeloid.filter[[1]]
for (sampleID in names(Myeloid.filter)[-1]) {
tem.Myeloid = intersect(tem.Myeloid,Myeloid.filter[[sampleID]])
}
tem.Myeloid
# [1] "TSPO" "FCGRT" "PYCARD" "BLVRB" "LGALS1" "PSAP" "FTL" "PPT1" "BRI3" "NPC2"
# [11] "GRN" "NEAT1" "CTSS" "AP1S2" "CFD"
可以看到这些基因大部分还是髓系细胞的Markers,可以说pseudobulks可以减少假阳性错误,然而代价是引入了假阴性问题。
那么如何解决这个“假阴性问题”?
答案是增加样本量,例如:5个样本的Myeloid vs. Others:
input.data = res.pseudo2
group_list = ifelse(grepl(colnames(input.data),pattern = "Myeloid"),"case","ctrl") %>%
factor(levels = c("ctrl","case"))
# [1] ctrl case ctrl ctrl ctrl ctrl case ctrl ctrl ctrl ctrl ctrl case ctrl ctrl ctrl ctrl ctrl ctrl
# [20] ctrl case ctrl ctrl ctrl ctrl ctrl ctrl ctrl case ctrl
差异分析:
Myeloid_DEG = get.DESeq2(input.data = input.data,
group_list = group_list,
NA.remove = T)
head(Myeloid_DEG)
设置Cutoff找差异基因:
Myeloid_cutoff = Myeloid_DEG[Myeloid_DEG$log2FoldChange>1.5&Myeloid_DEG$padj<0.05,] %>%
arrange(desc(log2FoldChange))
str(Myeloid_cutoff)
length(Myeloid_cutoff$Gene)
#619
这样就能得到较为稳定的细胞类型差异基因。
总结来说,pseudobulk差异分析需要足够的两两比较的样本数量,否则结果会出现较强的假阴性现象。如果用户的样本量不大时,一是建议使用传统的FindAllMarkers()函数的Wilcox.test算法;二是使用下面的MetaCell结合DESeq2差异分析。
对每个样本进行MetaCell分析:
MetaCells函数为自编函数,存储于custom_seurat_functions_for_pseudobulks.R。
https://mp.weixin.qq.com/s/gWmn6j0iDXHLnSfuzE25eg