一文搞定单细胞基因集评分

一文搞定单细胞基因集评分 by Biomamba 生信基地

前面我们已经学习过几种基因集打分/富集方法，但是每个都是独立讲解，在不同的数据集进行演示，难以领会方法的异同点。这里我给技术支持安排了一个小作业，以同一个数据集、同一个基因集为例，进行AddModuleScore、ssGSEA、GSVA、PercentageFeatureSet、AUCell五种基因集打分方法的横向比较，一起去学习吧~

下面这五个方法的箱线图几乎是一个模子刻出来的，只是PercentageFeatureSet的算法会让数值更离散些：

1. 准备工作

1.1 单细胞测序数据集

本教程使用经典的pbmc3k数据集，代码和测试文件可以在此获取:

链接：https://pan.baidu.com/s/1WnHgrbg65n-zWn2HozPLNw?pwd=jgpk

下面预处理的步骤看不懂的同学可以先补一下基础知识:

手把手教你做单细胞测序数据分析（三）——单样本分析

set.seed(123456)dir.create('Results')
library(Seurat)

## Attaching SeuratObject

library(ggplot2)library(png)sce = Read10X('./pbmc3k_filtered_gene_bc_matrices/filtered_gene_bc_matrices/hg19/')
library(dplyr)

## ## Attaching package: 'dplyr'## The following objects are masked from 'package:stats':## ##     filter, lag## The following objects are masked from 'package:base':## ##     intersect, setdiff, setequal, unio

sce_final <- sce %>% CreateSeuratObject(min.cells = 3, min.features = 200) %>% PercentageFeatureSet(pattern = '^MT',col.name = 'percent.mt')select_c <- WhichCells(sce_final,expression = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)sce_final = subset(sce_final,cells = select_c)dim(sce_final)

## [1] 13714  2626

# [1] 13714  2626
sce_final <- sce_final %>%   NormalizeData() %>%   FindVariableFeatures() %>%   ScaleData() %>% RunPCA() %>%   FindNeighbors(dims = 1:10) %>%   FindClusters(resolution = 0.5) %>%   RunTSNE(dims = 1:10) %>%   RunUMAP(dims = 1:10)

## Centering and scaling data matrix## PC_ 1 ## Positive:  MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW ##     CD247, GIMAP5, AQP3, CCL5, SELL, GZMA, TRAF3IP3, CST7, MAL, ITM2A ##     HOPX, GIMAP7, MYC, BEX2, GZMK, ETS1, ZAP70, TNFAIP8, RIC3, LYAR ## Negative:  CST3, TYROBP, LST1, AIF1, FTL, LYZ, FTH1, FCN1, S100A9, TYMP ##     FCER1G, CFD, LGALS1, S100A8, LGALS2, CTSS, SERPINA1, SPI1, IFITM3, CFP ##     PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD ## PC_ 2 ## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, CD74, HLA-DRB1 ##     HLA-DPB1, HLA-DMA, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB ##     BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74 ## Negative:  NKG7, PRF1, CST7, GZMA, GZMB, FGFBP2, GNLY, CTSW, B2M, SPON2 ##     GZMH, CCL4, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX ##     TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, ACTB ## PC_ 3 ## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPA1, HLA-DPB1, CD74, MS4A1, HLA-DRB1, HLA-DRA ##     HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, CD37, HLA-DMA, HVCN1, FCRLA, MALAT1 ##     IRF8, PLAC8, BLNK, SMIM14, PLD4, LAT2, IGLL5, P2RX5, SWAP70, FCGR2B ## Negative:  PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, HIST1H2AC ##     CLU, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1 ##     NGFRAP1, F13A1, RUFY1, SEPT5, TSC22D1, MPP1, CMTM5, MYL9, RP11-367G6.3, GP1BA ## PC_ 4 ## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, MS4A1, CD74, HLA-DPB1, HLA-DPA1, HLA-DRB1, TCL1A ##     HLA-DQA2, HLA-DRA, LINC00926, HLA-DRB5, HIST1H2AC, PF4, SDPR, GZMB, PPBP, GNG11 ##     HLA-DMA, HLA-DMB, HVCN1, SPARC, FCRLA, GP9, FGFBP2, PTCRA, CA2, CD9 ## Negative:  VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL ##     AQP3, FYB, CD2, CD14, GIMAP4, LGALS2, ANXA1, CD27, RBP7, GIMAP5 ##     FCN1, LYZ, S100A11, S100A12, MS4A6A, FOLR3, TMSB4X, TRABD2A, AIF1, NELL2 ## PC_ 5 ## Positive:  GZMB, S100A8, NKG7, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2 ##     S100A9, GZMH, LGALS2, CCL3, XCL2, CTSW, CD14, S100A12, CLIC3, RBP7 ##     CCL5, MS4A6A, FOLR3, GSTP1, TYROBP, TTC38, AKR1C3, IGFBP7, XCL1, HOPX ## Negative:  LTB, IL7R, VIM, CKB, AQP3, MS4A7, CYTIP, RP11-290F20.3, HMOX1, PTGES3 ##     CD27, HN1, MAL, LILRB2, GDI2, CORO1B, CD2, ANXA5, TUBA1B, FAM110A ##     TRADD, PPA1, ATP1A1, CCDC109B, ABRACL, CTD-2006K23.1, FYB, WARS, TRAF3IP3, IFITM2## Computing nearest neighbor graph## Computing SNN## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck## ## Number of nodes: 2626## Number of edges: 95233## ## Running Louvain algorithm...## Maximum modularity in 10 random starts: 0.8722## Number of communities: 9## Elapsed time: 0 seconds## 11:34:56 UMAP embedding parameters a = 0.9922 b = 1.112## 11:34:56 Read 2626 rows and found 10 numeric columns## 11:34:56 Using Annoy for neighbor search, n_neighbors = 30## 11:34:56 Building Annoy index with metric = cosine, n_trees = 50## 0%   10   20   30   40   50   60   70   80   90   100%## [----|----|----|----|----|----|----|----|----|----|## **************************************************|## 11:34:56 Writing NN index file to temp file C:\Users\53513\AppData\Local\Temp\RtmpiIxgPz\file39606f2345ad## 11:34:56 Searching Annoy index using 1 thread, search_k = 3000## 11:34:57 Annoy recall = 100%## 11:34:57 Commencing smooth kNN distance calibration using 1 thread## 11:34:58 Initializing from normalized Laplacian + noise## 11:34:58 Commencing optimization for 500 epochs, with 104830 positive edges## 11:35:06 Optimization finished

new.cluster.ids <- c("Naive CD4 T", "CD14_Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A_Mono", "NK", "DC", "Platelet")names(new.cluster.ids) <- levels(sce_final)sce_final <- RenameIdents(sce_final, new.cluster.ids)
sce_final$celltype <- Idents(sce_final)sce_final$celltype <- factor(sce_final$celltype,levels = c("Naive CD4 T","Memory CD4 T","CD8 T","B","NK","CD14_Mono","FCGR3A_Mono","DC", "Platelet"))
my9color <- c('#5470c6','#91cc75','#fac858','#ee6666','#73c0de','#3ba272','#fc8452','#9a60b4','#ea7ccc')p <- DimPlot(sce_final, reduction = "umap", label = TRUE, pt.size = 0.5,label.size = 5,cols = my9color) + NoLegend() + NoAxes()ggsave('./Results/umap.png',width = 5,height = 5)

1.2 基因集

选择的是MSigdb数据库中C7: immunologic signature gene sets中的GSE10325_CD4_TCELL_VS_BCELL_UP基因集（200个基因）

library(readxl)CD4_TCELL_VS_BCELL_UP <- openxlsx::read.xlsx("./CD4_TCELL_VS_BCELL_UP.xlsx")unique(CD4_TCELL_VS_BCELL_UP$Gene)

##   [1] "AAK1"      "ACVR1"     "AK5"       "AKTIP"     "ANXA1"     "AP3M2"    ##   [7] "APBA2"     "AQP3"      "ARL4C"     "ARNTL"     "ARRB1"     "ATP10A"   ##  [13] "ATP1A1"    "ATP2B4"    "ATP6V0E2"  "BCL11B"    "BEX3"      "BLMH"     ##  [19] "BNIP3"     "C1orf216"  "CALM1"     "CAMK2G"    "CCL5"      "CCND2"    ##  [25] "CCND3"     "CCR2"      "CCR5"      "CCSER2"    "CD2"       "CD247"    ##  [31] "CD28"      "CD3G"      "CD6"       "CDR2"      "CERK"      "CHST7"    ##  [37] "CISH"      "CLUAP1"    "CORO2A"    "CPD"       "CYB561"    "CYLD"     ##  [43] "DEXI"      "DNMT1"     "DOCK9"     "DPP4"      "DPYD"      "DUSP7"    ##  [49] "DYRK2"     "FAM102A"   "FBXL8"     "FHL1"      "FKBP5"     "FLT3LG"   ##  [55] "FYB1"      "FYN"       "GATA3"     "GIMAP4"    "GIMAP5"    "GIMAP6"   ##  [61] "GNAQ"      "GOLGA7"    "GPD1L"     "GSTK1"     "GZMA"      "GZMK"     ##  [67] "HDAC4"     "HPGD"      "ICOS"      "ID2"       "IFITM1"    "IGF2R"    ##  [73] "IL10RA"    "IL11RA"    "IL2RB"     "INPP4A"    "IPCEF1"    "ITGA6"    ##  [79] "ITK"       "ITM2A"     "ITPKB"     "KLRB1"     "KLRG1"     "LCK"      ##  [85] "LDHB"      "LDLRAP1"   "LEF1"      "LEPROTL1"  "LGALS8"    "LIMA1"    ##  [91] "LINC00623" "LPAR6"     "LRIG1"     "MAF"       "MAL"       "MGAT4A"   ##  [97] "MICAL2"    "MLLT3"     "MT1F"      "MT1X"      "MYBL1"     "NACC2"    ## [103] "NCALD"     "NELL2"     "NOSIP"     "OPTN"      "P2RX4"     "PDE3B"    ## [109] "PDE4D"     "PHACTR2"   "PIK3IP1"   "PIK3R1"    "PIM1"      "PIP4K2A"  ## [115] "PLAAT4"    "PLCG1"     "PLCL1"     "PLIN2"     "PLK3"      "PRDX2"    ## [121] "PRF1"      "PRKACB"    "PRKCA"     "PRKCH"     "PRKCQ"     "PRMT2"    ## [127] "PRORP"     "PTGER2"    "PTPN13"    "PTPN4"     "PXN"       "RAB33A"   ## [133] "RASGRP1"   "RCAN3"     "RGCC"      "RGS10"     "RIC8B"     "RNF144A"  ## [139] "RORA"      "RPS6KA3"   "RUNX1"     "S100A4"    "SAMHD1"    "SELPLG"   ## [145] "SEMA4C"    "SEMA4D"    "SERINC5"   "SFI1"      "SH2D1A"    "SH2D2A"   ## [151] "SIRPG"     "SKAP1"     "SLC39A8"   "SLC9A3R1"  "SOCS2"     "SPATS2L"  ## [157] "SRGN"      "STAT4"     "STAT5B"    "STOM"      "STXBP1"    "SVIL"     ## [163] "TBC1D2B"   "TBC1D4"    "TGFBR3"    "TIAM1"     "TIMP1"     "TKTL1"    ## [169] "TMEM123"   "TMEM30B"   "TMEM43"    "TNFAIP3"   "TNFRSF25"  "TNFSF10"  ## [175] "TNFSF14"   "TNFSF8"    "TNIK"      "TOB1"      "TRAT1"     "TRAV8-3"  ## [181] "TRBC1"     "TRBV10-2"  "TRDV2"     "TRPV2"     "TTLL1"     "TXK"      ## [187] "UBASH3A"   "UPP1"      "URI1"      "VAMP5"     "VIM"       "VPS26C"   ## [193] "WDR1"      "WDR82"     "WWP1"      "ZAP70"     "ZMYM6"     "ZNF277"

2. 不同打分方法横向比较

2.1 PercentageFeatureSet

2.1.1 仅保留基因集中存在于seurat对象中的基因+评分

a <- CD4_TCELL_VS_BCELL_UP[CD4_TCELL_VS_BCELL_UP$Gene %in% rownames(sce_final),]sce <- PercentageFeatureSet(sce_final,features = a,col.name = 'CD4_TCELL_VS_BCELL_UP_score')
colnames(sce@meta.data)

## [1] "orig.ident"                  "nCount_RNA"                 ## [3] "nFeature_RNA"                "percent.mt"                 ## [5] "RNA_snn_res.0.5"             "seurat_clusters"            ## [7] "celltype"                    "CD4_TCELL_VS_BCELL_UP_score"

# 可以看到sce对象的meta.data中新增了一列名为CD4_TCELL_VS_BCELL_UP_score的评分

2.1.2 可视化各个细胞类型CD4_TCELL_VS_BCELL_UP_score基因集的评分

my_comparisons <- list(c('Naive CD4 T','B'),c('Memory CD4 T','B'))library(ggpubr)p.percentage <- ggviolin(sce@meta.data, x = "celltype", y = "CD4_TCELL_VS_BCELL_UP_score",         color = "celltype",add = 'mean_sd',fill = 'celltype',         add.params = list(color = "black")) +   stat_compare_means(comparisons = my_comparisons,label = "p.signif") +   scale_color_manual(values = my9color) +   scale_fill_manual(values = my9color) +  theme(axis.text.x.bottom = element_text(angle = 90,vjust = 0.5,hjust = 1)) +   NoLegend() + labs(x = '')ggsave('./Results/percentagefeatureset_score_vln.png',width = 8,height = 4.5)p2 <- FeaturePlot(sce,'CD4_TCELL_VS_BCELL_UP_score')ggsave('./Results/percentagefeatureset_score_umap.png',width = 5.5,height = 5)

可以看到两类T细胞的基因集评分评分都显著高于B细胞。

2.2 AddModuleScore

AddModuleScore函数计算出的结果与PercentageFeatureSet函数会有什么差异呢？ > 注意：AddModuleScore使用的基因集对象需要提前转变为列表格式。

2.2.1 基因集列表制作及评分

genes <- agenes <- as.data.frame(genes)genes <- as.list(genes)
sce <- AddModuleScore(sce_final,features = genes,name = 'CD4_TCELL_VS_BCELL_UP_score')colnames(sce@meta.data)

## [1] "orig.ident"                   "nCount_RNA"                  ## [3] "nFeature_RNA"                 "percent.mt"                  ## [5] "RNA_snn_res.0.5"              "seurat_clusters"             ## [7] "celltype"                     "CD4_TCELL_VS_BCELL_UP_score1"

# 可以看到sce对象的meta.data中新增了一列名为CD4_TCELL_VS_BCELL_UP_score1的评分

2.2.2 可视化

my_comparisons <- list(c('Naive CD4 T','B'),c('Memory CD4 T','B'))library(ggpubr)p.AddModuleScore <- ggviolin(sce@meta.data, x = "celltype", y = "CD4_TCELL_VS_BCELL_UP_score1",         color = "celltype",add = 'mean_sd',fill = 'celltype',         add.params = list(color = "black")) +   stat_compare_means(comparisons = my_comparisons,label = "p.signif") +   scale_color_manual(values = my9color) +   scale_fill_manual(values = my9color) +  theme(axis.text.x.bottom = element_text(angle = 90,vjust = 0.5,hjust = 1)) +   NoLegend() + labs(x = '')ggsave('./Results/addmodulescore_score_vln.png',width = 8,height = 4.5)p2 <- FeaturePlot(sce,'CD4_TCELL_VS_BCELL_UP_score1')ggsave('./Results/addmodulescore_score_umap.png',width = 5.5,height = 5)

这里得到的结果与PercentageFeatureSet相似。 #### AddModuleScore和PercentageFeatureSet的差异区别其实很容易发现，从metadata的数据我们就可以观察到AddModuleScore得到的评分是有负值的。我们plot看一下两种方法，基因集评分值的分布情况：

# plot(sce$CD4_TCELL_VS_BCELL_UP_score)# plot(sce$CD4_TCELL_VS_BCELL_UP_score1)

PercentageFeatureSet：其评估得到的分数，均为正值，并且有不少的离散值分布。

AddModuleScore：可以看到，其评估得到的分数，有点类似scale之后的结果。

这个帖子有提到AddModuleScore的原理：Seurat的打分函数AddMouduleScore

2.3 AUCell

按照之前的讲解，我们走一遍AUCell的流程。

2.3.1 AUCell_buildRankings

library(AUCell)library(clusterProfiler)

‍

## ## clusterProfiler v3.18.1  For help: https://guangchuangyu.github.io/software/clusterProfiler## ## If you use clusterProfiler in published research, please cite:## Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.## ## Attaching package: 'clusterProfiler'## The following object is masked from 'package:stats':## ##     filter

library(ggplot2)library(Seurat)library(GSEABase)

## Loading required package: BiocGenerics## Loading required package: parallel## ## Attaching package: 'BiocGenerics'## The following objects are masked from 'package:parallel':## ##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,##     clusterExport, clusterMap, parApply, parCapply, parLapply,##     parLapplyLB, parRapply, parSapply, parSapplyLB## The following objects are masked from 'package:dplyr':## ##     combine, intersect, setdiff, union## The following objects are masked from 'package:stats':## ##     IQR, mad, sd, var, xtabs## The following objects are masked from 'package:base':## ##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,##     union, unique, unsplit, which.max, which.min## Loading required package: Biobase## Welcome to Bioconductor## ##     Vignettes contain introductory material; view with##     'browseVignettes()'. To cite Bioconductor, see##     'citation("Biobase")', and for packages 'citation("pkgname")'.## Loading required package: annotate## Loading required package: AnnotationDbi## Loading required package: stats4## Loading required package: IRanges## Loading required package: S4Vectors## ## Attaching package: 'S4Vectors'## The following object is masked from 'package:clusterProfiler':## ##     rename## The following objects are masked from 'package:dplyr':## ##     first, rename## The following object is masked from 'package:base':## ##     expand.grid## ## Attaching package: 'IRanges'## The following object is masked from 'package:clusterProfiler':## ##     slice## The following objects are masked from 'package:dplyr':## ##     collapse, desc, slice## The following object is masked from 'package:grDevices':## ##     windows## ## Attaching package: 'AnnotationDbi'## The following object is masked from 'package:clusterProfiler':## ##     select## The following object is masked from 'package:dplyr':## ##     select## Loading required package: XML## Loading required package: graph## ## Attaching package: 'graph'## The following object is masked from 'package:XML':## ##     addNode

sce <- sce_final
# AUCell_buildRankingscells_rankings <- AUCell_buildRankings(sce@assays$RNA@counts,nCores=1, plotStats=TRUE)

## Quantiles for the number of genes detected by cell: ## (Non-detected genes are shuffled at the end of the ranking. Keep it in mind when choosing the threshold for calculating the AUC).

##     min      1%      5%     10%     50%    100% ##  212.00  341.00  460.25  557.50  819.00 2413.00

# Quantiles for the number of genes detected by cell: #   (Non-detected genes are shuffled at the end of the ranking. Keep it in mind when choosing the threshold for calculating the AUC).# min      1%      5%     10%     50%    100% # 212.00  341.00  460.25  557.50  819.00 2413.00

2.3.2 制作基因集

## 制作基因集geneSets <- aextraGeneSets <- c(GeneSet(sample(geneSets),setName="CD4_TCELL_VS_BCELL_UP_gene"))
geneSets1 <- GeneSetCollection(extraGeneSets)names(geneSets1)

## [1] "CD4_TCELL_VS_BCELL_UP_gene"

2.3.3 计算AUC

## 计算AUCcells_AUC <- AUCell_calcAUC(geneSets1, cells_rankings)cells_AUC

## AUC for 1 gene sets (rows) and 2626 cells (columns).## ## Top-left corner of the AUC matrix:##                             cells## gene sets                    AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1##   CD4_TCELL_VS_BCELL_UP_gene       0.05518402       0.03328588       0.08549763##                             cells## gene sets                    AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1##   CD4_TCELL_VS_BCELL_UP_gene       0.04875603       0.06796838

2.3.4 AUC阈值计算及可视化

pdf('./Results/cells_assignment.pdf',width = 6,height = 5)cells_assignment <- AUCell_exploreThresholds(cells_AUC, plotHist=TRUE, assign=TRUE) dev.off()

## png ##   2

##set gene set of interest here for plottinggeneSet <- "CD4_TCELL_VS_BCELL_UP_gene"aucs <- as.numeric(getAUC(cells_AUC)[geneSet, ])sce$AUC <- aucsdf<- data.frame(sce@meta.data, sce@reductions[["umap"]]@cell.embeddings)colnames(df)

##  [1] "orig.ident"      "nCount_RNA"      "nFeature_RNA"    "percent.mt"     ##  [5] "RNA_snn_res.0.5" "seurat_clusters" "celltype"        "AUC"            ##  [9] "UMAP_1"          "UMAP_2"

#我们看到每个细胞现在都加上AUC值了，下面做一下可视化。class_avg <- df %>%  group_by(celltype) %>%  summarise(    UMAP_1 = median(UMAP_1),    UMAP_2 = median(UMAP_2)  )p1 <- ggplot(df, aes(UMAP_1, UMAP_2))  +  geom_point(aes(colour  = AUC)) + viridis::scale_color_viridis(option="H") +  ggrepel::geom_text_repel(aes(label = celltype),                            data = class_avg,                            size = 6,                            segment.color = NA) +     theme(legend.position = "none") + theme_classic()ggsave('./Results/umap_auc.png',width = 5.5,height = 5)
p.auc <- ggviolin(df, x = "celltype", y = "AUC",         color = "celltype",add = 'mean_sd',fill = 'celltype',         add.params = list(color = "black")) +   stat_compare_means(comparisons = my_comparisons,label = "p.signif") +   scale_color_manual(values = my9color) +   scale_fill_manual(values = my9color) +  theme(axis.text.x.bottom = element_text(angle = 90,vjust = 0.5,hjust = 1)) +   NoLegend() + labs(x = '')ggsave('./Results/vio_auc.png',width = 8,height = 4.5)

可以看到AUC的阈值为0.098，大于此阈值的有28个细胞

可以看到，T和NK细胞区域具有相对较高的AUC得分，B细胞的评分最低，与AddModuleScore和PercentageFeatureSet的计算结果是一致的。

2.4 ssGSEA

2.4.1 加载R包

# devtools::install_github("nicolash2/gggsea")library(GSVA)#ssGSEA也可以通过GSVA包实现library(ggplot2)

2.4.2 ssGSEA分析

gene.expr <-  as.matrix(sce_final[["RNA"]]@data)dim(gene.expr)

gene4ssGSEA <- as.data.frame(CD4_TCELL_VS_BCELL_UP$Gene)colnames(gene4ssGSEA)<-'CD4_TCELL_VS_BCELL_UP'ssGSEA.result <- GSVA::gsva(gene.expr, gene4ssGSEA,method='ssgsea',kcdf='Gaussian')

## Estimating ssGSEA scores for 1 gene sets.

2.4.3 可视化

my_comparisons <- list(c('Naive CD4 T','B'),c('Memory CD4 T','B'))library(ggpubr)data4plot <-cbind(sce_final@meta.data,                  t(ssGSEA.result)[rownames(sce_final@meta.data),])data4plot[1:4,1:4]

##                     orig.ident nCount_RNA nFeature_RNA percent.mt## AAACATACAACCAC-1 SeuratProject       2419          779   3.100455## AAACATTGAGCTAC-1 SeuratProject       4903         1352   3.875178## AAACATTGATCAGC-1 SeuratProject       3147         1129   1.080394## AAACCGTGCTTCCG-1 SeuratProject       2639          960   1.970443

colnames(data4plot)[ncol(data4plot)] <- 'CD4_TCELL_VS_BCELL_UP_ssgsea'p.ssGSEA <- ggviolin(data4plot,                x = "celltype", y = "CD4_TCELL_VS_BCELL_UP_ssgsea",         color = "celltype",add = 'mean_sd',fill = 'celltype',         add.params = list(color = "black")) +   stat_compare_means(comparisons = my_comparisons,label = "p.signif") +   scale_color_manual(values = my9color) +   scale_fill_manual(values = my9color) +  theme(axis.text.x.bottom = element_text(angle = 90,vjust = 0.5,hjust = 1)) +   NoLegend() + labs(x = '')

2.5 GSVA

GSVA方法Biomamba生信基地之前发过推文的噢！链接在单细胞中应该如何做GSVA？这里参照之前的流程进行分析。

2.5.1 加载包

library(GSVA)library(ggplot2)

2.5.2 GSVA分析

gene.expr <-  as.matrix(sce_final[["RNA"]]@data)dim(gene.expr)

## [1] 13714  2626

gene4GSVA <- as.data.frame(CD4_TCELL_VS_BCELL_UP$Gene)colnames(gene4GSVA)<-'CD4_TCELL_VS_BCELL_UP'GSVA.result <- GSVA::gsva(gene.expr, gene4GSVA,method='gsva',kcdf='Gaussian')

## Estimating GSVA scores for 1 gene sets.## Estimating ECDFs with Gaussian kernels##   |                                                                              |                                                                      |   0%  |                                                                              |======================================================================| 100%

2.5.3 可视化

my_comparisons <- list(c('Naive CD4 T','B'),c('Memory CD4 T','B'))library(ggpubr)data4plot <-cbind(sce_final@meta.data,                  t(GSVA.result)[rownames(sce_final@meta.data),])data4plot[1:4,1:4]

##                     orig.ident nCount_RNA nFeature_RNA percent.mt## AAACATACAACCAC-1 SeuratProject       2419          779   3.100455## AAACATTGAGCTAC-1 SeuratProject       4903         1352   3.875178## AAACATTGATCAGC-1 SeuratProject       3147         1129   1.080394## AAACCGTGCTTCCG-1 SeuratProject       2639          960   1.970443

colnames(data4plot)[ncol(data4plot)] <- 'CD4_TCELL_VS_BCELL_UP_GSVA'p.GSVA <- ggviolin(data4plot,                x = "celltype", y = "CD4_TCELL_VS_BCELL_UP_GSVA",         color = "celltype",add = 'mean_sd',fill = 'celltype',         add.params = list(color = "black")) +   stat_compare_means(comparisons = my_comparisons,label = "p.signif") +   scale_color_manual(values = my9color) +   scale_fill_manual(values = my9color) +  theme(axis.text.x.bottom = element_text(angle = 90,vjust = 0.5,hjust = 1)) +   NoLegend() + labs(x = '')p.GSVA

3. 一起对比一下吧

library(patchwork)(p.percentage|p.AddModuleScore)/(p.auc|p.GSVA|p.ssGSEA)

4. 版本信息

sessionInfo()

## R version 4.0.3 (2020-10-10)## Platform: x86_64-w64-mingw32/x64 (64-bit)## Running under: Windows 10 x64 (build 19044)## ## Matrix products: default## ## locale:## [1] LC_COLLATE=Chinese (Simplified)_China.936 ## [2] LC_CTYPE=Chinese (Simplified)_China.936   ## [3] LC_MONETARY=Chinese (Simplified)_China.936## [4] LC_NUMERIC=C                              ## [5] LC_TIME=Chinese (Simplified)_China.936    ## ## attached base packages:## [1] stats4    parallel  stats     graphics  grDevices utils     datasets ## [8] methods   base     ## ## other attached packages:##  [1] patchwork_1.1.1        GSVA_1.38.2            GSEABase_1.52.1       ##  [4] graph_1.68.0           annotate_1.68.0        XML_3.99-0.7          ##  [7] AnnotationDbi_1.52.0   IRanges_2.24.1         S4Vectors_0.28.1      ## [10] Biobase_2.50.0         BiocGenerics_0.36.1    clusterProfiler_3.18.1## [13] AUCell_1.12.0          ggpubr_0.4.0           readxl_1.3.1          ## [16] dplyr_1.0.7            png_0.1-7              ggplot2_3.3.5         ## [19] SeuratObject_4.0.2     Seurat_4.0.2          ## ## loaded via a namespace (and not attached):##   [1] utf8_1.2.2                  reticulate_1.20            ##   [3] R.utils_2.10.1              tidyselect_1.1.1           ##   [5] RSQLite_2.2.8               htmlwidgets_1.5.4          ##   [7] BiocParallel_1.24.1         grid_4.0.3                 ##   [9] Rtsne_0.15                  scatterpie_0.1.7           ##  [11] munsell_0.5.0               codetools_0.2-16           ##  [13] ragg_1.2.2                  ica_1.0-2                  ##  [15] future_1.22.1               miniUI_0.1.1.1             ##  [17] withr_2.4.2                 GOSemSim_2.16.1            ##  [19] colorspace_2.0-2            highr_0.9                  ##  [21] knitr_1.34                  ROCR_1.0-11                ##  [23] ggsignif_0.6.3              tensor_1.5                 ##  [25] DOSE_3.16.0                 listenv_0.8.0              ##  [27] MatrixGenerics_1.2.1        labeling_0.4.2             ##  [29] GenomeInfoDbData_1.2.4      polyclip_1.10-0            ##  [31] bit64_4.0.5                 farver_2.1.0               ##  [33] downloader_0.4              parallelly_1.28.1          ##  [35] vctrs_0.3.8                 generics_0.1.0             ##  [37] xfun_0.25                   R6_2.5.1                   ##  [39] GenomeInfoDb_1.26.7         graphlayouts_0.7.1         ##  [41] openintro_2.2.0             fgsea_1.16.0               ##  [43] DelayedArray_0.16.3         bitops_1.0-7               ##  [45] spatstat.utils_2.2-0        cachem_1.0.6               ##  [47] assertthat_0.2.1            promises_1.2.0.1           ##  [49] cherryblossom_0.1.0         scales_1.1.1               ##  [51] ggraph_2.0.5                enrichplot_1.10.2          ##  [53] gtable_0.3.0                globals_0.14.0             ##  [55] goftest_1.2-2               tidygraph_1.2.0            ##  [57] rlang_0.4.11                systemfonts_1.0.4          ##  [59] splines_4.0.3               rstatix_0.7.0              ##  [61] lazyeval_0.2.2              spatstat.geom_2.3-2        ##  [63] broom_0.7.9                 BiocManager_1.30.16        ##  [65] yaml_2.2.1                  reshape2_1.4.4             ##  [67] abind_1.4-5                 backports_1.2.1            ##  [69] httpuv_1.6.2                qvalue_2.22.0              ##  [71] tools_4.0.3                 ellipsis_0.3.2             ##  [73] spatstat.core_2.3-0         jquerylib_0.1.4            ##  [75] RColorBrewer_1.1-2          ggridges_0.5.3             ##  [77] Rcpp_1.0.7                  plyr_1.8.6                 ##  [79] zlibbioc_1.36.0             purrr_0.3.4                ##  [81] RCurl_1.98-1.4              rpart_4.1-15               ##  [83] deldir_1.0-6                viridis_0.6.1              ##  [85] pbapply_1.5-0               cowplot_1.1.1              ##  [87] zoo_1.8-9                   SummarizedExperiment_1.20.0##  [89] haven_2.4.3                 ggrepel_0.9.1              ##  [91] cluster_2.1.0               magrittr_2.0.1             ##  [93] data.table_1.14.2           RSpectra_0.16-0            ##  [95] scattermore_0.7             DO.db_2.9                  ##  [97] openxlsx_4.2.4              lmtest_0.9-38              ##  [99] RANN_2.6.1                  airports_0.1.0             ## [101] fitdistrplus_1.1-6          matrixStats_0.60.1         ## [103] hms_1.1.0                   mime_0.11                  ## [105] evaluate_0.14               xtable_1.8-4               ## [107] rio_0.5.27                  gridExtra_2.3              ## [109] compiler_4.0.3              tibble_3.1.5               ## [111] shadowtext_0.1.0            KernSmooth_2.23-17         ## [113] crayon_1.4.1                R.oo_1.24.0                ## [115] htmltools_0.5.2             segmented_1.3-4            ## [117] ggfun_0.0.4                 mgcv_1.8-33                ## [119] later_1.3.0                 tzdb_0.1.2                 ## [121] tidyr_1.1.4                 DBI_1.1.1                  ## [123] tweenr_1.0.2                MASS_7.3-53                ## [125] Matrix_1.3-4                car_3.0-11                 ## [127] readr_2.1.1                 R.methodsS3_1.8.1          ## [129] igraph_1.2.6                GenomicRanges_1.42.0       ## [131] forcats_0.5.1               pkgconfig_2.0.3            ## [133] rvcheck_0.1.8               foreign_0.8-80             ## [135] plotly_4.10.0               spatstat.sparse_2.0-0      ## [137] bslib_0.3.1                 XVector_0.30.0             ## [139] stringr_1.4.0               digest_0.6.27              ## [141] sctransform_0.3.2           RcppAnnoy_0.0.19           ## [143] spatstat.data_2.1-0         fastmatch_1.1-3            ## [145] rmarkdown_2.10              cellranger_1.1.0           ## [147] leiden_0.3.9                usdata_0.2.0               ## [149] uwot_0.1.10                 kernlab_0.9-29             ## [151] curl_4.3.2                  shiny_1.7.1                ## [153] lifecycle_1.0.1             nlme_3.1-149               ## [155] jsonlite_1.7.2              carData_3.0-4              ## [157] viridisLite_0.4.0           fansi_0.5.0                ## [159] pillar_1.6.4                lattice_0.20-41            ## [161] GO.db_3.12.1                fastmap_1.1.0              ## [163] httr_1.4.2                  survival_3.2-7             ## [165] glue_1.4.2                  zip_2.2.0                  ## [167] bit_4.0.4                   mixtools_1.2.0             ## [169] ggforce_0.3.3               stringi_1.7.4              ## [171] sass_0.4.0                  blob_1.2.2                 ## [173] textshaping_0.3.6           memoise_2.0.0              ## [175] irlba_2.3.3                 future.apply_1.8.1

5.总结

本次针对上述物种基因集打分方法做了汇总。不得不说，通过总结，我对这些方法也有了进一步的认识。
我们可以看到，这些方法所得到的结果均具有很好的一致性；在这个前提下，我们应该选择哪个方法可能是值得商榷的问题。
① PercentageFeatureSet函数并未对结果进行处理，raw结果的比较分析可能会受到离散值的影响，所以个人觉得可能不是很客观；相比之下，其他方法均不同程度的去除了离散值的影响，相对来说更为客观。
② 值得一提的是，GSEA使用的情况与其他函数/算法并不相同，其首先需要进行差异基因的计算，然后再去观察针对特定基因集差异基因所富集的情况。其他算法则不需要进行差异基因计算，但总而言之，评分/富集程度的差异也同时反映了不同细胞/评估对象基因表达的差异。万变不离其宗。
你觉得那种方法更简单呢？
如果是小编，我会更喜欢AddModuleScore函数。一行搞定，不耗时！耗时短除了PercentageFeatureSet函数可以相媲美（当然缺点很明显），其他方法相对AddModuleScore来说都过于复杂了。
不短了，今天就总结到这里吧，希望这个教程能帮到你们这些爱学习生信的小可爱们，一起加油吧~~

往期回顾：

单细胞数据分析系列教程：

B站视频，先看一遍视频再去看推送操作，建议至少看三遍：

https://www.bilibili.com/video/BV1S44y1b76Z/

单细胞测序基础数据分析保姆级教程，代码部分整理在往期推送之中：

手把手教你做单细胞测序数据分析（一）——绪论

手把手教你做单细胞测序数据分析（二）——各类输入文件读取

手把手教你做单细胞测序数据分析（三）——单样本分析

手把手教你做单细胞测序数据分析（四）——多样本整合

手把手教你做单细胞测序数据分析（五）——细胞类型注释

手把手教你做单细胞测序数据分析（六）——组间差异分析及可视化

手把手教你做单细胞测序数据分析（七）——基因集富集分析