irGSEA：基于秩次的单细胞基因集富集分析整合框架

irGSEA：基于秩次的单细胞基因集富集分析整合框架 by 生信技能树

距离上一次介绍irGSEA，已经是两年前了，详见：8种方法可视化你的单细胞基因集打分，目前Seurat都已经更新到了V5，假如你不喜欢最新版的Seurat包的单细胞理念，正好这个irGSEA也是与时俱进，不仅仅是更新到了17种基因集合打分算法，而且是配合V4和V5版本的Seurat工作流，非常的方便！如果大家看完了这个万字长文的介绍，仍然是愿意深入了解irGSEA可以看看文末的交流群哈！

背景

许多Functional Class Scoring (FCS)方法，如GSEA, GSVA,PLAGE, addModuleScore, SCSE, Vision, VAM, gficf, pagoda2和Sargent，都会受数据集组成的影响，数据集组成的轻微变化将改变细胞的基因集富集分数。

假如将新的单细胞数据集整合到现有数据中，使用这些FCS方法需要重新计算每个细胞的基因集富集分数。这个步骤可能是繁琐且资源密集的。

相反，基于单个细胞表达等级的FCS，如AUCell、UCell、singscore、ssGSEA、JASMINE和Viper，只需要计算新添加的单细胞数据集的富集分数，而无需重新计算所有细胞的基因集富集分数。原因是这些方法生成的富集分数仅依赖于单个细胞水平上的相对基因表达，与数据集组成无关。因此，这些方法可以节省大量的时间。

审视结果

在这里，我们审视了17种常见的FCS方法：

GSEA 检测排序基因列表顶部或底部的基因集富集程度，该列表是分组后计算排序基因信噪比或排序基因倍数变化得到的；
GSVA 估计所有细胞之间每个基因的累积密度函数的核。这个过程中需要考虑所有样本，容易受到样本背景信息的影响；
PLAGE 对跨细胞的基因表达矩阵进行标准化，并提取奇异值分解作为基因集富集分数；
Zscore 聚合了基因集中所有基因的表达，通过细胞间的平均值和标准差缩放表达；
AddModuleScore需要先计算基因集中所有基因的平均值，再根据平均值把表达矩阵切割成若干份，然后从切割后的每一份中随机抽取对照基因（基因集外的基因）作为背景值。因此，在整合不同样本的情况下，即使使用相同基因集为相同细胞打分，也会产生不同的富集评分；
SCSE 使用基因集所有基因的归一化的总和来量化基因集富集分数；
Vision 使用随机签名的预期均值和方差对基因集富集分数进行 z 归一化从而校正基因集富集分数；
VAM 根据经典Mahalanobis多元距离从单细胞 RNA 测序数据生成基因集富集分数；
Gficf 利用通过非负矩阵分解获得的基因表达值的潜在因子的信息生物信号；
Pagoda2 拟合每个细胞的误差模型，并使用其第一个加权主成分量化基因集富集分数；
AUCell 基于单个样本中的基因表达排名,使用曲线下面积来评估输入基因集是否在单个样本的前5%表达基因内富集；
UCell 基于单个样本的基因表达排名，使用Mann-Whitney U统计量计算单个样本的基因集富集分数；
Singscore 根据基因表达等级评估距单个细胞中心的距离。基因集中的基因根据单个细胞中的转录本丰度进行排序。平均等级相对于理论最小值和最大值单独标准化，以零为中心，然后聚合，所得分数代表基因集的富集分数；
ssGSEA 根据每个细胞的基因表达等级计算内部和外部基因集之间的经验累积分布的差异分数。使用全局表达谱对差异分数进行标准化。标准化这一步容易受样本构成的影响。
JASMINE 根据在单个细胞中表达基因中的基因排名和表达基因中基因集的富集度计算近似平均值。这两个值均标准化为 0-1 范围，并通过平均进行组合，得出基因集的最终富集分数。
Viper 通过根据细胞间基因表达的排名执行three-tailed计算来估计基因集的富集分数。
Sargent 将给定细胞的非零表达基因从高表达到低表达进行排序，并将输入的基因逐细胞表达矩阵转换为相应的gene-set-by-cell assignment score matrix。但Sargent 需要计算细胞间的gini-index后，将按gene-set-by-cell assignment score matrix转换为distribution of indexes。

工作流程

使用AUCell、UCell、singscore、ssgsea、JASMINE 和 viper分别对各个细胞进行评分，得到不同的富集评分矩阵。
通过wilcoxon检验计算不同的富集评分矩阵中每个细胞亚群差异表达的基因集。up或down表示该细胞簇内差异基因集的富集程度高于或低于其他簇。

单一的基因集富集分析方法不仅只能反映有限的信息，而且也容易带来误差。我们期待从多个角度解释复杂的生物学问题，并找到生物学问题中的共性部分。简单地为多种基因集富集分析方法的结果取共同交集，不仅容易得到少而保守的结果，而且忽略了富集分析方法中很多的其他信息，例如不同基因集的相对富集程度信息。

我们希望目标基因集在大部分富集分析方法中都是富集且富集程度没有明显差异。因此，我们通过RobustRankAggreg包中的秩聚合算法(robust rank aggregation, RRA)对差异分析的结果进行评估，筛选出在6种方法中表现出相似的富集程度的差异基因集。

irGSEA安装

1.irGSEA安装（基础配置）

仅使用 AUCell, UCell, singscore, ssGSEA, JASMINE和viper

# install packages from CRAN
cran.packages <- c("aplot", "BiocManager", "data.table", "devtools", 
                   "doParallel", "doRNG", "dplyr", "ggfun", "gghalves", 
                   "ggplot2", "ggplotify", "ggridges", "ggsci", "irlba",
                   "magrittr", "Matrix", "msigdbr", "pagoda2", "pointr", 
                   "purrr", "RcppML", "readr", "reshape2", "reticulate", 
                   "rlang", "RMTstat", "RobustRankAggreg", "roxygen2", 
                   "Seurat", "SeuratObject", "stringr", "tibble", "tidyr", 
                   "tidyselect", "tidytree", "VAM")

for (i in cran.packages) {
  if (!requireNamespace(i, quietly = TRUE)) {
    install.packages(i, ask = F, update = F)
  }
}

# install packages from Bioconductor
bioconductor.packages <- c("AUCell", "BiocParallel", "ComplexHeatmap", 
                           "decoupleR", "fgsea", "ggtree", "GSEABase", 
                           "GSVA", "Nebulosa", "scde", "singscore",
                           "SummarizedExperiment", "UCell",
                           "viper","sparseMatrixStats")

for (i in bioconductor.packages) {
  if (!requireNamespace(i, quietly = TRUE)) {
    install.packages(i, ask = F, update = F)
  }
}

# install packages from Github
if (!requireNamespace("irGSEA", quietly = TRUE)) { 
    devtools::install_github("chuiqin/irGSEA", force =T)
}

2.irGSEA安装（进阶配置）

想使用VISION, gficf, Sargent, ssGSEApy, GSVApy等其他方法（这部分是可选安装）

# VISION
if (!requireNamespace("VISION", quietly = TRUE)) { 
    devtools::install_github("YosefLab/VISION", force =T)
}

# mdt need ranger
if (!requireNamespace("ranger", quietly = TRUE)) { 
  devtools::install_github("imbs-hl/ranger", force =T)
}

# gficf need RcppML (version > 0.3.7) package
if (!utils::packageVersion("RcppML") > "0.3.7") {
  message("The version of RcppML should greater than 0.3.7 and install RcppML package from Github")
  devtools::install_github("zdebruine/RcppML", force =T)
}

# please first `library(RcppML)` if you want to perform gficf
if (!requireNamespace("gficf", quietly = TRUE)) { 
    devtools::install_github("gambalab/gficf", force =T)
}

# GSVApy and ssGSEApy need SeuratDisk package
if (!requireNamespace("SeuratDisk", quietly = TRUE)) { 
    devtools::install_github("mojaveazure/seurat-disk", force =T)
}

# sargent
if (!requireNamespace("sargent", quietly = TRUE)) { 
    devtools::install_github("Sanofi-Public/PMCB-Sargent", force =T)
}

# pagoda2 need scde package
if (!requireNamespace("scde", quietly = TRUE)) { 
  devtools::install_github("hms-dbmi/scde", force =T)
}

# if error1 (functio 'sexp_as_cholmod_sparse' not provided by package 'Matrix')
# or error2 (functio 'as_cholmod_sparse' not provided by package 'Matrix') occurs
# when you perform pagoda2, please check the version of irlba and Matrix
# It's ok when I test as follow：
# R 4.2.2 irlba(v 2.3.5.1) Matrix(1.5-3)
# R 4.3.1 irlba(v 2.3.5.1) Matrix(1.6-1.1)
# R 4.3.2 irlba(v 2.3.5.1) Matrix(1.6-3)


#### create conda env
# If error (Unable to find conda binary. Is Anaconda installed) occurs, 
# please perform `reticulate::install_miniconda()`
if (! "irGSEA" %in% reticulate::conda_list()$name) {
      reticulate::conda_create("irGSEA")
}

# if python package exist
python.package <- reticulate::py_list_packages(envname = "irGSEA")$package
require.package <- c("anndata", "scanpy", "argparse", "gseapy", "decoupler")
for (i in seq_along(require.package)) {
  if (i %in% python.package) {
    reticulate::conda_install(envname = "irGSEA", packages = i, pip = T)
  }
}

3.国内镜像加速安装

安装github包和pip包有困难的参考这里，我把github包克隆到了gitee上

options(BioC_mirror="https://mirrors.tuna.tsinghua.edu.cn/bioconductor/")
options("repos" = c(CRAN="http://mirrors.cloud.tencent.com/CRAN/"))

# install packages from CRAN
cran.packages <- c("aplot", "BiocManager", "data.table", "devtools", 
                   "doParallel", "doRNG", "dplyr", "ggfun", "gghalves", 
                   "ggplot2", "ggplotify", "ggridges", "ggsci", "irlba",
                   "magrittr", "Matrix", "msigdbr", "pagoda2", "pointr", 
                   "purrr", "RcppML", "readr", "reshape2", "reticulate", 
                   "rlang", "RMTstat", "RobustRankAggreg", "roxygen2", 
                   "Seurat", "SeuratObject", "stringr", "tibble", "tidyr", 
                   "tidyselect", "tidytree", "VAM")

for (i in cran.packages) {
  if (!requireNamespace(i, quietly = TRUE)) {
    install.packages(i, ask = F, update = F)
  }
}

# install packages from Bioconductor
bioconductor.packages <- c("AUCell", "BiocParallel", "ComplexHeatmap", 
                           "decoupleR", "fgsea", "ggtree", "GSEABase", 
                           "GSVA", "Nebulosa", "scde", "singscore",
                           "SummarizedExperiment", "UCell", "viper")

for (i in bioconductor.packages) {
  if (!requireNamespace(i, quietly = TRUE)) {
    install.packages(i, ask = F, update = F)
  }
}

# install packages from git
if (!requireNamespace("irGSEA", quietly = TRUE)) { 
    devtools::install_git("https://gitee.com/fan_chuiqin/irGSEA.git", force =T)
}
# VISION
if (!requireNamespace("VISION", quietly = TRUE)) { 
    devtools::install_git("https://gitee.com/fan_chuiqin/VISION.git", force =T)
}

# mdt need ranger
if (!requireNamespace("ranger", quietly = TRUE)) { 
  devtools::install_git("https://gitee.com/fan_chuiqin/ranger.git", force =T)
}

# gficf need RcppML (version > 0.3.7) package
if (!utils::packageVersion("RcppML") > "0.3.7") {
  message("The version of RcppML should greater than 0.3.7 and install RcppML package from Git")
  devtools::install_git("https://gitee.com/fan_chuiqin/RcppML.git", force =T)
}

# please first `library(RcppML)` if you want to perform gficf
if (!requireNamespace("gficf", quietly = TRUE)) { 
    devtools::install_git("https://gitee.com/fan_chuiqin/gficf.git", force =T)
}

# GSVApy and ssGSEApy need SeuratDisk package
if (!requireNamespace("SeuratDisk", quietly = TRUE)) { 
    devtools::install_git("https://gitee.com/fan_chuiqin/seurat-disk.git", 
                          force =T)}

# sargent
if (!requireNamespace("sargent", quietly = TRUE)) { 
    devtools::install_git("https://gitee.com/fan_chuiqin/PMCB-Sargent.git", 
                          force =T)}

# pagoda2 need scde package
if (!requireNamespace("scde", quietly = TRUE)) { 
  devtools::install_git("https://gitee.com/fan_chuiqin/scde.git", force =T)
}

#### create conda env
# If error (Unable to find conda binary. Is Anaconda installed) occurs, 
# please perform `reticulate::install_miniconda()`
if (! "irGSEA" %in% reticulate::conda_list()$name) {
      reticulate::conda_create("irGSEA")
}

# if python package exist
python.package <- reticulate::py_list_packages(envname = "irGSEA")$package
require.package <- c("anndata", "scanpy", "argparse", "gseapy", "decoupler")
for (i in require.package) {
  if (! i %in% python.package) {
    reticulate::conda_install(envname = "irGSEA", packages = i, pip = T,
     pip_options = "-i https://pypi.tuna.tsinghua.edu.cn/simple")
  }
}

使用教程

1.irGSEA支持Seurat 对象（V5或V4），Assay对象（V5或V4）

# 我们通过SeuratData包加载示例数据集（注释好的PBMC数据集）作为演示

#### Seurat V4对象 ####
library(Seurat)
library(SeuratData)
library(RcppML)
library(irGSEA)
data("pbmc3k.final")
pbmc3k.final <- irGSEA.score(object = pbmc3k.final,assay = "RNA", 
                             slot = "data", seeds = 123, ncores = 1,
                             min.cells = 3, min.feature = 0,
                             custom = F, geneset = NULL, msigdb = T, 
                             species = "Homo sapiens", category = "H",  
                             subcategory = NULL, geneid = "symbol",
                             method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
                             aucell.MaxRank = NULL, ucell.MaxRank = NULL, 
                             kcdf = 'Gaussian')
 Assays(pbmc3k.final)
>[1] "RNA"       "AUCell"    "UCell"     "singscore" "ssgsea"    "JASMINE"   "viper" 

#### Seurat V5对象 ####
data("pbmc3k.final")
pbmc3k.final <- SeuratObject::UpdateSeuratObject(pbmc3k.final)
pbmc3k.final2 <- CreateSeuratObject(counts = CreateAssay5Object(GetAssayData(pbmc3k.final, assay = "RNA", slot = "counts")),
                                    meta.data = pbmc3k.final[[]])
pbmc3k.final2 <- NormalizeData(pbmc3k.final2)
pbmc3k.final2 <- irGSEA.score(object = pbmc3k.final2,assay = "RNA", 
                             slot = "data", seeds = 123, ncores = 1,
                             min.cells = 3, min.feature = 0,
                             custom = F, geneset = NULL, msigdb = T, 
                             species = "Homo sapiens", category = "H",  
                             subcategory = NULL, geneid = "symbol",
                             method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
                             aucell.MaxRank = NULL, ucell.MaxRank = NULL, 
                             kcdf = 'Gaussian')
Assays(pbmc3k.final2)
[1] "RNA"       "AUCell"    "UCell"     "singscore" "ssgsea"    "JASMINE"   "viper" 

#### Assay5 对象  ####
data("pbmc3k.final")
pbmc3k.final <- SeuratObject::UpdateSeuratObject(pbmc3k.final)
pbmc3k.final3 <- CreateAssay5Object(counts = GetAssayData(pbmc3k.final, assay = "RNA", slot = "counts"))
pbmc3k.final3 <- NormalizeData(pbmc3k.final3)
pbmc3k.final3 <- irGSEA.score(object = pbmc3k.final3,assay = "RNA", 
                              slot = "counts", seeds = 123, ncores = 1,
                              min.cells = 3, min.feature = 0,
                              custom = F, geneset = NULL, msigdb = T, 
                              species = "Homo sapiens", category = "H",  
                              subcategory = NULL, geneid = "symbol",
                              method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
                              aucell.MaxRank = NULL, ucell.MaxRank = NULL, 
                              kcdf = 'Gaussian')
Assays(pbmc3k.final3)
# Assay5对象存放在RNA assay中

结果文件大小统计：

# 看起来Seurat V5和Assay 5确实存放数据会比较小
> cat(object.size(pbmc3k.final) / (1024^2), "M")
339.955 M
> cat(object.size(pbmc3k.final2) / (1024^2), "M")
69.33521 M
> cat(object.size(pbmc3k.final3) / (1024^2), "M")
69.27851 M

2.irGSEA支持的基因集打分方法

测试了不同数据大小下各种评分方法使用50个Hallmark基因集进行打分所需的时间和内存峰值，大家根据自己的电脑和时间进行酌情选择；

GSVApy、ssGSEApy 和 viperpy 分别代表 GSVA、ssGSEA 和 viper 的 Python 版本。Python版本的GSVA比R版本的GSVA节约太多时间了。

我们对singscore、ssGSEA、JASMINE、viper的内存峰值进行了优化。对于超过 50000 个细胞的数据集，我们实施了一种策略，将它们划分为5000 个细胞/单元进行评分。虽然这可以缓解内存峰值问题，但确实会延长处理时间。

3.irGSEA支持的基因集打分方法

为了方便用户获取MSigDB数据库中预先定义好的基因集，我们内置了msigdbr包进行MSigDB的基因集数据的获取。msigdbr包支持多个物种的基因集获取，以及多种基因格式的表达矩阵的输入。

①irGSEA通过内置的msigdbr包进行打分

library(Seurat)
library(SeuratData)
library(RcppML)
library(irGSEA)
data("pbmc3k.final")

#### Hallmark基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final,assay = "RNA", 
                             slot = "data", seeds = 123, ncores = 1,
                             min.cells = 3, min.feature = 0,
                             custom = F, geneset = NULL, msigdb = T, 
                             species = "Homo sapiens", category = "H",  
                             subcategory = NULL, geneid = "symbol",
                             method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
                             aucell.MaxRank = NULL, ucell.MaxRank = NULL, 
                             kcdf = 'Gaussian')

#### KEGG基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final,assay = "RNA", 
                             slot = "data", seeds = 123, ncores = 1,
                             min.cells = 3, min.feature = 0,
                             custom = F, geneset = NULL, msigdb = T, 
                             species = "Homo sapiens", category = "C2",  
                             subcategory = "CP:KEGG", geneid = "symbol",
                             method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
                             aucell.MaxRank = NULL, ucell.MaxRank = NULL, 
                             kcdf = 'Gaussian')

#### GO-BP基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final,assay = "RNA", 
                             slot = "data", seeds = 123, ncores = 1,
                             min.cells = 3, min.feature = 0,
                             custom = F, geneset = NULL, msigdb = T, 
                             species = "Homo sapiens", category = "C5",  
                             subcategory = "GO:BP", geneid = "symbol",
                             method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
                             aucell.MaxRank = NULL, ucell.MaxRank = NULL, 
                             kcdf = 'Gaussian')

②irGSEA使用最新版的MSigDB基因集进行打分

遗憾的是，msigdbr包内置的MSigDB的版本是MSigDB 2022.1。然而，目前MSigDB数据库已经更新到了2023.2，包括2023.2.Hs和2023.2.Mm。我们可以从这个链接(https://data.broadinstitute.org/gsea-msigdb/msigdb/release/)下载2023.2.Hs的gmt文件或者db.zip文件。相比gmt文件，db.zip文件包含了基因集的描述，可以用来筛选XX功能相关基因。下面的例子中，我将介绍如何筛选血管生成相关的基因集。

#### work with newest Msigdb ####

# https://data.broadinstitute.org/gsea-msigdb/msigdb/release/
# In this page, you can download human/mouse gmt file or db.zip file
# The db.zip file contains metadata information for the gene set

# load library
library(clusterProfiler)
library(tidyverse)
library(DBI)
library(RSQLite)


### db.zip ###

# download zip file and unzip zip file
zip_url <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/2023.2.Hs/msigdb_v2023.2.Hs.db.zip"
local_zip_path <- "./msigdb_v2023.2.Hs.db.zip"
download.file(zip_url, local_zip_path)
unzip(local_zip_path, exdir = "./")

# code modified by https://rdrr.io/github/cashoes/sear/src/data-raw/1_parse_msigdb_sqlite.r
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = './msigdb_v2023.2.Hs.db')
DBI::dbListTables(con)

# define tables we want to combine
geneset_db <- dplyr::tbl(con, 'gene_set')                                              # standard_name, collection_name
details_db <- dplyr::tbl(con, 'gene_set_details')                                      # description_brief, description_full
geneset_genesymbol_db <- dplyr::tbl(con, 'gene_set_gene_symbol')                       # meat and potatoes
genesymbol_db <- dplyr::tbl(con, 'gene_symbol')                                        # mapping from ids to gene symbols
collection_db <- dplyr::tbl(con, 'collection') %>% dplyr::select(collection_name, full_name)  # collection metadata

# join tables
msigdb <- geneset_db %>%
  dplyr::left_join(details_db, by = c('id' = 'gene_set_id')) %>%
  dplyr::left_join(collection_db, by = 'collection_name') %>%
  dplyr::left_join(geneset_genesymbol_db, by = c('id' = 'gene_set_id')) %>%
  dplyr::left_join(genesymbol_db, by = c('gene_symbol_id' = 'id')) %>%
  dplyr::select(collection = collection_name, subcollection = full_name, geneset = standard_name, description = description_brief, symbol) %>%
  dplyr::as_tibble() 

# clean up
DBI::dbDisconnect(con)


unique(msigdb$collection)
# [1] "C1"                 "C2:CGP"             "C2:CP:BIOCARTA"    
# [4] "C2:CP:KEGG_LEGACY"  "C2:CP:PID"          "C3:MIR:MIRDB"      
# [7] "C3:MIR:MIR_LEGACY"  "C3:TFT:GTRD"        "C3:TFT:TFT_LEGACY" 
# [10] "C4:3CA"             "C4:CGN"             "C4:CM"             
# [13] "C6"                 "C7:IMMUNESIGDB"     "C7:VAX"            
# [16] "C8"                 "C5:GO:BP"           "C5:GO:CC"          
# [19] "C5:GO:MF"           "H"                  "C5:HPO"            
# [22] "C2:CP:KEGG_MEDICUS" "C2:CP:REACTOME"     "C2:CP:WIKIPATHWAYS"
# [25] "C2:CP" 
unique(msigdb$subcollection)
# [1] "C1"                 "C2:CGP"             "C2:CP:BIOCARTA"    
# [4] "C2:CP:KEGG_LEGACY"  "C2:CP:PID"          "C3:MIR:MIRDB"      
# [7] "C3:MIR:MIR_LEGACY"  "C3:TFT:GTRD"        "C3:TFT:TFT_LEGACY" 
# [10] "C4:3CA"             "C4:CGN"             "C4:CM"             
# [13] "C6"                 "C7:IMMUNESIGDB"     "C7:VAX"            
# [16] "C8"                 "C5:GO:BP"           "C5:GO:CC"          
# [19] "C5:GO:MF"           "H"                  "C5:HPO"            
# [22] "C2:CP:KEGG_MEDICUS" "C2:CP:REACTOME"     "C2:CP:WIKIPATHWAYS"
# [25] "C2:CP" 

# convert to list[Hallmark] required by irGSEA package
msigdb.h <- msigdb %>% 
  dplyr::filter(collection=="H") %>% 
  dplyr::select(c("geneset", "symbol"))
msigdb.h$geneset <- factor(msigdb.h$geneset)
msigdb.h <- msigdb.h %>% 
  dplyr::group_split(geneset, .keep = F) %>%
  purrr::map( ~.x %>% dplyr::pull(symbol) %>% unique(.)) %>%
  purrr::set_names(levels(msigdb.h$geneset))
#### Hallmark基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
                             custom = T, geneset = msigdb.h, 
                             method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
                             kcdf = 'Gaussian')

# convert to list[go bp] required by irGSEA package
msigdb.go.bp <- msigdb %>% 
  dplyr::filter(collection=="C5:GO:BP") %>% 
  dplyr::select(c("geneset", "symbol"))
msigdb.go.bp$geneset <- factor(msigdb.go.bp$geneset)
msigdb.go.bp <- msigdb.go.bp %>% 
  dplyr::group_split(geneset, .keep = F) %>%
  purrr::map( ~.x %>% dplyr::pull(symbol) %>% unique(.)) %>%
  purrr::set_names(levels(msigdb.go.bp$geneset))
#### GO-BP基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
                             custom = T, geneset = msigdb.go.bp, 
                             method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
                             kcdf = 'Gaussian')

# convert to list[KEGG] required by irGSEA package
msigdb.kegg <- msigdb %>% 
  dplyr::filter(collection=="C2:CP:KEGG_MEDICUS") %>% 
  dplyr::select(c("geneset", "symbol"))
msigdb.kegg$geneset <- factor(msigdb.kegg$geneset)
msigdb.kegg <- msigdb.kegg %>% 
  dplyr::group_split(geneset, .keep = F) %>%
  purrr::map( ~.x %>% dplyr::pull(symbol) %>% unique(.)) %>%
  purrr::set_names(levels(msigdb.kegg$geneset))
#### KEGG基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
                             custom = T, geneset = msigdb.kegg, 
                             method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
                             kcdf = 'Gaussian')


# Look for the gene sets associated with angiogenesis from gene sets names and 
# gene sets descriptions

category <- c("angiogenesis", "vessel")
msigdb.vessel <- list()
for (i in category) {
  # Ignore case matching
  find.index.description <- stringr::str_detect(msigdb$description, pattern = regex(all_of(i), ignore_case=TRUE))
  find.index.name <- stringr::str_detect(msigdb$geneset, pattern = regex(all_of(i), ignore_case=TRUE))
  msigdb.vessel[[i]] <- msigdb[find.index.description | find.index.name, ] %>% mutate(category = i)
  
}
msigdb.vessel <- do.call(rbind, msigdb.vessel)

head(msigdb.vessel)
# # A tibble: 6 × 6
# collection subcollection                      geneset            description    symbol category
# <chr>      <chr>                              <chr>              <chr>          <chr>  <chr>   
#   1 C2:CGP     Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … HECW1  angioge…
# 2 C2:CGP     Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … JADE2  angioge…
# 3 C2:CGP     Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … SEMA3C angioge…
# 4 C2:CGP     Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … STUB1  angioge…
# 5 C2:CGP     Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … FAH    angioge…
# 6 C2:CGP     Chemical and Genetic Perturbations HU_ANGIOGENESIS_UP Up-regulated … COL7A1 angioge…

length(unique(msigdb.vessel$geneset))
# [1] 112

# convert gene sets associated with angiogenesis to list 
# required by irGSEA package
msigdb.vessel <- msigdb.vessel %>% 
  dplyr::select(c("geneset", "symbol"))
msigdb.vessel$geneset <- factor(msigdb.vessel$geneset)
msigdb.vessel <- msigdb.vessel %>% 
  dplyr::group_split(geneset, .keep = F) %>%
  purrr::map( ~.x %>% dplyr::pull(symbol) %>% unique(.)) %>%
  purrr::set_names(levels(msigdb.vessel$geneset))
#### 血管生成相关基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
                             custom = T, geneset = msigdb.vessel, 
                             method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
                             kcdf = 'Gaussian')

### gmt file ###

# download gmt file
gmt_url <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/2023.2.Hs/msigdb.v2023.2.Hs.symbols.gmt"
local_gmt <- "./msigdb.v2023.2.Hs.symbols.gmt"
download.file(gmt_url , local_gmt)

msigdb <- clusterProfiler::read.gmt("./msigdb.v2023.2.Hs.symbols.gmt")

# convert to list[hallmarker] required by irGSEA package
msigdb.h <- msigdb %>% 
  dplyr::filter(str_detect(term, pattern = regex("HALLMARK_", ignore_case=TRUE)))
msigdb.h$term <- factor(msigdb.h$term)
msigdb.h <- msigdb.h %>% 
  dplyr::group_split(term, .keep = F) %>%
  purrr::map( ~.x %>% dplyr::pull(gene) %>% unique(.)) %>%
  purrr::set_names(levels(msigdb.h$term))
#### Hallmark基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
                             custom = T, geneset = msigdb.h, 
                             method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
                             kcdf = 'Gaussian')

# convert to list[go bp] required by irGSEA package

msigdb.go.bp <- msigdb %>% 
  dplyr::filter(str_detect(term, pattern = regex("GOBP_", ignore_case=TRUE)))
msigdb.go.bp$term <- factor(msigdb.go.bp$term)
msigdb.go.bp <- msigdb.go.bp %>% 
  dplyr::group_split(term, .keep = F) %>%
  purrr::map( ~.x %>% dplyr::pull(gene) %>% unique(.)) %>%
  purrr::set_names(levels(msigdb.go.bp$term))
#### GO-BP基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
                             custom = T, geneset = msigdb.go.bp, 
                             method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
                             kcdf = 'Gaussian')

# convert to list[KEGG] required by irGSEA package
msigdb.kegg <- msigdb %>% 
  dplyr::filter(str_detect(term, pattern = regex("KEGG_", ignore_case=TRUE)))
msigdb.kegg$term <- factor(msigdb.kegg$term)
msigdb.kegg <- msigdb.kegg %>% 
  dplyr::group_split(term, .keep = F) %>%
  purrr::map( ~.x %>% dplyr::pull(gene) %>% unique(.)) %>%
  purrr::set_names(levels(msigdb.kegg$term))
#### KEGG基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
                             custom = T, geneset = msigdb.kegg, 
                             method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
                             kcdf = 'Gaussian')

③irGSEA使用clusterProfiler包同款基因集进行打分

#### work with clusterProfiler package ####
# load library
library(clusterProfiler)
library(tidyverse)

### kegg ###
# download kegg pathway (human) and write as gson file
kk <- clusterProfiler::gson_KEGG(species = "hsa")
gson::write.gson(kk, file = "./KEGG_20231128.gson")

# read gson file
kk2 <- gson::read.gson("./KEGG_20231123.gson")
# Convert to a data frame
kegg.list <- dplyr::left_join(kk2@gsid2name,
                              kk2@gsid2gene,
                              by = "gsid")
head(kegg.list)
# gsid               name      gene
# 1 hsa01100 Metabolic pathways        10
# 2 hsa01100 Metabolic pathways       100
# 3 hsa01100 Metabolic pathways     10005
# 4 hsa01100 Metabolic pathways     10007
# 5 hsa01100 Metabolic pathways 100137049
# 6 hsa01100 Metabolic pathways     10020

# Convert gene ID to gene symbol
gene_name <- clusterProfiler::bitr(kegg.list$gene, 
                                   fromType = "ENTREZID", 
                                   toType = "SYMBOL", 
                                   OrgDb = "org.Hs.eg.db")
kegg.list <- dplyr::full_join(kegg.list,
                              gene_name,
                              by = c("gene"="ENTREZID"))
# remove NA value if exist
kegg.list <- kegg.list[complete.cases(kegg.list[, c("gene", "SYMBOL")]), ]
head(kegg.list)
# gsid               name      gene  SYMBOL
# 1 hsa01100 Metabolic pathways        10    NAT2
# 2 hsa01100 Metabolic pathways       100     ADA
# 3 hsa01100 Metabolic pathways     10005   ACOT8
# 4 hsa01100 Metabolic pathways     10007  GNPDA1
# 5 hsa01100 Metabolic pathways 100137049 PLA2G4B
# 6 hsa01100 Metabolic pathways     10020     GNE

# convert to list required by irGSEA package
kegg.list$name <- factor(kegg.list$name)
kegg.list <- kegg.list %>% 
  dplyr::group_split(name, .keep = F) %>%
  purrr::map( ~.x %>% dplyr::pull(SYMBOL) %>% unique(.)) %>%
  purrr::set_names(levels(kegg.list$name))
head(kegg.list)
### 整理好之后就可以进行KEGG打分
#### KEGG基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
                             custom = T, geneset = kegg.list, 
                             method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
                             kcdf = 'Gaussian')

### go bp ###
# download go bp (human) and write as gson file
go <- clusterProfiler::gson_GO(OrgDb = "org.Hs.eg.db", ont = "BP")
gson::write.gson(go, file = "./go_20231128.gson")

# read gson file
go2 <- gson::read.gson("./go_20231128.gson")

# Convert to a data frame
go.list <- dplyr::left_join(go2@gsid2name,
                            go2@gsid2gene,
                            by = "gsid")
head(go.list)
# gsid                             name gene
# 1 GO:0000001        mitochondrion inheritance <NA>
#   2 GO:0000002 mitochondrial genome maintenance  142
# 3 GO:0000002 mitochondrial genome maintenance  291
# 4 GO:0000002 mitochondrial genome maintenance 1763
# 5 GO:0000002 mitochondrial genome maintenance 1890
# 6 GO:0000002 mitochondrial genome maintenance 2021

# Convert gene ID to gene symbol
go.list <- dplyr::full_join(go.list,
                            go2@gene2name,
                            by = c("gene"="ENTREZID"))
# remove NA value if exist
go.list <- go.list[complete.cases(go.list[, c("gene", "SYMBOL")]), ]
head(go.list)
# gsid                             name gene  SYMBOL
# 2 GO:0000002 mitochondrial genome maintenance  142   PARP1
# 3 GO:0000002 mitochondrial genome maintenance  291 SLC25A4
# 4 GO:0000002 mitochondrial genome maintenance 1763    DNA2
# 5 GO:0000002 mitochondrial genome maintenance 1890    TYMP
# 6 GO:0000002 mitochondrial genome maintenance 2021   ENDOG
# 7 GO:0000002 mitochondrial genome maintenance 3980    LIG3

# convert to list required by irGSEA package
go.list$name <- factor(go.list$name)
go.list <- go.list %>% 
  dplyr::group_split(name, .keep = F) %>%
  purrr::map( ~.x %>% dplyr::pull(SYMBOL) %>% unique(.)) %>%
  purrr::set_names(levels(go.list$name))
head(go.list)
#### GO-BP基因集打分 ####
pbmc3k.final <- irGSEA.score(object = pbmc3k.final, assay = "RNA", slot = "data",
                             custom = T, geneset = go.list, 
                             method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
                             kcdf = 'Gaussian')

4.irGSEA的完整流程

library(Seurat)
library(SeuratData)
library(RcppML)
library(irGSEA)
data("pbmc3k.final")
# 基因集打分
pbmc3k.final <- irGSEA.score(object = pbmc3k.final,assay = "RNA", 
                             slot = "data", seeds = 123, ncores = 1,
                             min.cells = 3, min.feature = 0,
                             custom = F, geneset = NULL, msigdb = T, 
                             species = "Homo sapiens", category = "H",  
                             subcategory = NULL, geneid = "symbol",
                             method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"),
                             aucell.MaxRank = NULL, ucell.MaxRank = NULL, 
                             kcdf = 'Gaussian')
# 计算差异基因集，并进行RRA
# 如果报错，考虑加句代码options(future.globals.maxSize = 100000 * 1024^5)
result.dge <- irGSEA.integrate(object = pbmc3k.final,
                               group.by = "seurat_annotations",
                               method = c("AUCell","UCell","singscore","ssgsea", "JASMINE", "viper"))
# 查看RRA识别的在多种打分方法中都普遍认可的差异基因集
geneset.show <- result.dge$RRA %>% 
  dplyr::filter(pvalue <= 0.05) %>% 
  dplyr::pull(Name) %>% unique(.)

可视化展示

1)全局展示

①热图

你还可以把method从'RRA"换成“ssgsea”，展示特定基因集富集分析方法中差异上调或差异下调的基因集；

irGSEA.heatmap.plot <- irGSEA.heatmap(object = result.dge,
                                      method = "RRA",
                                      top = 50,
                                      show.geneset = NULL)

irGSEA.heatmap.plot

默认展示前50，你也可以展示你想展示的基因集。例如，我想展示RRA识别的差异基因集。

irGSEA.heatmap.plot1 <- irGSEA.heatmap(object = result.dge, 
                                       method = "RRA",
                                       show.geneset = geneset.show)
irGSEA.heatmap.plot1

②气泡图

irGSEA.bubble.plot <- irGSEA.bubble(object = result.dge,
                                    method = "RRA",
                                     show.geneset = geneset.show)
irGSEA.bubble.plot

③upset plot

upset图展示了综合评估中每个细胞亚群具有统计学意义差异的基因集的数目，以及不同细胞亚群之间具有交集的差异基因集数目；

irGSEA.upset.plot <- irGSEA.upset(object = result.dge, 
                                  method = "RRA",
                                  mode = "intersect",
                                  upset.width = 20,
                                  upset.height = 10,
                                  set.degree = 2,
                                  pt_size = grid::unit(2, "mm"))
irGSEA.upset.plot

左边不同颜色的条形图代表不同的细胞亚群；上方的条形图代表具有交集的差异基因集的数目；中间的气泡图单个点代表单个细胞亚群，多个点连线代表多个细胞亚群取交集（）这里只展示两两取交集；

④堆叠条形图

堆叠柱状图具体展示每种基因集富集分析方法中每种细胞亚群中上调、下调和没有统计学差异的基因集数目；

irGSEA.barplot.plot <- irGSEA.barplot(object = result.dge,
                                      method = c("AUCell", "UCell", "singscore",
                                                 "ssgsea", "JASMINE", "viper", "RRA"))
irGSEA.barplot.plot

上方的条形代表每个亚群中不同方法中差异的基因数目，红色代表上调的差异基因集，蓝色代表下调的差异基因集；中间的柱形图代表每个亚群中不同方法中上调、下调和没有统计学意义的基因集的比例；

2）局部展示

①密度散点图

密度散点图将基因集的富集分数和细胞亚群在低维空间的投影结合起来，展示了特定基因集在空间上的表达水平。

scatterplot <- irGSEA.density.scatterplot(object = pbmc3k.final,
                            method = "UCell",
                            show.geneset = "HALLMARK-INFLAMMATORY-RESPONSE",
                            reduction = "umap")

scatterplot

其中，颜色越黄，密度分数越高，代表富集分数越高；

②半小提琴图

半小提琴图同时以小提琴图（左边）和箱线图（右边）进行展示。不同颜色代表不同的细胞亚群；

halfvlnplot <- irGSEA.halfvlnplot(object = pbmc3k.final,
                                  method = "UCell",
                                  show.geneset = "HALLMARK-INFLAMMATORY-RESPONSE")

halfvlnplot

除此之外，还可以

vlnplot <- irGSEA.vlnplot(object = pbmc3k.final,
                          method = c("AUCell", "UCell", "singscore", "ssgsea", 
                                     "JASMINE", "viper"),
                          show.geneset = "HALLMARK-INFLAMMATORY-RESPONSE")
vlnplot

③山峦图

山峦图中上方的核密度曲线展示了数据的主要分布，下方的条形编码图展示了细胞亚群具体的数量。不同颜色代表不同的细胞亚群，而横坐标代表不同的表达水平；

ridgeplot <- irGSEA.ridgeplot(object = pbmc3k.final,
                              method = "UCell",
                              show.geneset = "HALLMARK-INFLAMMATORY-RESPONSE")
ridgeplot

④密度热图

密度热图展示了具体差异基因在不同细胞亚群中的表达和分布水平。颜色越红，代表富集分数越高；

densityheatmap <- irGSEA.densityheatmap(object = pbmc3k.final,
                                        method = "UCell",
                                        show.geneset = "HALLMARK-INFLAMMATORY-RESPONSE")
densityheatmap

参考资料

[1]

AUCell: https://doi.org/10.1038/nmeth.4463

[2]

UCell: https://doi.org/10.1016/j.csbj.2021.06.043

[3]

singscore: https://doi.org/10.1093/nar/gkaa802

[4]

ssgsea: https://doi.org/10.1038/nature08460

[5]

JASMINE: https://doi.org/10.7554/eLife.71994

[6]

VAM: https://doi.org/10.1093/nar/gkaa582

[7]

scSE: https://doi.org/10.1093/nar/gkz601

[8]

VISION: https://doi.org/10.1038/s41467-019-12235-0

[9]

wsum: https://doi.org/10.1093/bioadv/vbac016

[10]

wmean: https://doi.org/10.1093/bioadv/vbac016

[11]

mdt: https://doi.org/10.1093/bioadv/vbac016

[12]

viper: https://doi.org/10.1038/ng.3593

[13]

GSVApy: https://doi.org/10.1038/ng.3593

[14]

gficf: https://doi.org/10.1093/nargab/lqad024

[15]

GSVA: https://doi.org/10.1186/1471-2105-14-7

[16]

zscore: https://doi.org/10.1371/journal.pcbi.1000217

[17]

plage: https://doi.org/10.1186/1471-2105-6-225

[18]

ssGSEApy: https://doi.org/10.1093/bioinformatics/btac757

[19]

viperpy: https://doi.org/10.1093/bioinformatics/btac757

[20]

AddModuleScore: https://doi.org/10.1126/science.aad0501

[21]

pagoda2: https://doi.org/10.1038/nbt.4038

[22]

Sargent: https://doi.org/10.1016/j.mex.2023.102196

文末交流群

R包开发是学术性质，也是公益的，所以交流群并不会设置门票也不会有二次收费。但是希望是一些高质量数据分析实战派小伙伴进群交流，可以是关于irGSEA的开发建议和意见，比如其它 Functional Class Scoring (FCS)方法，其它可视化方法，其它统计学算法的植入。如果仅仅是有关于irGSEA的使用方法疑问，可以直接看官方文档即可哈。

ixxmu / mp_duty