端到端的单细胞管道SCP-整合流程 by 单细胞天地

分享是一种态度

举杯邀明月，对影成三人

本章介绍SCP中对于单细胞数据的整合流程，适用于批次效应显著的多样本或多批次数据。

主要函数： Integration_SCP；
SCP版本：0.5.6；Seurat版本：v4.4.0；

Integration_SCP函数

Integration_SCP是针对具有批次效应单细胞数据整合流程。在前处理的一般步骤上与Standard_SCP相同（例如数据标准化、高变异基因检测等），其余步骤根据各批次整合算法的原理和官方教程进行了规范和调整。

整合流程共包括了一个非矫正（Uncorrected）流程，以及11个算法（Seurat，scVI，MNN，fastMNN，Harmony，Scanorama，BBKNN，CSS，LIGER，Conos，ComBat）的整合流程。

Integration_SCP函数在参数上整体与Standard_SCP相同，具体参数说明请查阅Integration_SCP函数文档^[1]，以下为一些特殊的参数选项：

输入可以是一个包含有批次信息的Seurat对象（srtMerge），也可以是一个包含了多个Seurat对象的list（srtList），两者均需要指定batch参数。注意当输入为list时，要确保list中的每个Seurat对象均包含有指定的batch信息。
HVF_source参数：HVF_source可以指定HVF的来源，选项分别为global或separete。
默认值：HVF_source="separate" HVF_min_intersection=1

HVF_source="global"，意味着将所有批次的细胞合并在一起，进行高变基因检测，也就是说在Seurat::FindVariableFeatures检测HVF过程里，将使用所有批次所有细胞基因的均值和方差拟合loess模型，这种方法得到的HVF一般会导致后续分析具有较大的批次效应；
HVF_source="separate"，意味着高变基因检测分别在各自批次的细胞中进行，然后使用Seurat::SelectIntegrationFeatures筛选用于整合的高变基因，此外还可以指定一个额外参数HVF_min_intersection，来进一步要求需要至少在N个批次中被检测为高变基因，N越大HVF数量越小，这种方法得到的HVF一般会减小批次效应（见后续）。

各整合算法一般会输出矫正后的embedding、graph、feature-cellmatrix至少一项，同时输出多项的时候按顺序优先使用，其中：

整合算法若只输出矫正后的feature-cellmatrix的话，意味着前处理没有进行线性降维，需要对矫正后矩阵进一步做线性降维，例如Seurat、MNN、ComBat等，此时linear_reduction_dims_use用于控制最终要使用的线性降维后向量空间的维度，默认将使用自动估计的内在维度；
矫正后的embedding理论上应当使用所有的dims（默认，一般情况下效果最好），当矫正效果不佳时也可以手动指定，例如scVI_dims_use、fastMNN_dims_use、Harmony_dims_use、Scanorama_dims_use、CSS_dims_use、LIGER_dims_use等；

除了一般步骤可以传递额外参数外，各整合算法主要函数的参数也可以通过Integration_SCP进行传递：

整合算法	参数名称	对应函数
Seurat	FindIntegrationAnchors_params	Seurat::FindIntegrationAnchors
Seurat	IntegrateData_params	Seurat::IntegrateData
Seurat	IntegrateEmbeddings_params	Seurat::IntegrateEmbeddings
scVI	SCVI_params	scvi.model.SCVI*
scVI	PEAKVI_params	scvi.model.PEAKVI*
MNN	mnnCorrect_params	batchelor::mnnCorrect
fastMNN	fastMNN_params	batchelor::fastMNN
Harmony	RunHarmony_params	harmony::RunHarmony
Scanorama	Scanorama_params	scanorama.correct*
BBKNN	bbknn_params	bbknn.matrix.bbknn*
CSS	CSS_params	simspec::cluster_sim_spectrum
LIGER	optimizeALS_params	rliger::optimizeALS
LIGER	quantilenorm_params	rliger::quantile_norm
Conos	buildGraph_params	conos::Conos
ComBat	ComBat_params	sva::ComBat

*表示该函数为python函数

函数示例

使用下采样后的8个成人胰腺单细胞数据集为例，该数据中共含有8个样本，5种单细胞建库技术平台，通过在R中运行?panc8_sub来查看该示例数据相关信息。

library(SCP)
library(Seurat)
data("panc8_sub")
panc8_sub
##> An object of class Seurat 
##> 13716 features across 1600 samples within 1 assay 
##> Active assay: RNA (13716 features, 0 variable features)
head(panc8_sub[[]])
##>          orig.ident nCount_RNA nFeature_RNA   tech replicate assigned_cluster
##> D71_58          D71   9915.908         3669 celseq    celseq             <NA>
##> D102_7         D102   8003.868         3300 celseq    celseq             <NA>
##> D3en4_73      D3en4   6643.200         3024 celseq    celseq             <NA>
##> D73_92          D73   6503.113         2901 celseq    celseq             <NA>
##> D1713_56      D1713  11674.000         4156 celseq    celseq             <NA>
##> D1713_19      D1713   5161.300         2150 celseq    celseq             <NA>
##>          celltype dataset
##> D71_58      alpha  celseq
##> D102_7       beta  celseq
##> D3en4_73   ductal  celseq
##> D73_92       beta  celseq
##> D1713_56   ductal  celseq
##> D1713_19     beta  celseq
table(panc8_sub$dataset, panc8_sub$tech)
##>             
##>              celseq celseq2 fluidigmc1 indrop smartseq2
##>   celseq        200       0          0      0         0
##>   celseq2         0     200          0      0         0
##>   fluidigmc1      0       0        200      0         0
##>   indrop1         0       0          0    200         0
##>   indrop2         0       0          0    200         0
##>   indrop3         0       0          0    200         0
##>   indrop4         0       0          0    200         0
##>   smartseq2       0       0          0      0       200

查看批次效应

我们可以用Integration_SCP中的Uncorrected流程观察批次效应，这里指定Integration_SCP中的待矫正的批次变量为tech：

panc8_sub <- Integration_SCP(
  srtMerge = panc8_sub, batch = "tech",
  integration_method = "Uncorrected"
)

#> [2023-10-30 09:17:39] Start Uncorrected_integrate
#> [2023-10-30 09:17:39] Spliting srtMerge into srtList by column tech... ...
#> [2023-10-30 09:17:40] Checking srtList... ...
#> Data 1/5 of the srtList is raw_normalized_counts. Perform NormalizeData(LogNormalize) on the data ...
#> Perform FindVariableFeatures on the data 1/5 of the srtList...
#> Data 2/5 of the srtList is raw_normalized_counts. Perform NormalizeData(LogNormalize) on the data ...
#> Perform FindVariableFeatures on the data 2/5 of the srtList...
#> Data 3/5 of the srtList is raw_normalized_counts. Perform NormalizeData(LogNormalize) on the data ...
#> Perform FindVariableFeatures on the data 3/5 of the srtList...
#> Data 4/5 of the srtList is raw_counts. Perform NormalizeData(LogNormalize) on the data ...
#> Perform FindVariableFeatures on the data 4/5 of the srtList...
#> Data 5/5 of the srtList is raw_counts. Perform NormalizeData(LogNormalize) on the data ...
#> Perform FindVariableFeatures on the data 5/5 of the srtList...
#> Use the separate HVF from srtList...
#> [2023-10-30 09:17:44] Finished checking.
#> [2023-10-30 09:17:45] Perform integration(Uncorrected) on the data...
#> [2023-10-30 09:17:45] Start Standard_SCP
#> [2023-10-30 09:17:45] Checking srtList... ...
#> Data 1/1 of the srtList has been log-normalized.
#> [2023-10-30 09:17:46] Finished checking.
#> [2023-10-30 09:17:46] Perform ScaleData on the data...
#> [2023-10-30 09:17:47] Perform linear dimension reduction (pca) on the data...
#> [2023-10-30 09:17:48] Perform FindClusters (louvain) on the data...
#> [2023-10-30 09:17:48] Reorder clusters...
#> [2023-10-30 09:17:48] Perform nonlinear dimension reduction (umap) on the data...
#> [2023-10-30 09:17:57] Standard_SCP done
#> Elapsed time: 11.43 secs 
#> [2023-10-30 09:17:57] Uncorrected_integrate done
#> Elapsed time: 17.83 secs

CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))

注意在流程运行时的提示，panc8_sub各批次间细胞数据的处理程度是不同的，有两个数据集是原始counts（raw_counts），有三个数据集经过了标准化但是没有经过log转换（raw_normalized_counts，比如fpkm，tpm之类的数据），所以在前处理中均进行了NormalizeData(LogNormalize)操作，在之后的示例中该操作将不再进行。

结果中批次效应比较明显：各细胞类型均被tech所分割。使用”global”HVF或者Standard_SCP能观察到更大的批次效应，这两者的结果其实是一致的：

panc8_sub <- Integration_SCP(
  srtMerge = panc8_sub, batch = "tech",
  integration_method = "Uncorrected",
  HVF_source = "global"
)


p1 <- CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))
panc8_sub <- Standard_SCP(panc8_sub)


p2 <- CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))
p1 / p2

使用整合算法去除批次效应

整合算法的调用只需要在Integration_SCP中指定integration_method参数：

panc8_sub <- Integration_SCP(
  srtMerge = panc8_sub, batch = "tech",
  integration_method = "Seurat"
)



CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))

结果中的批次效应被矫正了：相同细胞类型成群，细胞群之间的区分也比较明显。返回的Seurat对象新增的graphs、reductions、clusters等名称前缀均为所使用的整合算法：

Graphs(panc8_sub)
##> [1] "Uncorrectedpca_KNN" "Uncorrectedpca_SNN" "Standardpca_KNN"   
##> [4] "Standardpca_SNN"    "Seurat_KNN"         "Seurat_SNN"
Reductions(panc8_sub)
##>  [1] "Uncorrectedpca"    "UncorrectedUMAP2D" "UncorrectedUMAP3D"
##>  [4] "Standardpca"       "StandardpcaUMAP2D" "StandardpcaUMAP3D"
##>  [7] "StandardUMAP2D"    "StandardUMAP3D"    "Seuratpca"        
##> [10] "SeuratUMAP2D"      "SeuratUMAP3D"
colnames(panc8_sub@meta.data)
##>  [1] "orig.ident"                 "tech"                      
##>  [3] "replicate"                  "assigned_cluster"          
##>  [5] "celltype"                   "dataset"                   
##>  [7] "Uncorrectedpca_SNN_res.0.6" "Uncorrectedclusters"       
##>  [9] "Standardpca_SNN_res.0.6"    "ident"                     
##> [11] "Standardpcaclusters"        "Standardclusters"          
##> [13] "nCount_RNA"                 "nFeature_RNA"              
##> [15] "Seurat_SNN_res.0.6"         "Seuratclusters"

对于一些整合流程而言（例如Seurat），可以尝试指定不同数量的HVF和dims：

panc8_sub <- Integration_SCP(
  srtMerge = panc8_sub, batch = "tech",
  integration_method = "Seurat",
  nHVF = 3000, linear_reduction_dims_use = 1:50
)


CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))

也可以通过传入额外参数来控制整合过程，例如使用Seurat的rpca整合流程^[2]:

panc8_sub <- Integration_SCP(
  srtMerge = panc8_sub, batch = "tech",
  integration_method = "Seurat",
  FindIntegrationAnchors_params = list(reduction = "rpca")
)

CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))

我们可以比较不同整合算法下的表现，这里为了保证可比性，对于所有在整合前或整合后使用线性降维的算法，统一使用前50个维度进行后续的非线性降维和聚类：

integration_methods <- c(
  "Uncorrected", "Seurat", "scVI", "MNN", "fastMNN", "Harmony",
  "Scanorama", "BBKNN", "CSS", "LIGER", "Conos", "ComBat"
)
plist1 <- list()
for (method in integration_methods) {
  panc8_sub <- Integration_SCP(
    srtMerge = panc8_sub, batch = "tech",
    integration_method = method,
    linear_reduction_dims_use = 1:50,
    nonlinear_reduction = "umap"
  )
  plist1[[method]] <- CellDimPlot(panc8_sub,
    group.by = "celltype",
    reduction = paste0(method, "UMAP2D"),
    xlab = "", ylab = "", title = method,
    legend.position = "none", theme_use = "theme_blank"
  )
}


patchwork::wrap_plots(plotlist = plist1)

与Standard_SCP略有不同的是，Integration_SCP单次只能使用一种线性降维方法以及多种非线性降维方法，以Seurat整合流程为例：

panc8_sub <- Integration_SCP(
  srtMerge = panc8_sub, batch = "tech",
  integration_method = "Seurat",
  linear_reduction_dims_use = 1:50,
  nonlinear_reduction = c("umap", "tsne", "dm", "phate", "pacmap", "trimap", "largevis", "fr")
)



plist2 <- lapply(c("umap", "tsne", "dm", "phate", "pacmap", "trimap", "largevis", "fr"), function(nr) {
  CellDimPlot(panc8_sub,
    group.by = "celltype",
    reduction = paste0("Seurat", nr, "2D"),
    xlab = "", ylab = "", title = nr,
    legend.position = "none",
    theme_use = "theme_blank"
  )
})
patchwork::wrap_plots(plotlist = plist2)

另外注意，BBKNN和Conos流程的整合输出均为graph，所以非线性降维方法只能使用umap或者fr，例如：

panc8_sub <- Integration_SCP(
  srtMerge = panc8_sub, batch = "tech",
  integration_method = "BBKNN",
  linear_reduction_dims_use = 1:50,
  nonlinear_reduction = c("umap", "fr")
)



plist3 <- lapply(c("umap", "fr"), function(nr) {
  CellDimPlot(panc8_sub,
    group.by = "celltype",
    reduction = paste0("BBKNN", nr, "2D"),
    xlab = "", ylab = "", title = nr,
    legend.position = "none",
    theme_use = "theme_blank"
  )
})
patchwork::wrap_plots(plotlist = plist3)

合理地选择HVF可以减小批次效应

之前提到使用 HVF_source="separate" 结合HVF_min_intersection来检测HVF，用于下游分析时会减小批次效应。这里我们使用Uncorrected流程来说明HVF_min_intersection参数对结果的影响。因为数据包含了5种单细胞建库技术平台，HVF_min_intersection最小为1，最大为5：

plist4 <- list()
for (i in 1:5) {
  panc8_sub <- Integration_SCP(
    srtMerge = panc8_sub, batch = "tech",
    integration_method = "Uncorrected",
    HVF_min_intersection = i
  )
  plist4[[i]] <- CellDimPlot(panc8_sub,
    group.by = "celltype",
    reduction = "UncorrectedUMAP2D",
    xlab = "", ylab = "",
    title = paste0("HVF_min_intersection=", i),
    legend.position = "none",
    theme_use = "theme_blank"
  )
}


patchwork::wrap_plots(plotlist = plist4)

panc8_sub <- Integration_SCP(
  srtMerge = panc8_sub, batch = "tech",
  integration_method = "Uncorrected",
  HVF_min_intersection = 5,
  scale_within_batch = TRUE
)
#> [2023-11-21 16:45:52] Start Uncorrected_integrate
#> [2023-11-21 16:45:53] Spliting srtMerge into srtList by column tech... ...
#> [2023-11-21 16:46:37] Checking srtList... ...
#> Data 1/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 1/5 of the srtList...
#> Data 2/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 2/5 of the srtList...
#> Data 3/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 3/5 of the srtList...
#> Data 4/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 4/5 of the srtList...
#> Data 5/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 5/5 of the srtList...
#> Use the separate HVF from srtList...
#> [2023-11-21 16:46:43] Finished checking.
#> [2023-11-21 16:46:49] Perform integration(Uncorrected) on the data...
#> [2023-11-21 16:46:49] Perform ScaleData on the data...
#> [2023-11-21 16:46:50] Perform linear dimension reduction (pca) on the data...
#> [2023-11-21 16:46:51] Perform FindClusters (louvain) on the data...
#> [2023-11-21 16:46:51] Reorder clusters...
#> [2023-11-21 16:46:52] Perform nonlinear dimension reduction (umap) on the data...
#> [2023-11-21 16:47:04] Uncorrected_integrate done
#> Elapsed time: 1.19 mins

CellDimPlot(panc8_sub,
  group.by = c("tech", "celltype"),
  title = "scale_within_batch"
)

图上可以看出，随着HVF_min_intersection升高，批次效应逐渐减小，HVF_min_intersection=5时批次效应几乎消除，如果再结合批次内的scale策略，此时结果已经媲美整合算法。

为什么不用整合算法，仅仅选择一部分HVF就可以达到这种效果？

实际上，一个数据集中的HVF反应的是整个数据集内的，也就是该数据集所有的细胞和细胞之间差异较大的基因（稳定方差后依然高于预期方差的基因），这些HVF可以用来表征、衡量和代表这个数据集。

什么样的基因在细胞之间变异较大？

细胞类型的主要marker，或者是细胞状态的marker（例如反应分化程度、细胞周期等相关的基因，表达较稳定）；
随机的，或是具有较大变异的、但无关紧要的噪声基因（可能是由于生物因素导致、也可能是细胞建库测序技术或其他批次因素导致的表达波动）等等。

对于每个数据集来讲，所检测到的HVF组成并不相同，尤其是在批次效应较大数据集之间。可以预期的是，大概率上，各批次中的噪声基因是不同的，但是细胞状态或细胞类型的组成有重叠甚至完全相同。

因此，当使用所有数据集HVF的交集时，保留下的基因大部分也就是各细胞类型、反应细胞主要特征的marker了，这些marker在各批次中的表达特异性是保守的（比如基因GCG在任何批次中均在Alpha细胞中最高表达），借助它们从而达到去除噪声、去除批次效应的效果。

进一步思考，如果先单独分析各数据集间的细胞类型，鉴定出批次之间保守的各细胞类型marker，用这些基因进行整合应该也可以达到去批次效应的目的。这种方式相对于全局的无监督整合方法而言，更加依赖于每个数据集内部的标注，所以是一种半监督整合（semi-supervisedintegration）。

那么简单地测试一下？

由于panc8_sub已经包含有细胞类型的信息，我们跳过单独分析的步骤，直接使用RunDEtest来检测介于各批次间的细胞类型保守marker（markers_type = "conserved"），然后筛选出变异较大的marker作为候选HVF（p_val_adj < 0.05 & avg_log2FC > 3）。RunDEtest可以系统鉴定差异基因、保守基因或者扰动基因等，在后续章节中将会详细说明该函数。

panc8_sub <- RunDEtest(panc8_sub, group_by = "celltype", grouping.var = "tech", markers_type = "conserved")


conserved <- dplyr::filter(panc8_sub@tools$DEtest_celltype$ConservedMarkers_wilcox, p_val_adj < 0.05 & avg_log2FC > 3)
ht <- GroupHeatmap(panc8_sub,
  slot = "data",
  features = conserved$gene, feature_split = conserved$group1,
  group.by = "tech", split.by = "celltype", within_groups = TRUE
)
ht$plot

使用这些批次间保守的细胞类型marker作为HVF，进行Uncorrected整合：

panc8_sub <- Integration_SCP(
  srtMerge = panc8_sub, batch = "tech",
  integration_method = "Uncorrected",
  HVF = conserved$gene
)


CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))

可以看见，使用了974个批次间保守的细胞类型marker，也可以消除批次效应，达到相同细胞类型被聚在一起的效果。

简化到极致呢？

选择更少、表达更特异的marker作为HVF，应当也可以达到整合的效果。

这里我们为各组选择top3的marker，实际情况中还可以手动为每种细胞类型指定特异的marker：

top <- conserved %>%
  dplyr::group_by(group1) %>%
  dplyr::top_n(3, avg_log2FC)
table(top$group1)
#> 
#>              alpha               beta             ductal             acinar 
#>                  3                  3                  3                  3 
#> activated_stellate              gamma              delta            epsilon 
#>                  3                  3                  3                  0 
#>               mast        endothelial         macrophage quiescent_stellate 
#>                  0                  3                  3                  3 
#>            schwann 
#>                  3

panc8_sub <- Integration_SCP(
  srtMerge = panc8_sub, batch = "tech",
  integration_method = "Uncorrected",
  HVF = top$gene,
  scale_within_batch = TRUE
)
#> [2023-11-21 16:48:36] Start Uncorrected_integrate
#> [2023-11-21 16:48:36] Spliting srtMerge into srtList by column tech... ...
#> [2023-11-21 16:49:19] Checking srtList... ...
#> Data 1/5 of the srtList has been log-normalized.
#> Data 2/5 of the srtList has been log-normalized.
#> Data 3/5 of the srtList has been log-normalized.
#> Data 4/5 of the srtList has been log-normalized.
#> Data 5/5 of the srtList has been log-normalized.
#> [2023-11-21 16:49:20] Finished checking.
#> [2023-11-21 16:49:25] Perform integration(Uncorrected) on the data...
#> [2023-11-21 16:49:25] Perform ScaleData on the data...
#> [2023-11-21 16:49:25] Perform linear dimension reduction (pca) on the data...
#> [2023-11-21 16:49:29] Perform FindClusters (louvain) on the data...
#> [2023-11-21 16:49:29] Reorder clusters...
#> [2023-11-21 16:49:30] Perform nonlinear dimension reduction (umap) on the data...
#> [2023-11-21 16:49:42] Uncorrected_integrate done
#> Elapsed time: 1.1 mins
CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))

实际上，因为使用HVF数量很少，已经不需要线性降维这一步骤了，直接使用这些HVF运行UMAP即可：

panc8_sub %>%
  RunUMAP(features = top$gene) %>%
  CellDimPlot(group.by = c("tech", "celltype"), reduction = "umap")

总结

单细胞数据中所存在的批次效应，反应出的是一些特征（基因）在批次间的表达波动。这类波动主要来源于各类非生物学因素，在分析中会造成许多干扰：

一方面，这些基因的波动会直接影响数据间的比较，需要借助整合算法来排除它们对于细胞邻近关系的干扰；
另一方面，在个别数据集中由这类基因所产生的结论是无法轻易外推的，因为它们极有可能是某些批次因素产生的不可重复的结果。

因此，批次效应相关的基因可以视为噪声，在数据集间的整合、比较或映射时需要排除其干扰，在其他下游分析中也需谨慎对待；

单细胞数据整合的目标是在相同类型或状态的细胞之间建立邻近关系，另外对于细胞谱系分化的数据集而言，例如胚胎发生或者精子发生过程，还需要进一步连接不同分化状态的细胞，从而保持细胞分群和分化路径在数据集间的一致性。

要实现这种统一，也就是要排除批次效应的影响，除了使用整合算法，最简单、最朴素的思想是利用一些关键的、批次间保守的特征（基因）来进行整合，正如我们所常做的，在整合后利用marker来确定各群细胞的细胞类型那样。本章节最后的例子也可以看出，仅使用几十个基因，便足以用一般的单细胞流程实现批次间的整合。

参考资料

[1]

Integration_SCP函数文档: https://zhanghao-njmu.github.io/SCP/reference/Integration_SCP.html

[2]

rpca整合流程: https://satijalab.org/seurat/articles/integration_rpca

往期回顾

如果你对单细胞转录组研究感兴趣，但又不知道如何入门，也许你可以关注一下下面的课程

看完记得顺手点个“在看”哦！

生物 | 单细胞 | 转录组丨资料

每天都精彩

长按扫码可关注

ixxmu / mp_duty

端到端的单细胞管道SCP-整合流程 #4147