空转数据分析之生态位聚类算法Banksy

空转数据分析之生态位聚类算法Banksy by 生信随笔

传统的单细胞聚类算法，例如Seurat和Scanpy Pipeline中的Louvain和Leiden等聚类算法，通常在处理空间数据时忽略了空间信息。然而，由于细胞状态受其周围细胞的影响，将转录组数据与细胞的空间信息结合起来进行聚类分析，将更有助于揭示细胞在组织中的分布和相互作用。

2024年2月，来自Shyam Prabhakar 和Kok Hao Chen研究团队的Vipul Singhal、Nigel Chou和Joseph Lee等人在Nature Genetics上发表了一篇题为《BANKSY unifies cell typing and tissue domain segmentation for scalable spatial omics data analysis》（https://www.nature.com/articles/s41588-024-01664-3）的研究文章。这项工作开发了一种名为BANKSY的空间聚类算法。BANKSY可通过结合细胞/Spots的转录组信息和其局部微环境信息，显著提升空间组学数据分析的准确性和效率。BANKSY算法不仅能够处理大规模数据集，提高细胞类型聚类和组织域分割的性能，还展现了在多种RNA和蛋白质数据集上的广泛适用性。

具体来说，BANKSY是一种聚类空间组学数据的方法，通过将每个细胞的特征与其空间近邻特征的平均值以及邻域特征梯度相结合来增强。通过将邻域信息纳入聚类中，BANKSY能够：

在嘈杂的数据中改善细胞/Spots类型分配；
区分微环境分层的微妙不同细胞/Spots类型；
确定共享相同微环境的空间域。

BANKSY官方教程详见：

github：https://github.com/prabhakarlab/Banksy/
R语言版本：https://prabhakarlab.github.io/Banksy/
Python版本：https://github.com/prabhakarlab/Banksy_py

一. BANKSY算法的工作流程

图例：a, The original gene–cell expression matrix (purple) was augmented with neighborhood-averaged expression matrices corresponding to the mean local expression (dark pink) and the AGF (light pink). Here, λ is a mixing parameter that controls the importance of cells’ own expression and neighborhood expression effects, G(r) is a radially symmetric Gaussian kernel that decays from magnitude 1 at r = 0, ‘expression’ refers to each gene’s expression level in each cell, the mean is taken over cells in the respective index cell’s neighborhood and the eiϕ**G(r) term confers gradient sensitivity to the AGF. b, Heatmap of the real and imaginary components of a gradient-sensitive AGF kernel. The plots show an unnormalized AGF kernel: the real part (cos⁡(𝜙)) senses the gradient along the x axis and the imaginary part (sin⁡(𝜙)) senses the gradient along the y axis. c, Simplified schematic of two distinct cell types in the neighbor-augmented space. The neighbor expression features, representing the local microenvironment, help to separate two clusters that would be difficult to separate based on the cells’ own expression alone. For simplicity, we show ‘pure’ microenvironments containing only a single cell type (cell type 1 in zone A and cell type 2 in zone B), although BANKSY is equally applicable to heterogeneous microenvironments containing mixtures of cell types (Extended Data Fig. 1).

一种空间聚类的策略是将细胞的空间坐标附加到它们的基因表达向量中。然而，这会导致空间上相距较远的细胞分入不同的聚类中，即使它们具有相同的转录组。一种更直观的方法是附加某种表示细胞微环境的表示。BANKSY算法采用平均邻域表达和AGF（图1a,1b）来代表每个细胞周围的转录组微环境。重要的是，AGF（图1b）可以被视为衡量每个细胞邻域内基因表达梯度，对样本旋转是不变的。接下来，这些额外的特征被用来将细胞嵌入到一个邻域增强的乘积空间中（图1c）。经过降维处理，然后在结果嵌入空间中构建图形，可以使用任何图划分算法进行聚类。默认情况下，BANKSY使用Leiden社区检测算法，因为其速度和可扩展性，虽然也提供了其他方法（Louvain、基于模型的聚类和k-means）作为选项。

为了控制细胞自身和邻域特征对嵌入中细胞-细胞距离的相对贡献，BANKSY使用一个混合参数，λ∈ [0, 1]，来权衡细胞转录组矩阵和邻域表达矩阵（平均值和AGF；图1）的贡献。较小的λ设置强调细胞自身的转录组，因此导致细胞根据细胞类型进行聚类。在极限情况下，当λ = 0时，BANKSY简化为传统的非空间聚类。通过增加λ，增加邻域签名的影响，可以将BANKSY从细胞类型聚类模式切换到组织域分割模式，导致细胞根据组织域进行聚类。

引用自【Nat.Genet. | BANKSY：基于邻域核和空间组学的可扩展分析算法统一细胞分型和组织域分割，解析局部微环境】
详见NG文章的结果和方法部分：https://www.nature.com/articles/s41588-024-01664-3

二. BANKSY的R代码实现

Step0. R包安装

BANKSY对R语言版本要求比较高：

教程要求R语言版本 >= 4.4.0

#BiocManager::install("SpatialExperiment")
BiocManager::install('Banksy')

如果R版本是4.3.0的可以通过github安装（https://github.com/prabhakarlab/Banksy/issues/13）

#BiocManager::install("SpatialExperiment")
remotes::install_github("prabhakarlab/Banksy")

官方教程提供了快速测试的代码（https://prabhakarlab.github.io/Banksy/index.html）：

Step1. 加载R包和示例数据

library(Banksy)

library(SummarizedExperiment)
library(SpatialExperiment)
library(scuttle)

library(scater)
library(cowplot)
library(ggplot2)

测试数据是经典的小鼠海马10X visium空转数据。

data(hippocampus)
gcm <- hippocampus$expression
gcm[1:10, 1:10]
locs <- as.matrix(hippocampus$locations)

Banksy需要用户将数据转化为SpatialExperiment对象：

se <- SpatialExperiment(assay = list(counts = gcm), 
                        spatialCoords = locs)

Step2. 质控和标准化分析

这里如果已经走过Seurat的质控流程，可以跳过QC部分。

# QC based on total counts
qcstats <- perCellQCMetrics(se)
thres <- quantile(qcstats$total, c(0.05, 0.98))
keep <- (qcstats$total > thres[1]) & (qcstats$total < thres[2])
se <- se[, keep]

# Normalization to mean library size
se <- computeLibraryFactors(se)
aname <- "normcounts"
assay(se, aname) <- normalizeCounts(se, log = FALSE)

Step3. 计算BANKSY的邻域矩阵

设置compute_agf=TRUE计算加权邻域平均值（ℳ）和方位Gabor滤波器（𝒢）。用于计算ℳ和𝒢的空间近邻个数分别为k_geom[1]=15和k_geom[2]=30。我们在lambda=0处运行BANKSY，对应于非空间聚类;lambda=0.2对应于BANKSY空间细胞聚类。

lambda <- c(0, 0.2)
k_geom <- c(15, 30)

se <- Banksy::computeBanksy(se, assay_name = aname, compute_agf = TRUE, k_geom = k_geom)

Step4. PCA、降维和聚类

接下来，在BANKSY矩阵上运行PCA并运行聚类。设置use_agf=TRUE使用ℳ和𝒢构建BANKSY矩阵。

set.seed(1000)
se <- Banksy::runBanksyPCA(se, use_agf = TRUE, lambda = lambda)
se <- Banksy::runBanksyUMAP(se, use_agf = TRUE, lambda = lambda)
se <- Banksy::clusterBanksy(se, use_agf = TRUE, lambda = lambda, resolution = 1.2)

可以使用connectClusters函数将不同的聚类结果重新标记，以最小化它们之间的差异。

se <- Banksy::connectClusters(se)
#> clust_M1_lam0.2_k50_res1.2 --> clust_M1_lam0_k50_res1.2

Step5. 可视化

可视化非空间聚类(lambda=0)和BANKSY聚类(lambda=0.2)的聚类结果：

cnames <- colnames(colData(se))
cnames <- cnames[grep("^clust", cnames)]
colData(se) <- cbind(colData(se), spatialCoords(se))

plot_nsp <- plotColData(se,
    x = "sdimx", y = "sdimy",
    point_size = 0.6, colour_by = cnames[1]
)
plot_bank <- plotColData(se,
    x = "sdimx", y = "sdimy",
    point_size = 0.6, colour_by = cnames[2]
)


plot_grid(plot_nsp + coord_equal(), plot_bank + coord_equal(), ncol = 2)

我们还可以分别可视化每个亚群：

plot_grid(
    plot_nsp + facet_wrap(~colour_by),
    plot_bank + facet_wrap(~colour_by),
    ncol = 2
)

可视化比较非空间和BANKSY的umap结果：

rdnames <- reducedDimNames(se)

umap_nsp <- plotReducedDim(se,
    dimred = grep("UMAP.*lam0$", rdnames, value = TRUE),
    colour_by = cnames[1]
)
umap_bank <- plotReducedDim(se,
    dimred = grep("UMAP.*lam0.2$", rdnames, value = TRUE),
    colour_by = cnames[2]
)
plot_grid(
    umap_nsp,
    umap_bank,
    ncol = 2
)

三. BANKSY的函数和关键参数

3.1 R包BANKSY封装的函数

Neighborhood matrix computation	Annotation
`computeBanksy()`	Compute the component neighborhood matrices for the BANKSY matrix.
`getBanksyMatrix()`	Builds the BANKSY matrix from neighborhood matrices.
Dimensionality reduction
`runBanksyPCA()`	Run PCA on a BANKSY matrix.
`runBanksyUMAP()`	Run UMAP on a BANKSY embedding.
Clustering
`clusterBanksy()`	Perform clustering in BANKSY's neighborhood-augmented feature space.
`connectClusters()`	Relabel cluster labels across parameter runs to maximise their similarity.
`compareClusters()`	Compare cluster outputs based on various clustering comparison measures.
`smoothLabels()`	k-Nearest neighbor cluster label smoothing.
`clusterNames()`	Get names of clustering runs.
Datasets
`hippocampus`	Mouse Hippocampus VeraFISH data
`rings`	An unrealistic simulation of spatially-resolved omics data.
Misc.
`simulateDataset()`	Simulate an unrealistic spatial omics dataset.

3.2 相关函数的关键参数

BANKSY的核心函数有一些关键参数，在这里作者做了详细的介绍（https://prabhakarlab.github.io/Banksy/articles/parameter-selection.html）：

3.2.1 `computeBanksy()`:

k_geom <- c(15, 30)
se <- computeBanksy(se, assay_name = aname, compute_agf = TRUE, k_geom = k_geom)

AGF usage: For characterising neighborhoods, BANKSY computes the weighted neighborhood mean (H_0) and the azimuthal Gabor filter (H_1), which estimates gene expression gradients. Setting compute_agf=TRUE computes both H_0 and H_1.
k-geometric: k_geom specifies the number of neighbors used to compute each H_m for m=0,1. If a single value is specified, the same k_geom will be used for each feature matrix. Alternatively, multiple values of k_geom can be provided for each feature matrix. Here, we use k_geom[1]=15 and k_geom[2]=30 for H_0 and H_1 respectively. More neighbors are used to compute gradients.

computeBanksy populates the assays slot with H_0 and H_1 in this instance:

se
#> class: SpatialExperiment 
#> dim: 120 10205 
#> metadata(1): BANKSY_params
#> assays(4): counts normcounts H0 H1
#> rownames(120): Sparcl1 Slc1a2 ... Notch3 Egfr
#> rowData names(0):
#> colnames(10205): cell_1276 cell_691 ... cell_11635 cell_10849
#> colData names(4): sample_id sdimx sdimy sizeFactor
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : sdimx sdimy
#> imgData names(1): sample_id

3.2.2 `runBanksyPCA()`

lambda <- c(0, 0.2)
se <- runBanksyPCA(se, use_agf = c(FALSE, TRUE), lambda = lambda, seed = 1000)
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000

The lambda parameter is a mixing parameter in [0,1] which determines how much spatial information is incorporated for downstream analysis. With smaller values of lambda, BANKY operates in cell-typing mode, while at higher levels of lambda, BANKSY operates in domain-finding mode. As a starting point, we recommend lambda=0.2 for cell-typing and lambda=0.8 for zone-finding. Here, we run lambda=0 which corresponds to non-spatial clustering, and lambda=0.2 for spatially-informed cell-typing. We compute PCs with and without the AGF (H_1).

runBanksyPCA populates the reducedDims slot, with each combination of use_agf and lambda provided.

reducedDimNames(se)
#> [1] "PCA_M0_lam0"   "PCA_M0_lam0.2" "PCA_M1_lam0"   "PCA_M1_lam0.2"

3.2.3 `clusterBanksy()`聚类参数

Next, we cluster the BANKSY embedding with Leiden graph-based clustering. This admits two parameters: k_neighbors and resolution. k_neighbors determines the number of k nearest neighbors used to construct the shared nearest neighbors graph. Leiden clustering is then performed on the resultant graph with resolution resolution. For reproducibiltiy we set a seed for each parameter combination.

k <- 50
res <- 1
se <- clusterBanksy(se, use_agf = c(FALSE, TRUE), lambda = lambda, k_neighbors = k, resolution = res, seed = 1000)
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000

clusterBanksy populates colData(se) with cluster labels:

colnames(colData(se))
#> [1] "sample_id"                "sdimx"                   
#> [3] "sdimy"                    "sizeFactor"              
#> [5] "clust_M0_lam0_k50_res1"   "clust_M0_lam0.2_k50_res1"
#> [7] "clust_M1_lam0_k50_res1"   "clust_M1_lam0.2_k50_res1"

Comparing cluster results

To compare clustering runs visually, different runs can be relabeled to minimise their differences with connectClusters:

se <- connectClusters(se)
#> clust_M1_lam0_k50_res1 --> clust_M0_lam0_k50_res1
#> clust_M0_lam0.2_k50_res1 --> clust_M1_lam0_k50_res1
#> clust_M1_lam0.2_k50_res1 --> clust_M0_lam0.2_k50_res1

Visualise spatial coordinates with cluster labels.

cnames <- colnames(colData(se))
cnames <- cnames[grep("^clust", cnames)]
cplots <- lapply(cnames, function(cnm) {
    plotColData(se, x = "sdimx", y = "sdimy", point_size = 0.1, colour_by = cnm) +
        coord_equal() +
        labs(title = cnm) +
        theme(legend.title = element_blank()) +
        guides(colour = guide_legend(override.aes = list(size = 2)))
})

plot_grid(plotlist = cplots, ncol = 2)

Compare all cluster outputs with compareClusters. This function computes pairwise cluster comparison metrics between the clusters in colData(se) based on adjusted Rand index (ARI):

compareClusters(se, func = "ARI")
#>                          clust_M0_lam0_k50_res1 clust_M0_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1                    1.000                     0.67
#> clust_M0_lam0.2_k50_res1                  0.670                     1.00
#> clust_M1_lam0_k50_res1                    1.000                     0.67
#> clust_M1_lam0.2_k50_res1                  0.747                     0.87
#>                          clust_M1_lam0_k50_res1 clust_M1_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1                    1.000                    0.747
#> clust_M0_lam0.2_k50_res1                  0.670                    0.870
#> clust_M1_lam0_k50_res1                    1.000                    0.747
#> clust_M1_lam0.2_k50_res1                  0.747                    1.000

or normalized mutual information (NMI):

compareClusters(se, func = "NMI")
#>                          clust_M0_lam0_k50_res1 clust_M0_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1                    1.000                    0.741
#> clust_M0_lam0.2_k50_res1                  0.741                    1.000
#> clust_M1_lam0_k50_res1                    1.000                    0.741
#> clust_M1_lam0.2_k50_res1                  0.782                    0.915
#>                          clust_M1_lam0_k50_res1 clust_M1_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1                    1.000                    0.782
#> clust_M0_lam0.2_k50_res1                  0.741                    0.915
#> clust_M1_lam0_k50_res1                    1.000                    0.782
#> clust_M1_lam0.2_k50_res1                  0.782                    1.000

See ?compareClusters for the full list of comparison measures.

四. 总结

相较于传统的单细胞/空转聚类算法，BANKSY在细胞/Spots的表达矩阵的基础上，进一步联合了空间坐标信息，这种策略能够有效提高细胞/Spots分类的准确性和效率，进而揭示细胞间的相互作用和微环境影响。此外，BANKSY在处理大规模数据集时运行速度非常快，且兼容多样本分析以及传统的整合分析算法如Harmony。最后，BANKSY适用于多种空间数据类型（例如 10x Visium、Slide-seq、MERFISH、CosMX、CODEX），当然也适合最新的10X Visium HD空转数据~

下面是来自推文【Nat.Genet. | BANKSY：基于邻域核和空间组学的可扩展分析算法统一细胞分型和组织域分割，解析局部微环境】的点评：

BANKSY算法的核心理念，即一个细胞的状态和功能可以通过其所处的微环境来更准确地识别，启发了未来算法开发的方向。在很多生物学研究中，细胞的空间位置对其功能的影响是不可忽视的，而BANKSY通过将空间邻域描述符与细胞自身的基因表达向量相结合，提供了一种全新的视角来理解细胞在组织中的角色。此外，BANKSY的可扩展性和对参数变化的鲁棒性，也为处理大规模空间组学数据集提供了有效的解决方案。这种将生物信息学与机器学习技术相结合的方法，不仅为空间组学数据分析提供了新的工具，也为未来在其他组学数据分析中整合不同类型信息提供了可能的方向。这种跨学科的融合，特别是在生物学和计算科学之间的交叉，将是推动生命科学研究进步的关键。

ixxmu / mp_duty