肿瘤异质性强？试试这个样本整合包！ by 生信菜鸟团

Poisson scRNA Integration of Mixed Unknown Signals（PRIMUS）：混杂未知信号的泊松scRNA数据整合

背景介绍

PRIMUS 来源于一篇Sci Adv的文献Longitudinal single-cell RNA-seq analysis reveals stress-promoted chemoresistance in metastatic ovarian cancer[1]。作者期望通过研究化疗前后卵巢癌样本的单细胞转录组层面的变化，然而由于卵巢癌肿瘤特有的异质性，不同肿瘤之间细胞成分差异很大，多样本之间整合存在困难。而基于以往的Seurat v3, Harmony, LIGER, mnnCorrect, and fastMNN 以及三种Bulk RNAseq的整合方法（ComBat、ComBat-seq，and limma）整合仍然不能很好的解决这个问题。基于此，作者团队开发了PRIMUS。

PRIMUS是一种整体聚类方法，能够从 scRNA-seq 数据中识别表型细胞群，同时考虑日期来源(例如患者，样本，数据集)特定的组分以及技术噪声。PRIMUS 采用双线性泊松回归模型，将表达数据同时分解为明确的干扰因素（defined nuisance factors）、未定义的细胞表型及其相应的转录组学特征。

[1] Zhang K.et al.Longitudinal single-cell RNA-seq analysis reveals stress-promoted chemoresistance in metastatic ovarian cancer. Sci Adv. 2022 Feb 25;8(8):eabm1831. doi: 10.1126/sciadv.abm1831. Epub 2022 Feb 23. PMID: 35196078

PRIMUS 安装

PRIMUS 存储在Github，因此，可以使用devtools 安装。

devtools::install_github("KaiyangZ/PRIMUS")

Vignette 文档链接：https://htmlpreview.github.io/?https://github.com/KaiyangZ/PRIMUS/blob/master/vignettes/quickstart.html

PRIMUS 流程使用

数据准备

在上期我们说过，一般情况下作者都会提供包的使用方法及示例数据，但事情总有例外。PRIMUS R包的示例数据就如此。作者提供的Counts 文件和Meta 文件都不明原因丢失了，这个时候就需要自己下载其他的示例文件。在这里，我选择作者文章中用到的胰腺癌数据（https://github.com/JinmiaoChenLab/Batch-effect-removal-benchmarking/tree/master/Data/dataset4）作为示例数据进行演示。

图2 Methods 截图

而Github 页面显示，该数据存放在Docker，

图3 Github 页面截图

点开链接，发现这里用到了sudo 命令，一般情况下，我们使用的linux 服务器很难有sudo 权限。在我用的服务器中，singularity 替代了docker 的用法。

因此，这里可以使用singularity 获取胰腺癌数据。singularity 学习地址（http://www.xtaohub.com/Container-Tech/Singularity-in-nutshell.html）

singularity build --sandbox ./batcheffect  docker://jinmiaochenlab/batch-effect-removal-benchmarking
cp  ./batcheffect/batch_effect/dataset4 ./dataset4

scRNA 前期处理

数据读入

该胰腺癌数据来源于A benchmark of batch-effect correction methods for single-cell RNA sequencing data" (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1850-9，感兴趣可以阅读原文献。在这里，我们先读入数据，并进行进行前期的简单处理，

library(SingleCellExperiment)
library(scuttle)
library(data.table)
library(dplyr)
rm(list = ls())
# 读入数据
counts = fread('./dataset4/myData_pancreatic_5batches.txt.gz') %>% as.data.frame()
rownames(counts) <- counts$V1; counts <- counts[,2:ncol(counts)]
dim(counts)
#> [1] 15558 14767
head(counts)[1:5,1:2]
#>        human1_lib1.final_cell_0001 human1_lib1.final_cell_0002
#>  A1BG                             0                           0
#>  A1CF                             4                           0
#>  A2M                              0                           0
#>  A2ML1                            0                           0
#>  A4GALT                           0                           0
meta = fread('./dataset4/mySample_pancreatic_5batches.txt.gz',) %>% as.data.frame()
rownames(meta) <- meta$V1; meta <- meta[,2:ncol(meta)]
dim(meta)
#> [1] 14767     5
colnames(meta)
#>  [1] "batch"         "batchlb"       "celltype_orig" "cellname"      "celltype"
head(meta)[1:5,1:3]
#>                            batch  batchlb celltype_orig
#> human1_lib1.final_cell_0001     1 Baron_b1        acinar
#> human1_lib1.final_cell_0002     1 Baron_b1        acinar
#> human1_lib1.final_cell_0003     1 Baron_b1        acinar
#> human1_lib1.final_cell_0004     1 Baron_b1        acinar
#> human1_lib1.final_cell_0005     1 Baron_b1        acinar

细胞抽样

读入数据后简单查看了下数据，不难看出，该胰腺癌数据含有14767个细胞，5个样本，为了精简流程，减少运行时间，后续对样本进行随机抽样，以减少数据量。

### 抽样前查看各细胞分布 ###
table(meta$celltype, meta$batch) 
#>                  1    2    3    4    5
#>   acinar        958  219  185    6    0
#>   alpha        2326  812  886  190  886
#>   beta         2525  448  270  111  472
#>   delta         601  193  114    9   49
#>   ductal       1077  245  386   96    0
#>   endothelial   252   21   16    0    0
#>   epsilon        18    3    7    0    0
#>   gamma         255  101  197   18   85
#>   macrophage     55    0    0    0    0
#>   mast           25    0    7    0    0
#>   mesenchymal     0   80    0   27    0
#>   MHC class II    0    0    5    0    0
#>   schwann        13    0    0    0    0
#>   stellate      457    0   54    0    0
#>   t_cell          7    0    0    0    0

### 由于部分细胞数稀少，因此这里考虑按每样本抽300细胞作为示例文件 ###，
allCells=rownames(meta)
allbatch = levels(factor(meta$batch))
choose_Cells = unlist(lapply(allbatch, function(x){
  cgCells = allCells[meta$batch == x]
  cg=sample(cgCells,300,replace = F)  
  cg
}))
Ct <- counts[,choose_Cells]
mt <- meta[choose_Cells,]
### 查看抽样结果 ##
table(mt$celltype, mt$batch) 
#>                  1   2   3   4   5
#>   acinar        21  28  32   5   0
#>   alpha         93 116 120 122 178
#>   beta          96  61  42  78  98
#>   delta         18  25  16   5   7
#>   ductal        34  38  49  62   0
#>   endothelial   13   4   3   0   0
#>   epsilon        2   0   1   0   0
#>   gamma          7  17  29  10  17
#>   macrophage     3   0   0   0   0
#>   mast           1   0   2   0   0
#>   mesenchymal    0  11   0  18   0
#>   MHC class II   0   0   1   0   0
#>   schwann        1   0   0   0   0
#>   stellate      11   0   5   0   0

T 细胞因为数量是在是太少了，一个都没有抽到，不过不影响我们后续演示。

创建SingleCellExperiment 对象

PRIMUS 全程使用的是 SingleCellExperiment 对象，这里也使用该对象。

simData = logNormCounts(SingleCellExperiment(assays = list(counts = Ct )))
simData
#>  class: SingleCellExperiment 
#>  dim: 15558 1500 
#>  metadata(0):
#>  assays(2): counts logcounts
#>  rownames(15558): A1BG A1CF ... ZZEF1 ZZZ3
#>  rowData names(0):
#>  colnames(1500): human3_lib3.final_cell_0151 human3_lib4.final_cell_0497 ... Sample_396 Sample_698
#>  colData names(1): sizeFactor
#>  reducedDimNames(0):
#>  mainExpName: NULL
#>  altExpNames(0):
# 这里将sizeFactor 存放到mt数据中，后续需要该数据
mt$sizeFactor  <- colData(simData)$sizeFactor

作者在原文中提及的是，胰腺癌数据使用PRISM 包（https://bitbucket.org/anthakki/prism/src/master/）计算 scaling factor ,模拟数据使用 logNormCounts 计算的结果。这里另外计算实在是有点烦，我直接用了logNormCounts的结果。

umap 可视化

展示未进行 PRIMUS 的umap结果。

library(uwot)
library(ggplot2)
library(cowplot)

umap.coor.raw <- uwot::umap(t(simData@assays@data$logcounts),
                            n_components = 2, metric = "cosine",
                            min_dist = 0.3, n_neighbors = 30)

mt$UMAP_raw_1 = umap.coor.raw[, 1]
mt$UMAP_raw_2 = umap.coor.raw[, 2]

umap.raw.g <- ggplot(mt, aes(x= UMAP_raw_1, y=UMAP_raw_2, color = celltype) ) +
  geom_point(size = 0.5) +
  theme_classic() +
  xlab("UMAP_1") + 
  ylab("UMAP_2") +
  ggtitle("colored by phenotypic cell group") +
  guides(colour = guide_legend(override.aes = list(size=3)), shape = guide_legend(override.aes = list(size=3)))

umap.raw.s <- ggplot(mt, aes(x= UMAP_raw_1, y=UMAP_raw_2, color = batch) ) +
  geom_point(size = 0.5) +
  theme_classic() +
  xlab("UMAP_1") + 
  ylab("UMAP_2") +
  ggtitle("colored by sample") +
  guides(colour = guide_legend(override.aes = list(size=3)), shape = guide_legend(override.aes = list(size=3)))

cowplot::plot_grid(umap.raw.s, umap.raw.g, nrow = 1)

图4 初始 umap 展示（sample+celltype）

PRIMUS 运算

runPrimus函数从scRNA-seq数据中识别表型细胞群，同时考虑到干扰因素（这里是指样本的batch effect）。作为输入，runPrimus需要原始计数矩阵（matrix）、指定样本标签的邻接矩阵和每个细胞的大小系数。runPrimus返回一个包含识别的细胞Cluster、每个群的去噪轮廓中心点（the denoised profile centroids）、样本特定效应和模型拟合成本的列表（List）。我们可以通过多次随机重启运行PRIMUS，并选择最佳拟合（成本最低的那个）。

运算

library(PRIMUS)

# 将样本标签转换为邻近矩阵
labels2weights <- function(L, levels = sort(unique(L)))
  return (outer(L, levels, `==`) + 0L)

D = t(labels2weights(mt$batch))

# 随机运行PRIMUS 5次
# k是聚类的数量。用户可以设置不同的值，并根据BIC选择最佳的k。这里先设置k =5 
fits <- list()

Ct <- as.matrix(Ct)
for (i in seq.int(5) ){
  set.seed(i * 1234)
    fits[[i]] = runPrimus(Y = Ct, D = D, g = mt$sizeFactor, k = 5, max.iter = 100)
}

fit = fits[[which.min(sapply(fits, function(x) x$cost))]]

library(mclust)

adjustedRandIndex(fit$L, mt$celltype)
#>  [1] 0.8846919    ## 模拟数据的结果是1，0.8 的结果 也很高了   
Z = sapply(seq.int(ncol(Ct)), function(i) primus_centroid(A = fit$X %*% D[, i, drop = F],
                                                          B = Ct[, i, drop = F], g = mt$sizeFactor[i]))

# take square root for variance-stabilizing
Z.sqrt = sqrt(Z)

可视化

#可视化


umap.coor.Z <- uwot::umap(t(Z.sqrt), n_components = 2,
                          metric = "cosine",  min_dist = 0.3,
                          n_neighbors = 30, 
                          y = as.factor(fit$L),
                          target_weight = 0.05)

mt$Z_UMAP_1 = umap.coor.Z[, 1]
mt$Z_UMAP_2 = umap.coor.Z[, 2]
mt$cluster = as.factor(fit$L)
mt$celltype
umap.Z.g <- ggplot(mt, aes(x= Z_UMAP_1, y=Z_UMAP_2, color = celltype) ) +
  geom_point(size = 0.5) +
  theme_classic() +
  ggtitle("colored by phenotypic cell group") + 
  guides(colour = guide_legend(override.aes = list(size=3)), shape = guide_legend(override.aes = list(size=3)))

umap.Z.s <- ggplot(mt, aes(x= Z_UMAP_1, y=Z_UMAP_2, color = batch) ) +
  geom_point(size = 0.5) +
  theme_classic() +
  ggtitle("colored by sample") + 
  guides(colour = guide_legend(override.aes = list(size=3)), shape = guide_legend(override.aes = list(size=3)))

umap.Z.c <- ggplot(mt, aes(x= Z_UMAP_1, y=Z_UMAP_2, color = cluster) ) +
  geom_point(size = 0.5) +
  theme_classic() +
  ggtitle("colored by PRIMUS identified clusters") +
  guides(colour = guide_legend(override.aes = list(size=3)), shape = guide_legend(override.aes = list(size=3)) )

cowplot::plot_grid(umap.Z.s, umap.Z.g, umap.Z.c, nrow = 1)

图5 k=5 时 umap图

k =5 时，几个大类的细胞能够很好的区分开，而少部分数量很少的细胞则成散点样在大细胞群周围分布。若设置k =8 则：

adjustedRandIndex(fit$L, mt$celltype)
#>  [1] 0.6460315

随后可视化（k=8）

图6 k=8 时umap 图

k=8 时，明显聚类的效果参差，但显然，大细胞群还是很好的区分开了，但小部分细胞群仍有聚集。

结束语

本篇简要介绍了文献：Longitudinal single-cell RNA-seq analysis reveals stress-promoted chemoresistance in metastatic ovarian cancer中开发的新多样本整合R包PRIMUS。从出发点来说，该包解决了多样本异质性的问题，某些细胞群只在单个样本中有，而在另一个样本中无，这种情况下的样本整合。根据原文献补充数据上的描述，Seurat v3 以及Harmony在这种情况下处理效果一般，由此说明PRIMUS 整合的优势。

图7 文献附图S2 I

K值的选择在PRIMUS的流程中至关重要，作者提及了BIC方法选择K值，但具体如何操作，目前还未能找到踪迹，后续学习补充之。

To select the optimal k, we fitted PRIMUS for k = 1,2, …,25 with 10 different random initial parameter sets for each k, and k = 12 was selected on the basis of BIC (fig. S2A).BIC ： Bayesian information criterion

S3 A Legend： Selection of the number of cancer cell clusters based on Bayesian information criterion (BIC). The boxplot showing the BIC of models fitted for 𝑘𝑘 = 1, 2, . . . , 25 with 10 different random initial parameter sets for each 𝑘. The red line indicates the selected number of clusters (𝑘 = 12).

- END -

ixxmu / mp_duty

肿瘤异质性强？试试这个样本整合包！ #3062