这个bulk RNA-seq反卷积工具，你可能还不知道 by 生信菜鸟团

引入

今天分享的是一个bulk RNA-seq的反卷积工具 BayesPrism。

Chu, T., Wang, Z., Pe’er, D. et al. Cell type and gene expression deconvolution with BayesPrism enables Bayesian integrative analysis across bulk and single-cell RNA sequencing in oncology. Nat Cancer 3, 505–517 (2022). https://doi.org/10.1038/s43018-022-00356-3

最初看到这个工具是在BioArt 公众号上（https://mp.weixin.qq.com/s/bQe41r0a20DI1_cFf_hn9A），当时并没有过多深入了解，对其印象局限于是一个借助于单细胞分群来推断Bulk RNAseq 细胞分群的工具。近期，健明老师突然提到这个工具，也促使我对这个R包感兴趣起来。就反卷积算法而言，正像之前那篇推文中提到的，以往算法存在局限性：

第一，由于单细胞和bulk 的RNA-seq通常基于不同的实验方法和测序平台，两者之间存在较大的技术性批次效应（technical batch effects）；第二，由于肿瘤异质性，不同样本间的基因表达差异较大，单细胞的参考矩阵与实际bulk RNA-seq中的基因表达往往存在生物学差异（biological variation）。

而本篇将要介绍的BayesPrism（贝叶斯棱镜）原理在该推文中概括如下：

贝叶斯棱镜仅把参考矩阵作为先验信息（prior information），在给定bulk RNA-seq数据后对该bulk RNA-seq中实际的参考矩阵进行建模。贝叶斯棱镜求得
1）每一例bulk样本中的每种细胞的百分比和 2）每种细胞的每个基因的reads数的联合后验概率分布。通过吉布斯采样（Gibbs sampling），可对1和2求得各自的边际（marginal）分布（图1）。由于对bulk RNA-seq中实际的参考矩阵进行建模并将其边际化，贝叶斯棱镜能对抗单细胞RNA-seq提取出的参考矩阵与bulk RNA-seq中的实际的参考矩阵的差别（robustness特性）。

图1. 流程图

原理大致如此，那么算法实现该当何如？

BayesPrism 安装与使用

安装

在作者的论文末尾，已经提供了BayesPrism github 的地址 https://github.com/Danko-Lab/BayesPrism.git.。正如以往github 包的安装方法,其也是通过devtools安装

library("devtools");
install_github("Danko-Lab/BayesPrism/BayesPrism")

如果说你在服务器上安装也遇到了我这样的安装问题（图2）

图2. devtools 安装报错

推荐解决方式是github 页面依次点击：Code → Download Zip → 上传服务器 → 解压 zip → 找到下级目录[BayesPrism](https://github.com/Danko-Lab/BayesPrism/tree/main/BayesPrism)→ 对其进行 Zip 压缩 → 以下代码：

devtools::install_local('./BayesPrism-main/BayesPrism.zip')

一般情况下也不会遇到这种问题，当然如果说你要是连github也进不去，那……我也没办法。

使用

在BayesPrism的github 文档中提供了演示数据，如果是devtools 安装的，则需要自行下载演示数据。如果是下载zip 文件后的，在你的zip 文件夹解压后也能找到该演示数据。以下我将使用该演示数据进行探索BayesPrism

导入演示数据

作者这里给定的Bulk RNAseq 数据来源于 TCGA-GBM, scRNA-seq 来源于Yuan的scRNA-seq数据集（https://www.yuque.com/u21872605/nils1z/dbsrgqum7g00db5o）。

suppressWarnings(library(BayesPrism))
#> Loading required package: snowfall
#> Loading required package: snow
#> Loading required package: NMF
#> Loading required package: registry
#> Loading required package: rngtools
#> Loading required package: cluster
#> NMF - BioConductor layer [OK] | Shared memory capabilities [NO: bigmemory] | Cores 127/128
#>   To enable shared memory capabilities, try: install.extras('
#> NMF
#> ')
load("BayesPrism-main/tutorial.dat/tutorial.gbm.rdata")
ls()
#> [1] "bk.dat"            "cell.state.labels" "cell.type.labels"  "sc.dat"

rdata 文件包含运行 BayesPrism 所需的四个对象。

bk.dat：批量 RNA-seq 表达的逐个基因原始计数矩阵。rownames是批量样本 ID，而 colnames是基因名称/ID。

dim(bk.dat)
#> [1]   169 60483
head(rownames(bk.dat))
#> [1] "TCGA-06-2563-01A-01R-1849-01" "TCGA-06-0749-01A-01R-1849-01" "TCGA-06-5418-01A-01R-1849-01" "TCGA-06-0211-01B-01R-1849-01"
#> [5] "TCGA-19-2625-01A-01R-1850-01" "TCGA-19-4065-02A-11R-2005-01"
head(colnames(bk.dat))
#> [1] "ENSG00000000003.13" "ENSG00000000005.5"  "ENSG00000000419.11" "ENSG00000000457.12" "ENSG00000000460.15" "ENSG00000000938.11"

sc.dat: 大量 RNA-seq 表达的逐个基因原始计数矩阵。rownames是批量细胞 ID，而 colnames是基因名称/ID。

dim(sc.dat)
#> [1] 23793 60294
head(rownames(sc.dat))
#> [1] "PJ016.V3" "PJ016.V4" "PJ016.V5" "PJ016.V6" "PJ016.V7" "PJ016.V8"
head(colnames(sc.dat))
#> [1] "ENSG00000130876.10" "ENSG00000134438.9"  "ENSG00000269696.1"   "ENSG00000230393.1"  "ENSG00000266744.1"  "ENSG00000202281.1"

cell.type.labels是一个字符向量，其长度与nrow(sc.dat)表示参考中每个单元格的单元格类型相同。

sort(table(cell.type.labels))
#> cell.type.labels
#>       tcell       oligo    pericyte endothelial     myeloid       tumor 
#>          67         160         489         492        2505       20080

cell.state.labels是与表示参考中每个细胞的细胞状态相同长度的特征向量。

sort(table(cell.state.labels))
#> cell.state.labels
#> PJ017-tumor-6 PJ032-tumor-5     myeloid_8 PJ032-tumor-4 PJ032-tumor-3 
#>            22            41            49            57            62 
#> PJ017-tumor-5         tcell PJ032-tumor-2 PJ030-tumor-5     myeloid_7 
#>            64            67            72            73            75 
#> PJ035-tumor-8 PJ017-tumor-4 PJ017-tumor-3 PJ017-tumor-2 PJ017-tumor-0 
#>            81            83            89           101           107 
#> PJ017-tumor-1 PJ018-tumor-5     myeloid_6     myeloid_5 PJ035-tumor-7 
#>           107           113           130           141           150 
#>         oligo PJ048-tumor-8 PJ032-tumor-1 PJ032-tumor-0 PJ048-tumor-7 
#>           160           169           171           195           228 
#> PJ025-tumor-9 PJ035-tumor-6 PJ048-tumor-6 PJ016-tumor-6 PJ018-tumor-4 
#>           236           241           244           261           262 
#>     myeloid_4 PJ048-tumor-5 PJ048-tumor-4 PJ016-tumor-5 PJ025-tumor-8 
#>           266           277           303           308           319 
#> PJ048-tumor-3 PJ035-tumor-5 PJ030-tumor-4 PJ018-tumor-3 PJ016-tumor-4 
#>           333           334           348           361           375 
#> PJ025-tumor-7     myeloid_3 PJ035-tumor-4     myeloid_2 PJ025-tumor-6 
#>           381           382           385           386           397 
#> PJ018-tumor-2 PJ030-tumor-3 PJ016-tumor-3 PJ025-tumor-5 PJ030-tumor-2 
#>           403           419           420           421           425 
#> PJ018-tumor-1 PJ035-tumor-3 PJ048-tumor-2 PJ018-tumor-0 PJ048-tumor-1 
#>           429           435           437           444           463 
#> PJ035-tumor-2 PJ035-tumor-1 PJ016-tumor-2 PJ030-tumor-1      pericyte 
#>           471           474           481           482           489 
#>   endothelial PJ035-tumor-0 PJ025-tumor-4     myeloid_1 PJ048-tumor-0 
#>           492           512           523           526           545 
#>     myeloid_0 PJ030-tumor-0 PJ025-tumor-3 PJ016-tumor-1 PJ016-tumor-0 
#>           550           563           601           619           621 
#> PJ025-tumor-2 PJ025-tumor-1 PJ025-tumor-0 
#>           630           941           971

table 看一下cell.type.labels和cell.state.labels的关系。

table(cbind.data.frame(cell.state.labels, cell.type.labels))
#>                  cell.type.labels
#> cell.state.labels endothelial myeloid oligo pericyte tcell tumor
#>     endothelial           492       0     0        0     0     0
#>     myeloid_0               0     550     0        0     0     0
#>     myeloid_1               0     526     0        0     0     0
#>     myeloid_2               0     386     0        0     0     0
#>     myeloid_3               0     382     0        0     0     0
#>     myeloid_4               0     266     0        0     0     0
#>     myeloid_5               0     141     0        0     0     0
#>     myeloid_6               0     130     0        0     0     0
#>     myeloid_7               0      75     0        0     0     0
#>     myeloid_8               0      49     0        0     0     0
#>     oligo                   0       0   160        0     0     0
#>     pericyte                0       0     0      489     0     0
#>     PJ016-tumor-0           0       0     0        0     0   621
#>     PJ016-tumor-1           0       0     0        0     0   619
#>     PJ016-tumor-2           0       0     0        0     0   481
#>     PJ016-tumor-3           0       0     0        0     0   420
#>     PJ016-tumor-4           0       0     0        0     0   375
#>     PJ016-tumor-5           0       0     0        0     0   308
#>     PJ016-tumor-6           0       0     0        0     0   261
#>     PJ017-tumor-0           0       0     0        0     0   107
#>     PJ017-tumor-1           0       0     0        0     0   107
#>     PJ017-tumor-2           0       0     0        0     0   101
#>     PJ017-tumor-3           0       0     0        0     0    89
#>     PJ017-tumor-4           0       0     0        0     0    83
#>     PJ017-tumor-5           0       0     0        0     0    64
#>     PJ017-tumor-6           0       0     0        0     0    22
#>     PJ018-tumor-0           0       0     0        0     0   444
#>     PJ018-tumor-1           0       0     0        0     0   429
#>     PJ018-tumor-2           0       0     0        0     0   403
#>     PJ018-tumor-3           0       0     0        0     0   361
#>     PJ018-tumor-4           0       0     0        0     0   262
#>     PJ018-tumor-5           0       0     0        0     0   113
#>     PJ025-tumor-0           0       0     0        0     0   971
#>     PJ025-tumor-1           0       0     0        0     0   941
#>     PJ025-tumor-2           0       0     0        0     0   630
#>     PJ025-tumor-3           0       0     0        0     0   601
#>     PJ025-tumor-4           0       0     0        0     0   523
#>     PJ025-tumor-5           0       0     0        0     0   421
#>     PJ025-tumor-6           0       0     0        0     0   397
#>     PJ025-tumor-7           0       0     0        0     0   381
#>     PJ025-tumor-8           0       0     0        0     0   319
#>     PJ025-tumor-9           0       0     0        0     0   236
#>     PJ030-tumor-0           0       0     0        0     0   563
#>     PJ030-tumor-1           0       0     0        0     0   482
#>     PJ030-tumor-2           0       0     0        0     0   425
#>     PJ030-tumor-3           0       0     0        0     0   419
#>     PJ030-tumor-4           0       0     0        0     0   348
#>     PJ030-tumor-5           0       0     0        0     0    73
#>     PJ032-tumor-0           0       0     0        0     0   195
#>     PJ032-tumor-1           0       0     0        0     0   171
#>     PJ032-tumor-2           0       0     0        0     0    72
#>     PJ032-tumor-3           0       0     0        0     0    62
#>     PJ032-tumor-4           0       0     0        0     0    57
#>     PJ032-tumor-5           0       0     0        0     0    41
#>     PJ035-tumor-0           0       0     0        0     0   512
#>     PJ035-tumor-1           0       0     0        0     0   474
#>     PJ035-tumor-2           0       0     0        0     0   471
#>     PJ035-tumor-3           0       0     0        0     0   435
#>     PJ035-tumor-4           0       0     0        0     0   385
#>     PJ035-tumor-5           0       0     0        0     0   334
#>     PJ035-tumor-6           0       0     0        0     0   241
#>     PJ035-tumor-7           0       0     0        0     0   150
#>     PJ035-tumor-8           0       0     0        0     0    81
#>     PJ048-tumor-0           0       0     0        0     0   545
#>     PJ048-tumor-1           0       0     0        0     0   463
#>     PJ048-tumor-2           0       0     0        0     0   437
#>     PJ048-tumor-3           0       0     0        0     0   333
#>     PJ048-tumor-4           0       0     0        0     0   303
#>     PJ048-tumor-5           0       0     0        0     0   277
#>     PJ048-tumor-6           0       0     0        0     0   244
#>     PJ048-tumor-7           0       0     0        0     0   228
#>     PJ048-tumor-8           0       0     0        0     0   169
#>     tcell                   0       0     0        0    67     0

注：

BayesPrism 将对mixture(代指bulk RNA-seq)和 reference （代指scRNA-seq）之间共享的基因执行反卷积，即通过交叉 colnames(mixture) 和 colnames(reference)。因此，请确保使用一致的基因注释（TCGA RNA-seq 使用 GENCODE v22）。
建议使用未规范化和未转换（ unnormalized and untransformed）的count data。当原始计数不可用时，线性归一化（例如 TPM、RPM、RPKM、FPKM）也是可以接受的，因为 BayesPrism 对reference和mixture之间的线性乘法差异具有鲁棒性（换个单词即Robust）。理想情况下，如果使用规范化数据，最好提供参考数据和通过相同方法转换的批量数据。应避免log转换。
所有细胞群应具有一定的细胞数量，例如 >20 或 >50。

在这里，关于cell.type.labels和cell.state.labels我个人理解是后者可以是前者的亚群，简单来说，cell.type.labels是第一步分群的结果，cell.state.labels是在第一步分群的基础上，进一步根据群内异质性继续细分亚群的结果。作者对cell.type.labels和cell.state.labels的描述（翻译 By 彩云小译）：

细胞类型和细胞状态的定义可能有些随意（类似于为 scRNA-seq 分配细胞类型的问题）并且取决于感兴趣的问题。它们的定义取决于我们关注的细胞（the granularity we aim at ）和 scRNA-seq 数据中cell.type.labels 的置信度。通常，一个好的经验法则如下。
1)将细胞类型定义为比其他细胞类型具有足够数量的显著差异表达基因的细胞群，例如大于50甚至100个。对于转录过于相似的簇，我们建议将它们视为细胞状态，在最终的吉布斯取样之前将对它们进行总结。因此，细胞状态通常适合于在表型流形上形成连续统的细胞，而不是不同的簇。
2)为显著异质性的细胞类型(如恶性细胞)定义多种细胞状态，并对其转录进行解卷积。

细胞类型和状态标签的 QC

作者这一块说的很复杂，其实简单来说，cell.type.labels和cell.state.labels要有足够的区分度，如果区分度不够或者过了，那么各个群在聚类图中，可能会聚集在一起。如果出现大部分cluster 具有明显的相似性，可以考虑merge，或者以更低的resolution进行降维聚类（或者说higher granularity）。如果重新聚类或者merge 不合适的话，也可考虑删除，因此，作者建议，首先使用plot.cor.phi函数查看cell.type.labels和cell.state.labels的特征

plot.cor.phi (input=sc.dat,
              input.labels=cell.state.labels,
              title="cell state correlation",
              #specify pdf.prefix if need to output to pdf
              #pdf.prefix="gbm.cor.cs", 
              cexRow=0.2, cexCol=0.2,
              margins=c(2,2))

plot.cor.phi (input=sc.dat, 
              input.labels=cell.type.labels, 
              title="cell type correlation",
              #specify pdf.prefix if need to output to pdf
              #pdf.prefix="gbm.cor.ct",
              cexRow=0.5, cexCol=0.5,
)

过滤离群基因

在单细胞测序中，大量的核糖体基因或者线粒体基因不参与指导分析细胞功能，但其可能主导分布并使推断产生偏差。这些基因在区分细胞类型方面通常不能提供信息，并且可能是大的假变异的来源。因此，它们对反卷积是不利的。因此，有必要移除这些基因。

用户可以从 scRNA-seq 参考可视化异常基因的分布。计算所有细胞类型的平均表达和每个基因的表达，以及它们的细胞类型特异性评分。
从 scRNA-seq 数据中可视化并确定异常基因

#查看单细胞数据的离群基因
sc.stat <- plot.scRNA.outlier(
  input=sc.dat, #make sure the colnames are gene symbol or ENSMEBL ID 
  cell.type.labels=cell.type.labels,
  species="hs", #currently only human(hs) and mouse(mm) annotations are supported
  return.raw=TRUE #return the data used for plotting. 
  #pdf.prefix="gbm.sc.stat" specify pdf.prefix if need to output to pdf
)
head(sc.stat)  
#>                    exp.mean.log  max.spec other_Rb  chrM  chrX  chrY    Rb   Mrp   act    hb MALAT1
#> ENSG00000130876.10    -13.59828 0.8473029    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  FALSE
#> ENSG00000134438.9     -16.08255 0.8914740    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  FALSE
#> ENSG00000269696.1     -13.12002 0.7509754    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  FALSE
#> ENSG00000230393.1     -13.97772 0.5654292    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  FALSE
#> ENSG00000266744.1     -17.96174 0.4733715    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  FALSE
#> ENSG00000202281.1     -18.42068 0.1666667    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  FALSE

注：1. 核糖体蛋白质基因通常表现出高平均表达和低细胞类型特异性得分(局限在右下角)。
sc.stat 显示每个基因的归一化平均表达(x 轴)和最大特异性(y 轴)的对数，以及每个基因是否属于一个潜在的异常值类别。Other _ Rb 是由核糖体假基因组成的一组精选基因。

bk.stat <- plot.bulk.outlier(
  bulk.input=bk.dat,#make sure the colnames are gene symbol or ENSMEBL ID 
  sc.input=sc.dat, #make sure the colnames are gene symbol or ENSMEBL ID 
  cell.type.labels=cell.type.labels,
  species="hs", #currently only human(hs) and mouse(mm) annotations are supported
  return.raw=TRUE
  #pdf.prefix="gbm.bk.stat" specify pdf.prefix if need to output to pdf
)
head(bk.stat)
#>                    exp.mean.log  max.spec other_Rb  chrM  chrX  chrY    Rb   Mrp   act    hb MALAT1
#> ENSG00000000003.13    -9.263877 0.4825322    FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  FALSE
#> ENSG00000000005.5    -13.336689 0.9736083    FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  FALSE
#> ENSG00000000419.11   -10.455993 0.2275788    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  FALSE
#> ENSG00000000457.12   -11.298589 0.2169448    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  FALSE
#> ENSG00000000460.15   -11.648474 0.2920256    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  FALSE
#> ENSG00000000938.11   -11.152720 0.7347018    FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  FALSE

图6. 查看Bulk RNAseq的离群基因

查看完之后，将从选定的群体中移除基因。注意，当参考基因和混合基因的性别不同时，建议排除 chrX 和 chrY 基因。也删除低转录基因，因为测量这些基因的转录倾向于噪声。去除这些基因也可以加快计算速度。

sc.dat.filtered <- cleanup.genes (input=sc.dat,
                                   input.type="count.matrix",
                                   species="hs", 
                                   gene.group=c( "Rb","Mrp","other_Rb","chrM","MALAT1","chrX","chrY") ,
                                   exp.cells=5)
#> EMSEMBLE IDs detected.
#> number of genes filtered in each category: 
#>       Rb      Mrp other_Rb     chrM   MALAT1     chrX     chrY 
#>       89       78     1011       37        1     2464      594 
#> A total of  4214  genes from Rb Mrp other_Rb chrM MALAT1 chrX chrY  have been excluded 
#> A total of  24343  gene expressed in fewer than  5  cells have been excluded 
dim(sc.dat.filtered)
#> [1] 23793 31737

检查不同类型基因的基因表达的一致性。由于bulk RNA-seq 和scRNAseq data 通常由不同的实验方案收集，它们可能对不同类型的基因具有不同的敏感性。

plot.bulk.vs.sc (sc.input = sc.dat.filtered,
                 bulk.input = bk.dat
                 #pdf.prefix="gbm.bk.vs.sc" specify pdf.prefix if need to output to pdf
)
#> EMSEMBLE IDs detected.

图7. log2 表达bulk 和scRNA相关性

根据上图，可以发现：蛋白质编码基因是两个测定之间最一致的组。为了减少批量效应和加快计算速度，后续对蛋白质编码基因进行了反卷积处理。后续代码在所有基因上运行贝叶斯棱镜。二者结果是相似的。

根据 “蛋白编码基因”确定子集。

sc.dat.filtered.pc <-  select.gene.type (sc.dat.filtered,
                                          gene.type = "protein_coding")
#> EMSEMBLE IDs detected.
#> number of genes retained in each category: 
#> 
#> protein_coding 
#>          16148

根据‘signature gene’确定子集

或者，在细胞类型被定义为其中一些显示非常相似的转录或存在严重的批次效应的情况下，例如，reference 和mixture 来自不同的测定手段，选择signature gene是可行的。这是因为特征基因的选择可以丰富用于反卷积的基因信息，同时减少技术批量效应造成的噪声影响。

# Select marker genes (Optional)
# performing pair-wise t test for cell states from different cell types

diff.exp.stat <- get.exp.stat(sc.dat=sc.dat[,colSums(sc.dat>0)>3],# filter genes to reduce memory use
                              cell.type.labels=cell.type.labels,
                              cell.state.labels=cell.state.labels,
                              psuedo.count=0.1, #a numeric value used for log2 transformation. =0.1 for 10x data, =10 for smart-seq. Default=0.1.
                              cell.count.cutoff=50, # a numeric value to exclude cell state with number of cells fewer than this value for t test. Default=50.
                              n.cores=1 #number of threads
)

理想情况下，每种细胞类型最好选择足够数量的基因(> 50)。否则，可能会降低截止值。根据signature gene提取Count matrix subset ，可以运行以下代码

sc.dat.filtered.pc.sig <- select.marker (sc.dat=sc.dat.filtered.pc,
                                                  stat=diff.exp.stat,
                                                  pval.max=0.01,
                                                  lfc.min=0.1)
#> number of markers selected for each cell type: 
#> tumor :  8686 
#> myeloid :  575 
#> pericyte :  114 
#> endothelial :  244 
#> tcell :  123 
#> oligo :  86
dim(sc.dat.filtered.pc.sig)
#> [1] 23793  7874

注：如果使用signature gene 运行BayesPrism ，则在调用 new.prism 时使用 reference = sc.dat.filtered.pc.sig (见下文)。

构建Prism 对象

Prism对象包含运行 BayesPrism 所需的所有数据，即 scRNA-seq reference 矩阵、每行参考的 cell type and cell state labels 以及mixture（Bulk RNA-seq）

图8. Prism 对象示意图

当使用 scRNA-seq Count 矩阵作为input (推荐)时，用户需要指定 input.type = “ count. Matrix”。Type 的另一个选项是“ GEP”(gene expression profile) ，它是基因矩阵的细胞状态。当使用Bulk RNAseq的数据作为reference 时，应设置input.type ='GEP'。参数key是 cell.type.label 中对应于恶性细胞类型的字符向量。如无恶性细胞，即设置为 NULL，此时，所有类型的细胞将被 treated equally。

myPrism <- new.prism(
   reference=sc.dat.filtered.pc, 
   mixture=bk.dat,
   input.type="count.matrix", 
   cell.type.labels = cell.type.labels, 
   cell.state.labels = cell.state.labels,
   key="tumor",# 
   outlier.cut=0.01,
   outlier.fraction=0.1,
 )
#> number of cells in each cell state 
#> cell.state.labels
#> PJ017-tumor-6 PJ032-tumor-5     myeloid_8 PJ032-tumor-4 PJ032-tumor-3 PJ017-tumor-5         tcell PJ032-tumor-2 PJ030-tumor-5     myeloid_7 
#>            22            41            49            57            62            64            67            72            73            75 
#> PJ035-tumor-8 PJ017-tumor-4 PJ017-tumor-3 PJ017-tumor-2 PJ017-tumor-0 PJ017-tumor-1 PJ018-tumor-5     myeloid_6     myeloid_5 PJ035-tumor-7 
#>            81            83            89           101           107           107           113           130           141           150 
#>         oligo PJ048-tumor-8 PJ032-tumor-1 PJ032-tumor-0 PJ048-tumor-7 PJ025-tumor-9 PJ035-tumor-6 PJ048-tumor-6 PJ016-tumor-6 PJ018-tumor-4 
#>           160           169           171           195           228           236           241           244           261           262 
#>     myeloid_4 PJ048-tumor-5 PJ048-tumor-4 PJ016-tumor-5 PJ025-tumor-8 PJ048-tumor-3 PJ035-tumor-5 PJ030-tumor-4 PJ018-tumor-3 PJ016-tumor-4 
#>           266           277           303           308           319           333           334           348           361           375 
#> PJ025-tumor-7     myeloid_3 PJ035-tumor-4     myeloid_2 PJ025-tumor-6 PJ018-tumor-2 PJ030-tumor-3 PJ016-tumor-3 PJ025-tumor-5 PJ030-tumor-2 
#>           381           382           385           386           397           403           419           420           421           425 
#> PJ018-tumor-1 PJ035-tumor-3 PJ048-tumor-2 PJ018-tumor-0 PJ048-tumor-1 PJ035-tumor-2 PJ035-tumor-1 PJ016-tumor-2 PJ030-tumor-1      pericyte 
#>           429           435           437           444           463           471           474           481           482           489 
#>   endothelial PJ035-tumor-0 PJ025-tumor-4     myeloid_1 PJ048-tumor-0     myeloid_0 PJ030-tumor-0 PJ025-tumor-3 PJ016-tumor-1 PJ016-tumor-0 
#>           492           512           523           526           545           550           563           601           619           621 
#> PJ025-tumor-2 PJ025-tumor-1 PJ025-tumor-0 
#>           630           941           971 
#> Number of outlier genes filtered from mixture = 6 
#> Aligning reference and mixture... 
#> Nornalizing reference...

注：

Note that outlier.cut and outlier.fraction=0.1 filter genes in X whose expression fraction is greater than outlier.cut (Default=0.01) in more than outlier.fraction (Default=0.1) of bulk data.
Typically for dataset with reasonable quality control, very few genes will be filtered.
Removal of outlier genes will ensure that the inference will not be dominated by outliers, which sometimes may be resulted from poor QC in mapping.

运行 BayesPrism

图9. BayesPrism 流程示意

后续即运行BayesPrism控制 Gibbs 采样和优化的参数可以使用 Gibbs.control 和 opt.control 指定。?run.prism获取details。这里建议使用默认参数。（代码就一句）

bp.res <- run.prism(prism = myPrism, n.cores=50)
#> Run Gibbs sampling... 
#> Current time:  2023-01-08 22:46:55 
#> Estimated time to complete:  38mins 
#> Estimated finishing time:  2023-01-08 23:24:44 
#> Start run... 
#> R Version:  R version 4.2.0 (2022-04-22) 

#> snowfall 1.84-6.2 initialized (using snow 0.4-4): parallel execution on 50 CPUs.


#> Stopping cluster

#> Update the reference matrix ... 
#> snowfall 1.84-6.2 initialized (using snow 0.4-4): parallel execution on 50 CPUs.


#> Stopping cluster

#> Run Gibbs sampling using updated reference ... 
#> Current time:  2023-01-08 23:08:48 
#> Estimated time to complete:  17mins 
#> Estimated finishing time:  2023-01-08 23:24:50 
#> Start run... 
#> snowfall 1.84-6.2 initialized (using snow 0.4-4): parallel execution on 50 CPUs.


#> Stopping cluster

bp.res即是BayesPrism的结果。后续探讨bp.res的结果及可视化的问题

bp.res
#> Input prism info: 
#> Cell states in each cell type: 
#> $tumor
#>  [1] "PJ016-tumor-0" "PJ016-tumor-3" "PJ016-tumor-2" "PJ016-tumor-6" "PJ016-tumor-4" "PJ016-tumor-1" "PJ016-tumor-5" "PJ017-tumor-3"
#>  [9] "PJ017-tumor-2" "PJ017-tumor-1" "PJ017-tumor-0" "PJ017-tumor-4" "PJ017-tumor-5" "PJ017-tumor-6" "PJ018-tumor-4" "PJ018-tumor-2"
#> [17] "PJ018-tumor-3" "PJ018-tumor-1" "PJ018-tumor-0" "PJ018-tumor-5" "PJ025-tumor-9" "PJ025-tumor-2" "PJ025-tumor-4" "PJ025-tumor-3"
#> [25] "PJ025-tumor-0" "PJ025-tumor-1" "PJ025-tumor-8" "PJ025-tumor-7" "PJ025-tumor-5" "PJ025-tumor-6" "PJ030-tumor-4" "PJ030-tumor-0"
#> [33] "PJ030-tumor-1" "PJ030-tumor-5" "PJ030-tumor-3" "PJ030-tumor-2" "PJ032-tumor-0" "PJ032-tumor-1" "PJ032-tumor-5" "PJ032-tumor-2"
#> [41] "PJ032-tumor-4" "PJ032-tumor-3" "PJ035-tumor-2" "PJ035-tumor-3" "PJ035-tumor-6" "PJ035-tumor-1" "PJ035-tumor-0" "PJ035-tumor-4"
#> [49] "PJ035-tumor-5" "PJ035-tumor-8" "PJ035-tumor-7" "PJ048-tumor-0" "PJ048-tumor-3" "PJ048-tumor-4" "PJ048-tumor-2" "PJ048-tumor-6"
#> [57] "PJ048-tumor-1" "PJ048-tumor-8" "PJ048-tumor-7" "PJ048-tumor-5"

#> $myeloid
#> [1] "myeloid_2" "myeloid_5" "myeloid_3" "myeloid_7" "myeloid_0" "myeloid_6" "myeloid_4" "myeloid_1" "myeloid_8"
 
#> $pericyte
#> [1] "pericyte"

#> $endothelial
#> [1] "endothelial"

#> $tcell
#> [1] "tcell"

#> $oligo
#> [1] "oligo"


#> Identifier of the malignant cell type:  tumor 
#> Number of cell states:  73 
#> Number of cell types:  6 
#> Number of mixtures:  169 
#> Number of genes:  16145 

#> Initial cell type fractions: 
#>         tumor myeloid pericyte endothelial tcell oligo
#> Min.    0.199   0.004    0.000       0.015  0.00 0.000
#> 1st Qu. 0.702   0.051    0.014       0.036  0.00 0.010
#> Median  0.775   0.102    0.025       0.046  0.00 0.025
#> Mean    0.758   0.109    0.043       0.048  0.00 0.041
#> 3rd Qu. 0.848   0.145    0.048       0.060  0.00 0.050
#> Max.    0.952   0.578    0.503       0.133  0.01 0.329
#> Updated cell type fractions: 
#>         tumor myeloid pericyte endothelial tcell oligo
#> Min.    0.272   0.000    0.000       0.008 0.000 0.000
#> 1st Qu. 0.751   0.046    0.006       0.027 0.000 0.002
#> Median  0.815   0.090    0.014       0.039 0.000 0.011
#> Mean    0.801   0.098    0.030       0.041 0.001 0.029
#> 3rd Qu. 0.878   0.130    0.033       0.053 0.001 0.036
#> Max.    0.968   0.565    0.385       0.134 0.010 0.307

或者，如果想从1-3单独运行步骤4，可以执行以下操作:

# not run
bp.res.initial <- run.prism(prism = myPrism, n.cores=50, update.gibbs=FALSE)
    
bp.res.update <- update.theta (bp = bp.res.initial)

结语

以上是BayesPrism的全部实操内容，bp.res即是BayesPrism 的运算结果。至于如何理解BayesPrism的仔细原理，我非数学及计算机专业的就不强行解释了，感兴趣的可以深入系统学习数学理论。总的来说，在我这里，我只需要理解它要求的输入参数是经过一定处理的单细胞矩阵+细胞表型+bulk RNAseq矩阵，在BayesPrism 运算后能得到Bulk RNA-seq反卷积计算的数据即可。在本篇中，仅讨论了BayesPrism的数据准备及运算过程，尚未提及bp.res的后续处理，留待下期探讨。

- END -

ixxmu / mp_duty

这个bulk RNA-seq反卷积工具，你可能还不知道 #3113