CellChat细胞通讯分析（一）

CellChat细胞通讯分析（一） by 生信菜鸟团

曾老师在【细胞通讯分析的背景知识】一文中介绍到：不同细胞类型和组织中的细胞之间相互作用(CCI)，主要是通过各种分子包括离子、代谢物、整联蛋白、受体、连接蛋白、结构蛋白、配体和细胞外基质的分泌蛋白来实现的。一些分子如激素、生长因子、趋化因子、细胞因子和神经递质等配体介导细胞之间的信息交流(CCC)。人们可以从基因表达中推断出CCI和CCC，尤其是在单细胞转录组测序之后，以RNA为代表的细胞信息之间的相互作用更能够对其进行解析。

细胞之间的相互通讯极为复杂，一般由激素、生长因子、趋化因子、细胞因子和神经递质等配体介导细胞之间的信息交流，主要分为自分泌，旁分泌，近分泌和内分泌等。当前基于单细胞转录组测序，进行细胞通讯分析越来越常见。细胞通讯的方法主要有CellphoneDB（基于Python），Cellchat（基于R），iTALK（基于R），NicheNet等方法。生信技能树，生信菜鸟团和单细胞天地写了大量关于细胞通讯的帖子，例如：

我最近因为课题需要，仔细阅读了CellChat的官方文档。虽然CellChat的文档今年刚更新，但是我在实战中还是遇到了很多Bug......另外，在深入理解这个包的函数和分析思维逻辑之后，感触很多，由此萌生了写此系列的想法：

一方面帮助大家debug，少走一些弯路；
另一方面，分享我对CellChat的一些（狭隘的）理解，以及使用过程发现的小技巧。

在生物学知识以外，希望能给需要的人一些启发。

一. CellChat介绍

CellChat R包于2021年1月发表在Nature communication上，论文题为《Inference and analysis of cell-cell communication using CellChat》，DOI: 10.1038/s41467-021-21246-9. 当前引用量400+。虽然引用量低于隔壁的CellphoneDB，但CellChat基于R语言环境，和Seurat衔接比较好，同时其具备丰富的可视化函数（颜值还不低），所以当前的热度也很高。

实际上，无论是CellphoneDB还是CellChat，其基于单细胞转录组数据推断细胞间通讯，均是利用对受配体的先验知识，即基于当前已知的/已证明的互作的受配体对，对细胞与细胞之间的通讯情况进行建模，来预测细胞A和细胞B的受配体通讯情况，从而构建细胞间通讯网络。在此基础上CellChat为进一步的数据探索、分析和可视化提供了大量的函数。

值得注意的是，CellChat采用了一种自上而下的方法，即从宏观角度为始，然后在信号机制上细化它的更详细的细节（微观解读），以识别不同层次的信号变化，包括细胞-细胞通讯的一般原则和功能失调的细胞群/信号通路/配体受体。简言之，CellChat回答三个角度的问题：

首先，哪些细胞亚群和哪些细胞亚群互作（包括细胞A与自己本身，即为自分泌），互作的数量和强度;
这里的互作数量实际就是细胞A和细胞B存在的配受体对，总共存在几对；其强度可以理解为对应配受体对通讯概率的累和。
其次，细胞亚群A和细胞亚群B存在哪些通讯相关的通路，而本质上，配受体对应的基因构成了通讯通路；
因此，第三个问题就是回答亚群A和细胞亚群B存在哪些互作的配受体，其通讯概率多大。

在运行CellChat的基本流程之后得到了细胞间通讯网络，从这里开始，围绕这三个问题，作者给出了一系列的代码和可视化函数。带着这种自上而下，从细胞，到通路，再到基因（受配体对）的思维，我们能够更好的理解作者的分析思路和代码思维。首先第一步，安装CellChat R包：

CellChat 的github：https://github.com/sqjin/CellChat
数据库官网：http://www.cellchat.org/
R包CellChat的安装：

remotes::install_github("sqjin/CellChat")

二. 利用CellChat进行细胞间通讯的推断与分析

Part I：CellChat对象的数据输入、处理和初始化

CellChat需要输入两部分数据：

一是单细胞的基因表达数据（例如用Seurat包的normalizeData后的data数据，表达矩阵储存在seurat.data@assays$RNA@data）；
二是单细胞数据的meta.data，其中包含细胞类型的注释信息。

下面加载Seurat自带的PMBC单细胞数据作为演示：

library(CellChat)
library(patchwork)
library(Seurat)
library(SeuratData)
# AvailableData()
# InstallData("pbmc3k")
options(stringsAsFactors = FALSE)
data("pbmc3k")
pbmc3k

Step1. 构建cellchat对象

#pbmc3k里的seurat_annotations有一些NA注释，过滤掉
data.input = pbmc3k@assays$RNA@data
meta.data =  pbmc3k@meta.data
meta.data = meta.data[!is.na(meta.data$seurat_annotations),]
data.input = data.input[,row.names(meta.data)]

#设置因子水平
meta.data$seurat_annotations = factor(meta.data$seurat_annotations,
                                      levels = c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", 
                                                 "FCGR3A+ Mono", "NK", "DC", "Platelet"))

### 1.2 Create a CellChat object
cellchat <- createCellChat(object = data.input, 
                           meta = meta.data, 
                           group.by = "seurat_annotations")

和Seurat一样，如有需要的话，可在CellChat对象的meta插槽中添加表型信息，以及更改当前默认的idents：

### 1.3 可在cellchat对象的meta插槽中添加表型信息
# 添加meta.data信息
cellchat <- addMeta(cellchat, meta = meta.data)

# 设置默认的labels
levels(cellchat@idents) # show factor levels of the cell labels
cellchat <- setIdent(cellchat, ident.use = "new.labels") 
groupSize <- as.numeric(table(cellchat@idents)) # number of cells in each cell group

Step2. 加载CellChatDB数据库

CellChatDB数据库是一个人工管理的数据库，包含了人类和小鼠的文献支持的配体-受体相互作用。

小鼠包含2021种已验证的分子相互作用，包括60%的分泌自分泌/旁分泌信号互作、21%的细胞外基质(ECM)受体互作和19%的细胞-细胞通讯互作。
人类包含1939个已验证的分子相互作用，包括61.8%的旁分泌/自分泌信号互作、21.7%的细胞外基质(ECM)受体互作和16.5%的细胞-细胞通讯互作。

当然用户也可以通过添加自己精心策划的配体-受体对来更新CellChatDB（一般用的比较少），可参考https://htmlpreview.github.io/?https://github.com/sqjin/CellChat/blob/master/tutorial/Update-CellChatDB.html。

### 1.4 加载CellChat受配体数据库
CellChatDB <- CellChatDB.human # use CellChatDB.mouse if running on mouse data
showDatabaseCategory(CellChatDB)

# Show the structure of the database
dplyr::glimpse(CellChatDB$interaction)

> dplyr::glimpse(CellChatDB$interaction)
Rows: 1,939
Columns: 11
$ interaction_name   <chr> "TGFB1_TGFBR1_TGFBR2", "TGFB2_TGFBR1_TGFBR2", "TGFB3_TGFBR1_TGFBR2", "TGFB1_ACVR1B_TGFBR2", "TGFB1~
$ pathway_name       <chr> "TGFb", "TGFb", "TGFb", "TGFb", "TGFb", "TGFb", "TGFb", "TGFb", "TGFb", "TGFb", "TGFb", "TGFb", "B~
$ ligand             <chr> "TGFB1", "TGFB2", "TGFB3", "TGFB1", "TGFB1", "TGFB2", "TGFB2", "TGFB3", "TGFB3", "TGFB1", "TGFB2",~
$ receptor           <chr> "TGFbR1_R2", "TGFbR1_R2", "TGFbR1_R2", "ACVR1B_TGFbR2", "ACVR1C_TGFbR2", "ACVR1B_TGFbR2", "ACVR1C_~
$ agonist            <chr> "TGFb agonist", "TGFb agonist", "TGFb agonist", "TGFb agonist", "TGFb agonist", "TGFb agonist", "T~
$ antagonist         <chr> "TGFb antagonist", "TGFb antagonist", "TGFb antagonist", "TGFb antagonist", "TGFb antagonist", "TG~
$ co_A_receptor      <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""~
$ co_I_receptor      <chr> "TGFb inhibition receptor", "TGFb inhibition receptor", "TGFb inhibition receptor", "TGFb inhibiti~
$ evidence           <chr> "KEGG: hsa04350", "KEGG: hsa04350", "KEGG: hsa04350", "PMID: 27449815", "PMID: 27449815", "PMID: 2~
$ annotation         <chr> "Secreted Signaling", "Secreted Signaling", "Secreted Signaling", "Secreted Signaling", "Secreted ~
$ interaction_name_2 <chr> "TGFB1 - (TGFBR1+TGFBR2)", "TGFB2 - (TGFBR1+TGFBR2)", "TGFB3 - (TGFBR1+TGFBR2)", "TGFB1 - (ACVR1B+~

# use a subset of CellChatDB for cell-cell communication analysis
#CellChatDB.use <- subsetDB(CellChatDB, search = "Secreted Signaling") # use Secreted Signaling
CellChatDB.use <- CellChatDB # simply use the default CellChatDB
cellchat@DB <- CellChatDB.use

Step3. 对表达数据进行预处理

为了推断细胞状态特异性通讯，我们在一个细胞群中识别过表达的配体或受体，然后在配体或受体过度表达时识别过表达的配体-受体相互作用。我们还提供了一种函数，将基因表达数据投影到蛋白质-蛋白质相互作用(PPI)网络。具体地说，一个扩散过程是用来平滑基因表达值基于他们的邻居定义在一个高置信度实验验证的蛋白质-蛋白质网络。当分析浅层测序深度的单细胞数据时，该函数是有用的，因为投影减少了信号基因的dropout效应，特别是配体/受体亚基可能的零表达。有人可能会担心这种扩散过程可能带来的人工制品，然而，它只会带来非常弱的通讯。用户也可以跳过此步骤并设置raw。在computeCommunProb()函数中使用= TRUE。

### 1.5 对表达数据进行预处理，用于细胞间通讯分析
# subset the expression data of signaling genes for saving computation cost
cellchat <- subsetData(cellchat) # This step is necessary even if using the whole database
future::plan("multiprocess", workers = 1) # do parallel

cellchat <- identifyOverExpressedGenes(cellchat)
cellchat <- identifyOverExpressedInteractions(cellchat)

# project gene expression data onto PPI (Optional: when running it, USER should set `raw.use = FALSE` in the function `computeCommunProb()` in order to use the projected data)
# cellchat <- projectData(cellchat, PPI.human)

Part II：细胞-细胞通讯网络的推断

CellChat通过给每个相互作用赋一个概率值并进行排列测试来推断生物意义的细胞-细胞通讯。CellChat通过整合基因表达和信号配体、受体及其辅助因子之间相互作用的先验知识，利用质量作用定律来建模细胞-细胞间通讯的概率。推断出的配体-受体对的数目显然取决于计算每个细胞组平均基因表达的方法。默认情况下，CellChat使用一种统计上稳健的平均值方法，叫做trimean，它产生的交互比其他方法少。然而，我们发现，CellChat在预测更强的交互方面表现得很好，这对缩小交互范围进行进一步的实验验证非常有帮助。在computeCommunProb中，我们提供了使用其他方法(如5%和10%截断平均值)来计算平均基因表达的选项。值得注意的是，trimean近似于25%的截断平均值，这意味着如果一组中表达细胞的百分比小于25%，那么平均基因表达为零。要使用10%的截断平均值，用户可以设置type = "truncatedMean"和trim = 0.1。computeAveExpr函数可以帮助检查信号基因的平均表达，如computeAveExpr(cellchat, features = c("CXCL12"，"CXCR4")， type = "truncatedMean"， trim = 0.1)。在分析未排序的单细胞转录组时，在细胞数量丰富的群体比细胞数量稀少的群体更倾向于发出集体更强的信号的假设下，CellChat在概率计算中也可以考虑各细胞群体中细胞比例的影响。用户可以设置 population. size = TRUE.

Step4. 计算通讯概率，推断细胞通讯网络

如果所研究的生物过程中已知的信号通路无法预测，用户可以尝试truncatedMean来改变计算每个细胞组平均基因表达的方法。

对于population. size参数的解读：是否考虑所有测序细胞中每组细胞的比例，

population. size = FALSE：如果分析分选富集的单细胞，以消除潜在的人为因素细胞群的大小。
population. size = TRUE，如果分析非分选富集的单细胞转录组，原因是丰富的细胞群往往比稀少的细胞群发送集体更强的信号。

cellchat <- computeCommunProb(cellchat,population.size = F)
# Filter out the cell-cell communication if there are only few number of cells in certain cell groups
cellchat <- filterCommunication(cellchat, min.cells = 10)

Step5. 提取预测的细胞通讯网络为data frame

小技巧：如何取cellchat的子集？

### cellchat取子集
barcode.use = sample(row.names(cellchat@meta),100)
cellchat.subset = subsetCellChat(cellchat,cells.use = barcode.use)

作者提供了函数subsetCommunication去提取用户感兴趣的配受体对强度为data.frame：

注意：这里提取受配体信息为data.frame的tips是很有用的，例如借助这个data.frame数据，我们可以个性化展示/可视化某些特定的受配体/通路，具体用法见下期推文。

df.net <- subsetCommunication(cellchat) returns a data frame consisting of all the inferred cell-cell communications at the level of ligands/receptors. Set slot.name = "netP" to access the the inferred communications at the level of signaling pathways
```
#获取所有的配受体对以及其通讯概率
df.net <- subsetCommunication(cellchat)
head(df.net)
```
image-20220816174933476
```
#以通路为单位提取通讯信息
df.pathway = subsetCommunication(cellchat,slot.name = "netP")
```
image-20220816175233270

df.net <- subsetCommunication(cellchat, sources.use = c(1,2), targets.use = c(4,5)) gives the inferred cell-cell communications sending from cell groups 1 and 2 to cell groups 4 and 5.

levels(cellchat@idents)
#[1] "Naive CD4 T"  "Memory CD4 T" "CD14+ Mono"   "B"            "CD8 T"        "FCGR3A+ Mono" "NK"           "DC"          
#[9] "Platelet"

这里的**source.use = c(1)**指的是Naive CD4 T，2和3分别对应Memory CD4 T和CD14+ Mono：

df.net <- subsetCommunication(cellchat, sources.use = c(1), targets.use = c(2,3))
head(df.net)

df.net <- subsetCommunication(cellchat, signaling = c("MIF", "MHC-I")) gives the inferred cell-cell communications mediated by signaling MIF and MHC-I.
```
df.net <- subsetCommunication(cellchat, signaling = c("MIF", "MHC-I"))
head(df.net)
```
image-20220816175748824

Step6. 在信号通路水平推断细胞通讯

CellChat通过汇总与每个信号通路相关的所有配体-受体相互作用的通讯概率，计算信号通路层面的通讯概率。

注意：推断出的每个配体-受体对的细胞间通讯网络和每个信号通路分别存储在' net '和' netP '插槽中。

cellchat <- computeCommunProbPathway(cellchat)
head(cellchat@net)
head(cellchat@netP)

Step7. 计算加和的cell-cell通讯网络

我们可以通过计算links数或总结细胞通讯概率来计算加和的cell-cell通讯网络。用户还可以通过设置sources.use 和targets.use来计算细胞亚群之间的加和网络。

cellchat <- aggregateNet(cellchat)

然后，我们还可以可视化加和的细胞间通讯网络。例如，使用circle plot显示任意两个细胞亚群之间的通讯次数或总通讯强度(权重)：

groupSize <- as.numeric(table(cellchat@idents))
par(mfrow = c(1,2), xpd=TRUE)
netVisual_circle(cellchat@net$count, vertex.weight = groupSize,
                 weight.scale = T, label.edge= F,
                 title.name = "Number of interactions")
netVisual_circle(cellchat@net$weight, vertex.weight = groupSize,
                 weight.scale = T, label.edge= F,
                 title.name = "Interaction weights/strength")

由于细胞间通讯网络的复杂性，我们可以对每个细胞亚群发出的信号进行检测。这里我们还控制参数edge.weight.max，以便我们可以比较不同网络之间的边权值：

mat <- cellchat@net$weight
par(mfrow = c(3,3), xpd=TRUE)
for (i in 1:nrow(mat)) {
  mat2 <- matrix(0, nrow = nrow(mat), ncol = ncol(mat), dimnames = dimnames(mat))
  mat2[i, ] <- mat[i, ]
  netVisual_circle(mat2, vertex.weight = groupSize,
                   weight.scale = T, edge.weight.max = max(mat),
                   title.name = rownames(mat)[i])
}

从这里开始，以上代码和函数均为得到细胞通讯网络。接下来，作者围绕细胞层面，通路层面和基因层面等，开发了一系列的可视化函数供用户进行数据探索。有一些图表和代码上的理解，以及存在一些Bug，我们下期再聊。

- END -

ixxmu / mp_duty