拟南芥根系单细胞亚群类型鉴定

拟南芥根系单细胞亚群类型鉴定 by 生信技能树

考虑到咱们生信技能树粉丝对单细胞数据挖掘的需求，我开通了一个专栏《100个单细胞转录组数据降维聚类分群图表复现》，也亲自示范了几个，不过自己带娃，读博，时间精力有限，所以把剩余的90多个任务安排了学徒，实习生，学员。真的是太棒了，群策群力！

下面新晋小编“十年”的投稿

今天要介绍的文章是：《High-throughput single-cell transcriptome profiling of plant cell types》. Cell Report 2019 ，它的数据链接是：https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122687，值得一提的拟南芥根部单细胞转录组数据文章已经不少了，这个数据集也并不是质量最高的，选择它来讲解，仅仅是一个随机事件。

可以看到这个GSE122687数据集里面是10个样品，如下所示：

GSM3478124 Arabidopsis_root_Drop-seq_replicate_A
GSM3478125 Arabidopsis_root_Drop-seq_replicate_B
GSM3478126 Arabidopsis_root_Drop-seq_replicate_C
GSM3478127 Arabidopsis_root_Drop-seq_replicate_D
GSM3478128 Arabidopsis_root_Drop-seq_replicate_E
GSM3478129 Arabidopsis_root_Drop-seq_replicate_F
GSM3478130 Arabidopsis_root_Drop-seq_replicate_G
GSM3478131 Arabidopsis_root_Drop-seq_replicate_H
GSM3478132 Arabidopsis_root_Drop-seq_replicate_I
GSM3478133 Arabidopsis_root_Drop-seq_replicate_J

然后作者提供了这10个样品的表达矩阵文件，初步看起来并不是10x商业仪器出来数据：

 [1] "GSM3478124_replicate_A_error_detected_200genes.dge.txt.gz"
 [2] "GSM3478125_replicate_B_error_detected_200genes.dge.txt.gz"
 [3] "GSM3478126_replicate_C_error_detected_200genes.dge.txt.gz"
 [4] "GSM3478127_replicate_D_error_detected_200genes.dge.txt.gz"
 [5] "GSM3478128_replicate_E_error_detected_200genes.dge.txt.gz"
 [6] "GSM3478129_replicate_F_error_detected_200genes.dge.txt.gz"
 [7] "GSM3478130_replicate_G_error_detected_200genes.dge.txt.gz"
 [8] "GSM3478131_replicate_H_error_detected_200genes.dge.txt.gz"
 [9] "GSM3478132_replicate_I_error_detected_200genes.dge.txt.gz"
[10] "GSM3478133_replicate_J_error_detected_200genes.dge.txt.gz"

这个数据的研究目标是：单细胞RNA-seq可以用来描述植物的发育过程，并展示了它们应对蔗糖刺激的变化，并且文中描述了发生在内胚层发育过程中的转录组变化，并确定了近800个随着组织分化而动态表达的基因。

本次研究总共是12,198个单细胞，利用Drop-Seq对12,198个单个拟南芥根细胞进行t分布随机邻近嵌入(t-SNE)降维。使用Seurat将细胞聚类到17个生物学亚群。点表示单个单元格，并根据图例按指定的单元格类型和聚类进行着色。

作者列出来了详细的细胞亚群注释依据，可以看到每个亚群的特异性基因的高表达情况：

理解这些基因名字对应的拟南芥根部单细胞亚群，需要一定生物学背景知识哦！至少绝大部分做肿瘤研究的小伙伴肯定是没办法一眼就看出来这些基因的功能，通常呢，我们做肿瘤研究的单细胞数据，一般来说会选择初步很粗狂的定义大的细胞亚群，比如我常用的第一次分群是通用规则是：

immune (CD45+,PTPRC),
epithelial/cancer (EpCAM+,EPCAM),
stromal (CD10+,MME,fibo or CD31+,PECAM1,endo)

但是这个是植物学单细胞研究，所以我的经验基本上在这里完全是用不上的！不过我们前面讲解的单细胞转录组数据分析流程是没有问题的：可视化单细胞亚群的标记基因的5个方法，下面的5个基础函数相信大家都是已经烂熟于心了：

VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
FeaturePlot(pbmc, features = c("MS4A1", "CD79A"))
RidgePlot(pbmc, features = c("MS4A1", "CD79A"), ncol = 1)
DotPlot(pbmc, features = unique(features)) + RotatedAxis()
DoHeatmap(subset(pbmc, downsample = 100), features = features, size = 3)

开始复现

首先保证运行环境下面有一个 GSE122687文件夹，里面下载好了我们前面提到的10个文件，如下所示：

rm(list=ls())
options(stringsAsFactors = F) 
library(data.table)
getwd()
setwd("/Users/bio2/Desktop/GSE122687_At_root_10")
fs=list.files(path = 'GSE122687_RAW/',
              pattern = 'dge.txt.gz')
fs

数据链接是：https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122687

step1构建对象

sceList <-  lapply(fs, function(x){
  a=fread(file.path( 'GSE122687_RAW/',x),data.table = F )
  a[1:4,1:4]
  raw.data=a[,-1]
  rownames(raw.data)=a[,1]
  library(stringr)
  p=str_split(x,'_',simplify = T)[,3]
  print(p)
  sce <- CreateSeuratObject(counts = raw.data,project = p )
})

sceList
length(sceList)
sce.all <- merge(sceList[[1]], 
                 y= sceList[-1]
                 ) 
saveRDS(sce.all,file="sce.all_raw.rds")

step2进行多样品整合

因为是多个样品：

GSM3478124 Arabidopsis_root_Drop-seq_replicate_A
GSM3478125 Arabidopsis_root_Drop-seq_replicate_B
GSM3478126 Arabidopsis_root_Drop-seq_replicate_C
GSM3478127 Arabidopsis_root_Drop-seq_replicate_D
GSM3478128 Arabidopsis_root_Drop-seq_replicate_E
GSM3478129 Arabidopsis_root_Drop-seq_replicate_F
GSM3478130 Arabidopsis_root_Drop-seq_replicate_G
GSM3478131 Arabidopsis_root_Drop-seq_replicate_H
GSM3478132 Arabidopsis_root_Drop-seq_replicate_I
GSM3478133 Arabidopsis_root_Drop-seq_replicate_J

这里采用 SCT 方法：


# SCT法整合数据

# 1. SCTransform
for(i in 1:length(seob_list)){
  seob_list[[i]] <- SCTransform(
    seob_list[[i]], 
    variable.features.n = 3000,
    # vars.to.regress = c("percent.mt", "S.Score", "G2M.Score"),
    verbose = FALSE)
}

# 2. 数据整合（Integration）
## 选择要用于整合的基因
features <- SelectIntegrationFeatures(object.list = seob_list,
                                      nfeatures = 3000)
## 准备整合
seob_list <- PrepSCTIntegration(object.list = seob_list, 
                                anchor.features = features)
## 找 anchors，15到30分钟
anchors <- FindIntegrationAnchors(object.list = seob_list, 
                                  # reference = 3 # 当有多个样本时，制定一个作为参考可加快速度
                                  normalization.method = "SCT", # 选择方法“SCT”
                                  anchor.features = features
)
## 整合数据
seob <- IntegrateData(anchorset = anchors, 
                      normalization.method = "SCT")
DefaultAssay(seob) <- "integrated"

step3 降维聚类分群

## 降维分析----
rm(anchors,seob_list,features,i)
seob <- RunPCA(seob)
ElbowPlot(seob, ndims = 50)

## t-SNE 不是基于表达矩阵做的，而是基于PCA结果做的
seob <- RunTSNE(seob, 
                # 使用多少个PC，如果使用的是 SCTransform 建议多一些
                dims = 1:30) 

## UMAP
seob <- RunUMAP(seob, dims = 1:30)

## 可视化
head(seob@meta.data)

p1 <- DimPlot(seob, 
              reduction = "pca", # pca, umap, tsne
              group.by = "orig.ident",
              label = F)

p2 <- DimPlot(seob, 
              reduction = "tsne", # pca, umap, tsne
              group.by = "orig.ident",
              label = F)

p3 <- DimPlot(seob, 
              reduction = "umap", # pca, umap, tsne
              group.by = "orig.ident",
              label = F)

p1 + p2 + p3  + plot_layout(guides = "collect") & theme(legend.position = "top")
## 检查批次效应
DimPlot(seob, 
        reduction = "umap", # pca, umap, tsne
        #group.by = "sample",
        label = F)

# 检查连续型协变量：线粒体比例
# FeaturePlot(seob, 
#             reduction = "umap", # pca, umap, tsne
#             features = "percent.mt")

## 聚类分析 -----
seob <- FindNeighbors(seob,
                      k.param = 5,# 最小值为5
                      dims = 1:30)
seob <- FindClusters(seob,
                     resolution = 0.3, # 分辨率值越大，cluster 越多0~1之间
                     random.seed = 1) #设置随机种子

# 可视化
DefaultAssay(seob) <- "RNA"
p1 <- DimPlot(seob, 
              reduction = "pca", # pca, umap, tsne
              group.by  = "seurat_clusters",
              label = T)

p2 <- DimPlot(seob, 
              reduction = "tsne", # pca, umap, tsne
              group.by  = "seurat_clusters",
              label = T)

p3 <- DimPlot(seob, 
              reduction = "umap", # pca, umap, tsne
              group.by  = "seurat_clusters",
              label = T)

p1 + p2 + p3

如下所示：

可以看到，初步分成了20个亚群，接下来需要对它们进行生物学命名！

step4 细胞亚群注释

首先看文章里面的标记基因在各个亚群的表达情况：

## 细胞注释----
## 加载R包
library(SummarizedExperiment)
## 构建CellMarker文件
library(readr)
cell_marker <- read.table(file = "CellMarker.csv", 
                          sep = "\t",
                          header = T) 
cell_marker
## marker 基因表达

## 可视化
library(tidyr)
# cell_marker <- separate_rows(cell_marker, Cell_Marker, sep = ",") %>% # 转为长数据
#   distinct() %>% # 去除重复
#   arrange(Cell_Type) # 排序

p1 <- DimPlot(seob, 
              reduction = "umap", # pca, umap, tsne
              group.by  = "seurat_clusters",
              label = T)

p2 <- DotPlot(seob, features = unique(cell_marker)) +
  theme(axis.text = element_text(size = 8,# 修改基因名的位置
                                 angle = 90,
                                 hjust = 1))
p1 + p2

如下所示：

上面的构建CellMarker文件，需要自己看文献，拿到基因名字，而且是拟南芥的ID哦，这个需要自己进行一些转换！

高表达不同基因（构建CellMarker文件）的亚群可以根据生物学背景进行命名：

## 将细胞类型信息加到meat.data表中
## 构建细胞类型向量
cluster2type <- c("0"="None",
                  "1"="QC",
                  "2"="vascular tissue",
                  "3"="None",
                  "4"="Hair cells",
                  "5"="plasmodesma",
                  "6"="Non-hair cells",
                  "7"="Non-hair cells",
                  "8"="QC",
                  "9"="Endodermis",
                  "10"="epidermal cells", 
                  "11"="Stele",
                  "12"="Stele",
                  "13"="None",
                  "14"="None",
                  "15"=" Root meristem",
                  "16"="None",
                  "17"="Stele",
                  "18"="None",
                  "19"="Stele")

## 可视化
## 二维图
FeaturePlot(seob, 
            reduction = "umap", 
            features = c("AT1G28290", "AT1G12080", "AT2G40205", "AT3G55230"), 
            #split.by = "sample",
            label = TRUE) 

## 小提琴图
VlnPlot(object = seob, 
        features = c("AT1G28290", "AT1G12080")) + plot_layout(guides = "collect") & theme(legend.position = "top")

## 添加细胞类型

seob[['cell_type']] = unname(cluster2type[seob@meta.data$seurat_clusters])

DimPlot(seob, 
        reduction = "tsne",label.size = 6, 
        group.by = "cell_type",
        label = TRUE, pt.size = 0.5) + NoLegend()
## 保存数据
save(seob, file = "data/GSE96583/206.rdata")

最后拿到了有生物学意义的亚群：

后续还有更多分析，但是都强烈依赖于拟南芥的背景知识哦！

写在文末

咱们现在这个专栏《100个单细胞转录组数据降维聚类分群图表复现》分享的代码是到此为止，但是一般来说单细胞文章数据分析还有很多进阶图表制作，比如inferCNV看肿瘤拷贝数变异，monocle看拟时序等等。如果你也需要，可以加入我们这个专栏《100个单细胞转录组数据降维聚类分群图表复现》创作团队，获取进阶指引哦！

文末友情推荐

与十万人一起学生信，你值得拥有下面的学习班：

ixxmu / mp_duty