自行构建SingleR数据库

自行构建SingleR数据库 by 生信漫漫学

写在开头

上篇推文给SingleR自动注释的内容开了个头，浅浅介绍了一下两个R包——SingleR及数据库资源包celldex简介

那既然是学习了使用singleR基于自建数据库来自动化注释单细胞转录组亚群，今天就来整理一下如何基于参考注释数据集，来自行构型SingleR注释数据库叭！

学习视频参考：人工注释与singleR注释差异(视频好像没办法插入链接，大家有需要的话，自己去技能树的视频号找一下直播回放)

参考数据库来源

基于需要分析的数据集的文献中——人海绵体单细胞转录组图谱，文章中第一层次降维聚类分群得到七个主要簇，包括ECs、FBS、周细胞(PC)、巨噬细胞(MACs)、SMCs、施万细胞(SWCs)和T细胞(Ts)

我们的参考数据库应该要包含这些细胞亚群，所以选择的文章是2020年的：《A Single-Cell Atlas of the Human Healthy Airways》，利用单细胞RNA图谱研究呼吸道细胞群分布和转录变化。

文章简介及数据情况

文章标题：《A Single-Cell Atlas of the Human Healthy Airways》

发表日期和杂志：2020年发表在American Journal of Respiratory and Critical Care Medicine上

在线阅读链接：https://doi.org/10.1164/rccm.201911-2199oc

单细胞实验设计：

采用单细胞RNA图谱技术对10例健康活体志愿者的呼吸道上皮细胞异质性进行了研究。总共在35个不同的位置收集了77,969个细胞，从鼻子到呼吸道树的第12部分。

主要结果：

得到的图谱由高比例的上皮细胞(89.1%)组成，但也包括免疫细胞(6.2%)和基质细胞(4.7%)，这些细胞在呼吸道的不同区域具有不同的细胞比例。

它揭示了来自鼻部(MUC4，PI3，Six3)和气管-支气管部(SCGB1A1，TFF3)的相同细胞类型(基底细胞，分泌细胞和多纤毛细胞)之间的差异基因表达。

亚群里面也包含了我们目标数据集中注释到的大部分亚群

数据链接：

https://www.genomique.eu/cellbrowser/HCA/

选择Data Download下载我们需要的数据即可

构型SingleR注释数据库

参考推文：生信技能树使用singleR基于自建数据库来自动化注释单细胞转录组亚群

读取数据降维聚类分群

因为文章提供了原始质控后的细胞计数矩阵表格，以及meta.tsv的注释信息，我们基于Raw_exprMatrix.tsv.gz信息可以创建seurat对象，然后质控和降维聚类分群流程。

rm(list=ls())
options(stringsAsFactors = F) 
source('scRNA_scripts/lib.R')
getwd()

###### step1: 导入数据 ######   
ct=fread( 'Raw_exprMatrix.tsv.gz',data.table = F)
ct[1:4,1:4]
rownames(ct)=ct[,1]
ct=ct[,-1] 

sce.all =CreateSeuratObject(counts =  ct , 
                            min.cells = 5,
                            min.features = 300 )

dim(ct)
dim(sce.all)

as.data.frame(sce.all@assays$RNA$counts[1:10, 1:2])
head(sce.all@meta.data, 10)
table(sce.all$orig.ident)
Idents(sce.all)


sp='human'

###### step2: QC质控 ######
dir.create("./1-QC")
setwd("./1-QC")
# 如果过滤的太狠，就需要去修改这个过滤代码
source('../scRNA_scripts/qc.R')
sce.all.filt = basic_qc(sce.all)
print(dim(sce.all))
print(dim(sce.all.filt))
setwd('../')
getwd()

###### step3: harmony整合多个单细胞样品 ######

if(T){
  dir.create("2-harmony")
  getwd()
  setwd("2-harmony")
  source('../scRNA_scripts/harmony.R')
  # 默认 ScaleData 没有添加"nCount_RNA", "nFeature_RNA"
  # 默认的
  sce.all.int = run_harmony(sce.all.filt)
  setwd('../')
  
}


###### step4:  看标记基因库 ######
# 原则上分辨率是需要自己肉眼判断，取决于个人经验
# 为了省力，我们直接看  0.1和0.8即可

table(Idents(sce.all.int))
table(sce.all.int$seurat_clusters)
table(sce.all.int$RNA_snn_res.0.1) 
table(sce.all.int$RNA_snn_res.0.8) 

getwd()
dir.create('check-by-0.1')
setwd('check-by-0.1')
sel.clust = "RNA_snn_res.0.1"
sce.all.int <- SetIdent(sce.all.int, value = sel.clust)
table(sce.all.int@active.ident) 

source('../scRNA_scripts/check-all-markers.R')
setwd('../') 
getwd()

dir.create('check-by-0.5')
setwd('check-by-0.5')
sel.clust = "RNA_snn_res.0.5"
sce.all.int <- SetIdent(sce.all.int, value = sel.clust)
table(sce.all.int@active.ident) 
source('../scRNA_scripts/check-all-markers.R')
setwd('../') 
getwd()

dir.create('check-by-0.8')
setwd('check-by-0.8')
sel.clust = "RNA_snn_res.0.8"
sce.all.int <- SetIdent(sce.all.int, value = sel.clust)
table(sce.all.int@active.ident) 
source('../scRNA_scripts/check-all-markers.R')
setwd('../') 
getwd()

获取需要的单细胞亚群

基于文章提供的meta.tsv可以获取细胞亚群的注释信息，具体可参考如何利用文献中的细胞注释信息

#读取文章提供的细胞亚群注视信息
  phe2=data.table::fread('supplements/meta.tsv',data.table = F)
  head(phe2)
  rownames(phe2)=phe2$Id
  table(phe2$Donor)
  head(phe2) 
  
  phe=sce.all.int@meta.data
  head(phe)
  ids=intersect(rownames(phe),rownames(phe2))
  g1=phe[ids,]$RNA_snn_res.0.8
  g2=phe2[ids,]$CellType
  table(g2)

  tmp =  table(g1 ,g2 );tmp
  gplots::balloonplot(
    table(g1 ,g2 )
  )
  
#结合目标文件选择需要的亚群
  chooseCT = c('Fibroblast','Smooth muscle','Pericyte','Endothelial',
               'Macrophage','Monocyte','Dendritic','LT/NK','B cells','Plasma cells')
  tmp  =  phe2[ids,]
  tmp  = tmp[tmp$CellType %in% chooseCT,] #选择我们需要的内容
  table(tmp$CellType)
  
#保存我们需要的内容
  sce.singleR=sce.all.int[,colnames(sce.all.int) %in% tmp$Id]
  
#给我们的变量命名并保存变量
  sce.singleR$paper =  tmp[match(colnames(sce.singleR),tmp$Id),'CellType']
  DimPlot(sce.singleR,group.by = 'paper',label = T,repel = T)
  save(sce.singleR,file = 'sce.singleR.Rdata')
  
as.data.frame(sort(table(sce.singleR$paper)))

文章全部亚群注释信息

结合目标文件选择后的亚群信息

最终保存下来的注释信息

和原推文使用singleR基于自建数据库来自动化注释单细胞转录组亚群，数量有出入是因为展示用的全部数据，没有进行抽样。

整理格式并创建适配singleR的数据库

将保存下来的注释信息整理成singleR需要的数据格式，SingleR包里面的SingleR函数需要的格式（SingleCellExperiment）

#转换数据格式，构建适配singleR算法的数据库文件
load(file = 'sce.singleR.Rdata')
sce = sce.singleR
colnames(sce@meta.data)
Idents(sce) = sce$paper
table(Idents(sce) )
##NOTE：以前是AverageExpression,它用于计算每个细胞在不同cluster的基因表达均值
av <- AggregateExpression(sce ,      # group.by = "celltype",
                         assays = "RNA") 
Ref = av[[1]]

head(Ref)
ref_sce = SingleCellExperiment::SingleCellExperiment(assays=list(counts=Ref))

library(scater)
ref_sce=scater::logNormCounts(ref_sce)

library(SingleCellExperiment)
logcounts(ref_sce)[1:4,1:4]
colData(ref_sce)$Type=colnames(Ref)
table(colnames(Ref))

ref_sce

#保存 ref_sce 变量
save(ref_sce,file = 'HCA_airway_ref_sce.Rdata')

得到ref_sce 变量，就可以使用SingleR包里面的SingleR函数，对我们的目标分析数据进行注释了。

小tips：在构建数据库文件时候使用AggregateExpression计算每个细胞在不同cluster的基因表达均值,曾老师让我对比看看不取均值的情况。

等我先摸索看看，大家也可以对比一下计算均值和不计算均值的区别，那今天就先浅浅的整理到这里叭！

ixxmu / mp_duty