构建seurat对象之初就应该是把基因名字转换好

ixxmu commented 1 year ago

https://mp.weixin.qq.com/s/LmDjcIscBgYb7izKYPkVOg

ixxmu commented 1 year ago

构建seurat对象之初就应该是把基因名字转换好 by 生信技能树

看到单细胞转录组测序数据的文献：《Single-cell sequencing links multiregional immune landscapes and tissue-resident T cells in ccRCC to tumor topology and therapy efficacy》，是提供了配套数据，所以，我下载了ccRCC_6pat_Seurat 文件，居然是26G，非常巨大！其降维聚类分群和生物学命名后，如下所示：

降维聚类分群和生物学命名

可以看到，主要是T淋巴细胞，和髓系淋巴细胞，少部分b淋巴细胞，部分内皮细胞和成纤维细胞，以及上皮细胞。尤其是T淋巴细胞的各个亚群，非常清晰，值得学习！

我仔细看了看那个26G的ccRCC_6pat_Seurat 文件，，发现是旧版本的seurat对象，所以使用了下面的代码进行转换：

rm(list=ls())
options(stringsAsFactors = F)
library(Seurat) 
sce=readRDS('ccRCC_6pat_Seurat')
sce
sce=UpdateSeuratObject(sce)
sce
DimPlot(sce)
save(sce,file = 'first_sce.Rdata')

关键就是 UpdateSeuratObject 函数啦！我看了看，存储后的文件，就1个G啦，非常棒！

看了看这个对象里面的元素应有尽有，可以直接绘图如下：

降维聚类分群都OK了，接下来就是根据背景知识对这些细胞亚群进行生物学命名啦！

但是使用基因名字去可视化各个亚群，全部报错，因为它的基因居然是ensemb 数据库的ID ，并不是我们常见的基因symbol ：

> head(rownames(sce))
[1] "ENSG00000237683" "ENSG00000228463" "ENSG00000225880" "ENSG00000230368" "ENSG00000187634"
[6] "ENSG00000188976"

需要做一个转换，下面演示一下这个步骤：

首先拿到基因转换列表：

rm(list=ls())
library(Seurat)
load(file = 'first_sce.Rdata')
sce #   16323 features 
library(org.Hs.eg.db)
head(rownames(sce))
ids=select(org.Hs.eg.db,keys = rownames(sce),
           columns = c('ENSEMBL','SYMBOL'),
           keytype = 'ENSEMBL')
head(ids)
dim(ids) # [1] 16428 
ids=na.omit(ids)
dim(ids) # [1] 15504 
length(unique(ids$SYMBOL)) # [1] 15494 
# 这里的关系超级乱，互相之间都不是一对一 
# 凡是混乱的ID一律删除即可
ids=ids[!duplicated(ids$SYMBOL),]
ids=ids[!duplicated(ids$ENSEMBL),]
pos=match(ids$ENSEMBL,rownames(sce) )
sce=sce[pos,]
sce # 15393 features
# rownames(sce) = ids$SYMBOL

本来是准备使用 rownames(sce) = ids$SYMBOL 搞定这个 seurat对象的基因名字转换，但是失败了。

略微搜索了一下，https://github.com/satijalab/seurat/issues/2617 提到了一个函数：

# https://github.com/satijalab/seurat/issues/2617
# RenameGenesSeurat  ------------------------------------------------------------------------------------
RenameGenesSeurat <- function(obj , 
                              newnames ) { 
  # Replace gene names in different slots of a Seurat object. Run this before integration. Run this before integration. 
  # It only changes obj@assays$RNA@counts, @data and @scale.data.
  print("Run this before integration. It only changes obj@assays$RNA@counts, @data and @scale.data.")
  RNA <- obj@assays$RNA
  
  if (nrow(RNA) == length(newnames)) {
    if (length(RNA@counts)) RNA@counts@Dimnames[[1]]            <- newnames
    if (length(RNA@data)) RNA@data@Dimnames[[1]]                <- newnames
    if (length(RNA@scale.data)) RNA@scale.data@Dimnames[[1]]    <- newnames
  } else {"Unequal gene sets: nrow(RNA) != nrow(newnames)"}
  obj@assays$RNA <- RNA
  return(obj)
}
obj=RenameGenesSeurat(obj = sce, 
                  newnames = ids$SYMBOL)
save(sce,file = 'first_sce.Rdata')

蛮简单的，接下来就可以继续可视化啦！

# T Cells (CD3D, CD3E, CD8A), 
# B cells (CD19, CD79A, MS4A1 [CD20]), 
# Plasma cells (IGHG1, MZB1, SDC1, CD79A), 
# Monocytes and macrophages (CD68, CD163, CD14),
# NK Cells (FGFBP2, FCG3RA, CX3CR1),  
# Photoreceptor cells (RCVRN), 
# Fibroblasts (FGF7, MME), 
# Endothelial cells (PECAM1, VWF). 
# epi or tumor (EPCAM, KRT19, PROM1, ALDH1A1, CD24).
#   immune (CD45+,PTPRC), epithelial/cancer (EpCAM+,EPCAM), 
# stromal (CD10+,MME,fibo or CD31+,PECAM1,endo) 

library(ggplot2) 
genes_to_check = c('PTPRC', 'CD3D', 'CD3E', 'CD4','CD8A','CD19', 'CD79A', 'MS4A1' ,
                   'IGHG1', 'MZB1', 'SDC1',
                   'CD68', 'CD163', 'CD14', 
                   'TPSAB1' , 'TPSB2',  # mast cells,
                   'RCVRN','FPR1' , 'ITGAM' ,
                   'FGF7','MME', 'ACTA2',
                   'PECAM1', 'VWF', 
                   'EPCAM' , 'KRT19', 'PROM1', 'ALDH1A1' )
library(stringr)  
p_all_markers <- DotPlot(sce , features = genes_to_check,
                         assay='RNA'  )  + coord_flip()

p_all_markers
ggsave(plot=p_all_markers, filename="check_all_marker_by_seurat_cluster.pdf")

可以很明显的看到：

我们做肿瘤研究的单细胞数据，一般来说会选择初步很粗狂的定义大的细胞亚群，比如我常用的第一次分群是通用规则是：

immune (CD45+,PTPRC),
epithelial/cancer (EpCAM+,EPCAM),
stromal (CD10+,MME,fibo or CD31+,PECAM1,endo)

根据上面的标记基因列表，我给出来的亚群是：

# 需要自行看图，定细胞亚群：  
celltype=data.frame(ClusterID=0:34,
                    celltype='unkown')
celltype[celltype$ClusterID %in% c( 0,2:4,6:8,10,15,17,17,19,31,32),2]='Tcells' 
celltype[celltype$ClusterID %in% c( 24 ),2]='plasma'
celltype[celltype$ClusterID %in% c( 22 ),2]='naive-Bcells'
celltype[celltype$ClusterID %in% c(21),2]='mast'

celltype[celltype$ClusterID %in% c( 1,5,12:14,18,23,27,33,34),2]='myeloid'
celltype[celltype$ClusterID %in% c( 9,16,26 ),2]='epithelial'

celltype[celltype$ClusterID %in% c( 11),2]='endo' 
celltype[celltype$ClusterID %in% c(20),2]='fibo'  
celltype[celltype$ClusterID %in% c(25 ),2]='SMC'  

head(celltype)
celltype 
table(celltype$celltype)

出图如下：

之所以我们有一堆unknown，是因为我对这个ccRCC的肿瘤并不熟悉，我的背景认知就是3个大亚群！所以就非常的尴尬，空有数据宝山，可以对它做任意分析，但是却不知道这个数据的生物医学意义！！！

大家对于这样的困局有没有什么破局方法？欢迎留言讨论

前面的例子：人人都能学会的单细胞聚类分群注释，我们演示了第一层次的分群。

如果你对单细胞数据分析还没有基础认知，可以看基础10讲：

ixxmu commented 1 year ago

https://github.com/ixxmu/mp_duty/issues/3914

附近这几篇都是注释基因的方法多种多样

ixxmu / mp_duty

构建seurat对象之初就应该是把基因名字转换好 #3913

构建seurat对象之初就应该是把基因名字转换好 by 生信技能树