单细胞各个亚群特异性高表达基因的数据库注释（包括GO,KEGG,ReactomePA）

单细胞各个亚群特异性高表达基因的数据库注释（包括GO,KEGG,ReactomePA） by 生信技能树

现在单细胞技术的大行其道，让我们的表达量矩阵里面的列也变成了成百上千甚至过万了，首先它们属于不同的样品，如果是10x技术的单细胞转录组，每个样品可以有5到8千的单细胞列，其次呢我们通常是做多个单细胞样品，详见：2个分组的单细胞项目标准分析，这样的话不同样品也可以有表型分组，比如处理组和对照组。

拿到了一个单细胞表达量矩阵，默认需要进行：单细胞聚类分群注释，如果你对单细胞数据分析还没有基础认知，可以看基础10讲：

找各个单细胞亚群特异性高表达量基因

主流的方法是FindAllMarkers函数，但是它比较耗费计算机资源和时间，也可以采用cosg方法，省时省力。

我们以大家熟知的pbmc3k数据集为例，大家先安装这个数据集对应的包，并且对它进行降维聚类分群，参考前面的例子：人人都能学会的单细胞聚类分群注释，而且每个亚群找高表达量基因，都存储为Rdata文件。标准代码是：

library(SeuratData) #加载seurat数据集  
getOption('timeout')
options(timeout=10000)
#InstallData("pbmc3k")  
data("pbmc3k")  
sce <- pbmc3k.final  
library(Seurat)
table(Idents(sce))
DimPlot(sce,label = T)

sce.markers <- FindAllMarkers(object = sce, only.pos = TRUE, 
                              min.pct = 0.25, 
                              thresh.use = 0.25)
library(dplyr) 
top3 <- sce.markers %>% group_by(cluster) %>% top_n(3, avg_log2FC)
DoHeatmap(sce ,top3$gene,size=3)

会得到热图，展现每个单细胞亚群特异性高表达量基因的前3个，它其实就是一种差异分析了，这个时候它对比的是每个单细胞亚群和所有的其它细胞。

如果是使用COSG包的cosg函数，是另外的格式的输出各个单细胞亚群的top基因：

  input_sce = sce
  table(Idents(input_sce))
  pro = 'cosg_seurat_clusters'
  
  if(T){
    
    library(COSG)
    marker_cosg <- cosg(
      input_sce,
      groups='all',
      assay='RNA',
      slot='data',
      mu=1,
      n_genes_user=100)
    
    save(marker_cosg,file = paste0(pro,'_marker_cosg.Rdata'))
    head(marker_cosg)
    
    ## Top10 genes
    library(dplyr)  
    top_10 <- unique(as.character(apply(marker_cosg$names,2,head,10)))
 
    sce.Scale <- ScaleData(input_sce ,features =  top_10  )  
    
    DoHeatmap(  sce.Scale , top_10 , 
   
    ## Top3 genes
    top_3 <- unique(as.character(apply(marker_cosg$names,2,head,3)))
     
  }

然后数据库注释（包括GO,KEGG,ReactomePA）

我这里自己写了一个简单的函数， com_go_kegg_ReactomePA_human ，可以根据输入的基因名字列表去批量做数据库注释（包括GO,KEGG,ReactomePA），这个函数 com_go_kegg_ReactomePA_human 的代码如下所示：



com_go_kegg_ReactomePA_human <- function(symbols_list ,pro){ 
  library(clusterProfiler)
  library(org.Hs.eg.db)
  library(ReactomePA)
  library(ggplot2)
  library(stringr)
  
  # 首先全部的symbol 需要转为 entrezID
  gcSample = lapply(symbols_list, function(y){ 
    y=as.character(na.omit(select(org.Hs.eg.db,
                                  keys = y,
                                  columns = 'ENTREZID',
                                  keytype = 'SYMBOL')[,2])
    )
    y
  })
  gcSample
  
  # 第1个注释是 KEGG 
  xx <- compareCluster(gcSample, fun="enrichKEGG",
                       organism="hsa", pvalueCutoff=0.05)
  dotplot(xx)  + 
    scale_y_discrete(labels=function(x) str_wrap(x, width=50)) 
  ggsave(paste0(pro,'_kegg.pdf'),width = 10,height = 8)
  
  # 第2个注释是 ReactomePA 
  xx <- compareCluster(gcSample, fun="enrichPathway",
                       organism = "human",
                  pvalueCutoff=0.05)
  dotplot(xx)  + 
    scale_y_discrete(labels=function(x) str_wrap(x, width=50)) 
  ggsave(paste0(pro,'_ReactomePA.pdf'),width = 10,height = 8)
  
  # 然后是GO数据库的BP,CC,MF的独立注释
  # Run full GO enrichment test for BP 
  formula_res <- compareCluster(
    gcSample, 
    fun="enrichGO", 
    OrgDb="org.Hs.eg.db",
    ont     = "BP",
    pAdjustMethod = "BH",
    pvalueCutoff  = 0.05,
    qvalueCutoff  = 0.05
  )
  
  # Run GO enrichment test and merge terms 
  # that are close to each other to remove result redundancy
  lineage1_ego <- simplify(
    formula_res, 
    cutoff=0.5, 
    by="p.adjust", 
    select_fun=min
  ) 
  pdf(paste0(pro,'_GO_BP_cluster_simplified.pdf') ,width = 15,height = 8)
  print(dotplot(lineage1_ego, showCategory=5) + 
          scale_y_discrete(labels=function(x) str_wrap(x, width=50)) )
  dev.off() 
  write.csv(lineage1_ego@compareClusterResult, 
            file=paste0(pro,'_GO_BP_cluster_simplified.csv'))
  # Run full GO enrichment test for CC 
  formula_res <- compareCluster(
    gcSample, 
    fun="enrichGO", 
    OrgDb="org.Hs.eg.db",
    ont     = "CC",
    pAdjustMethod = "BH",
    pvalueCutoff  = 0.05,
    qvalueCutoff  = 0.05
  )
  
  # Run GO enrichment test and merge terms 
  # that are close to each other to remove result redundancy
  lineage1_ego <- simplify(
    formula_res, 
    cutoff=0.5, 
    by="p.adjust", 
    select_fun=min
  ) 
  pdf(paste0(pro,'_GO_CC_cluster_simplified.pdf') ,width = 15,height = 8)
  print(dotplot(lineage1_ego, showCategory=5) + 
          scale_y_discrete(labels=function(x) str_wrap(x, width=50)) )
  dev.off() 
  write.csv(lineage1_ego@compareClusterResult, 
            file=paste0(pro,'_GO_CC_cluster_simplified.csv'))
  
  # Run full GO enrichment test for MF 
  formula_res <- compareCluster(
    gcSample, 
    fun="enrichGO", 
    OrgDb="org.Hs.eg.db",
    ont     = "MF",
    pAdjustMethod = "BH",
    pvalueCutoff  = 0.05,
    qvalueCutoff  = 0.05
  )
  
  # Run GO enrichment test and merge terms 
  # that are close to each other to remove result redundancy
  lineage1_ego <- simplify(
    formula_res, 
    cutoff=0.5, 
    by="p.adjust", 
    select_fun=min
  ) 
  pdf(paste0(pro,'_GO_MF_cluster_simplified.pdf') ,width = 15,height = 8)
  print(dotplot(lineage1_ego, showCategory=5) + 
          scale_y_discrete(labels=function(x) str_wrap(x, width=50)) )
  dev.off() 
  write.csv(lineage1_ego@compareClusterResult, 
            file=paste0(pro,'_GO_MF_cluster_simplified.csv'))
  
}

这样的话，无论是什么样的项目，使用起来都非常简洁：

symbols_list <-  as.list(as.data.frame(apply(marker_cosg$names,2,head,100)))
source('com_go_kegg_ReactomePA_human.R')
com_go_kegg_ReactomePA_human(symbols_list, pro='pbmc' )

可以简单的看看结果，首先是kegg数据库的注释，可以得到结果的可解释性还是蛮强的：

kegg数据库的注释

然后是ReactomePA数据库的注释结果，也是很合理啦：

ReactomePA数据库

后面还有GO数据库的CC,BP,MF的图，我就不一一展示了，因为这个PBMC非常出名，但凡是有这方面生物学背景的，都很容易理解这些单细胞各个亚群特异性高表达基因的数据库注释（包括GO,KEGG,ReactomePA），可以作为你单细胞分群准确性的一个辅助证明。

文末友情推荐：毛遂自荐成为你的单细胞顾问

联系方式详见：毛遂自荐成为你的单细胞顾问

单细胞数据标准分析我们做的很多，但是无穷无尽的个性化分析，我们只能做到模仿，很难创新。而且我不是汤富酬张泽民这样的单细胞旗手，仅仅是带领学徒做了一下单细胞文献图表复现，只能做到模仿。文献有的我才会，不能凭空创造概念，只能说把自己看过的几百篇单细胞文献的共性整理，基于它们来对你进行指导哦。

作为单细胞顾问，我可以提供的服务仅限于：

你可以发5~10篇相关领域文献给我，预计耗时3小时我简单读一下，认识你的课题背景，收费3000块钱
你把自己的单细胞转录组数据分析做一个ppt给我，开腾讯会议我听你的讲解，互动一个小时，讨论一个小时，合计收费 2000元。
单细胞数据分析流程检查（降维聚类分群，亚群比例差异，表达量差异），耗时半个小时，收费2000块钱
其它高级分析的debug，单项收费1600

ixxmu / mp_duty