seurat v5全流程—harmmony整合+标准分析+细胞注释+批量差异、富集分析(seurat读取多个txt文件)

seurat v5全流程—harmmony整合+标准分析+细胞注释+批量差异、富集分析(seurat读取多个txt文件) by 生信菜鸟团

大家好，本推文是为了测试流程的代码，我在Jimmy老师的代码中比较难理解的地方做了注释，富集分析部分做了魔改，欢迎点赞收藏学习。后续还会加上cellchat细胞通讯、monocle拟时序分析、pyscenic转录因子分析。

后面会有一场直播：如何学习、使用本代码。可以关注下。

背景：升主动脉瘤（ATAA）是由主动脉壁逐渐变弱和扩张引起的，可能导致主动脉夹层、破裂和其他危及生命的并发症。为了提高我们对ATAA发病机制的理解，我们试图全面描述升主动脉壁的细胞组成，并鉴定人类ATAA组织中每个细胞群体的分子变化。

方法：我们对来自11名研究参与者的升主动脉组织进行了单细胞RNA测序（sc-RNAseq）分析，其中包括8名ATAA患者（4名女性和4名男性）和3名对照组（2名女性和1名男性）。利用sc-RNAseq数据对从主动脉组织中提取的细胞进行分析和分类，以执行聚类鉴定。然后通过比较ATAA和对照组组织中每个细胞类型的比例和基因表达谱来检查与ATAA相关的变化。我们还通过将我们的sc-RNAseq数据与公开的全基因组关联研究（GWAS）数据进行整合分析，确定可能对ATAA至关重要的基因。

结果：我们鉴定了人类升主动脉组织中的11种主要细胞类型；对这些细胞进行高分辨率的再聚类将其细分为40个亚型。平滑肌细胞、巨噬细胞和T淋巴细胞观察到多个亚型，表明这些细胞在主动脉壁中具有多个功能性群体。一般而言，ATAA组织中非免疫细胞较少，免疫细胞尤其是T淋巴细胞较多，与对照组组织相比。差异基因表达数据提示ATAA组织中存在广泛的线粒体功能障碍。此外，我们的sc-RNAseq数据与公开的GWAS数据和启动子捕获Hi-C数据的整合分析表明ERG（ETS相关基因）在维持正常主动脉壁功能中发挥重要作用。

在这篇文章其中，我们使用其中两个样本进行本次全流程代码测试：

GSM4704934

GSM4704931

根据GEO的分组信息，我们可以看到GSM4704931是control组，那另外一个就是ATAA组

本推文主要分为四个部分：

第一步：导入数据，harmony整合样本降维聚类分群后，细胞注释
第二步：基于cluster或者celltype的批量差异分析，批量绘制火山图
第三步：对差异基因进行批量富集分析：GO、KEGG
第四步：展示不同组别中的细胞比例及其他基础图形

第一步：导入数据，harmony整合样本降维聚类分群后，细胞注释

## ### ---------------###### Create: Jianming Zeng### Date:  2023-01-16 20:53:22### Email: jmzeng1314@163.com### Blog: http://www.bio-info-trainee.com/### Forum:  http://www.biotrainee.com/thread-1376-1-1.html### CAFS/SUSTC/Eli Lilly/University of Macau### Update Log: 2023-01-16  First version ###### ---------------
#########################设置我自己的r包加载路径，通常你不需要运行这段代码-----.libPaths(c(  '/home/rootyll/seurat_v5/',  "/usr/local/lib/R/site-library", "/usr/lib/R/site-library", "/usr/lib/R/library"));print(getwd())###################################################################################333



rm(list=ls())options(stringsAsFactors = F) source('scRNA_scripts/lib.R')
###### step1:导入数据 ######   # 付费环节 800 元人民币 # https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE155468dir='inputs/' samples=list.files( dir ,pattern = 'gz')samples library(data.table)ctList = lapply(samples,function(pro){   # pro=samples[1]   print(pro)  ct=fread(file.path( dir ,pro),data.table = F)  ct[1:4,1:4]  rownames(ct)=ct[,1]  colnames(ct) = paste(gsub('.txt.gz','',pro),                       colnames(ct) ,sep = '_')  ct=ct[,-1]   return(ct)})
#1检查行列数目，目的是找交集基因-----lapply(ctList, dim)
tmp =table(unlist(lapply(ctList, rownames)));head(tmp)
cg = names(tmp)[tmp==length(samples)];head(cg)
#2合并矩阵-----bigct = do.call(cbind,                lapply(ctList,function(ct){                   ct = ct[cg,]                   return(ct)                }))dim(bigct)
sce.all=CreateSeuratObject(counts =  bigct,                            min.cells = 5,                           min.features = 300)sce.allas.data.frame(sce.all@assays$RNA$counts[1:10, 1:2])head(sce.all@meta.data, 10)table(sce.all@meta.data$orig.ident) 

###### step2:QC质控 ######dir.create("./1-QC")setwd("./1-QC")# 如果过滤的太狠，就需要去修改这个过滤代码source('../scRNA_scripts/qc.R')sce.all.filt = basic_qc(sce.all)print(dim(sce.all))print(dim(sce.all.filt))setwd('../')

sp='human'
###### step3: harmony整合多个单细胞样品 ######dir.create("2-harmony")getwd()setwd("2-harmony")source('../scRNA_scripts/harmony.R')# 默认 ScaleData 没有添加"nCount_RNA", "nFeature_RNA"# 默认的sce.all.int = run_harmony(sce.all.filt)setwd('../')

###### step4: 降维聚类分群和看标记基因库 ####### 原则上分辨率是需要自己肉眼判断，取决于个人经验# 为了省力，我们直接看  0.1和0.8即可table(Idents(sce.all.int))table(sce.all.int$seurat_clusters)table(sce.all.int$RNA_snn_res.0.1) table(sce.all.int$RNA_snn_res.0.8) 
getwd()dir.create('check-by-0.1')setwd('check-by-0.1')sel.clust = "RNA_snn_res.0.1"sce.all.int <- SetIdent(sce.all.int, value = sel.clust)table(sce.all.int@active.ident) 
source('../scRNA_scripts/check-all-markers.R')setwd('../') getwd()
print(getwd())dir.create('check-by-0.8')setwd('check-by-0.8')sel.clust = "RNA_snn_res.0.8"sce.all.int <- SetIdent(sce.all.int, value = sel.clust)table(sce.all.int@active.ident) source('../scRNA_scripts/check-all-markers.R')setwd('../') getwd()
last_markers_to_check

###### step5: 确定单细胞亚群生物学名字 ####### 一般来说，为了节省工作量，我们选择0.1的分辨率进行命名# 因为命名这个步骤是纯人工 操作# 除非0.1确实分群太粗狂了，我们就选择0.8 source('scRNA_scripts/lib.R')#sce.all.int = readRDS('2-harmony/sce.all_int.rds')colnames(sce.all.int@meta.data) table(sce.all.int$RNA_snn_res.0.8)



# 付费环节 800 元人民币print(getwd())if(T){  sce.all.int  table(Idents(sce.all.int))  #选择res0.1 分辨率，进行细胞亚群命名-----  Idents(sce.all.int)=sce.all.int$RNA_snn_res.0.1  table(Idents(sce.all.int))


  celltype=data.frame(ClusterID=0:8 ,                      celltype= 0:8 )   #定义细胞亚群          celltype[celltype$ClusterID %in% c( 0, 8 ),2]='Tcells'  celltype[celltype$ClusterID %in% c(3 ),2]='NK'   
  celltype[celltype$ClusterID %in% c( 7 ),2]='Bcells'   celltype[celltype$ClusterID %in% c( 2,5 ),2]='myeloids'    celltype[celltype$ClusterID %in% c(6,1 ),2]='fibro'  celltype[celltype$ClusterID %in% c( 4 ),2]='endo'  

  head(celltype)  celltype  table(celltype$celltype)  sce.all.int@meta.data$celltype = "NA"
  for(i in 1:nrow(celltype)){    sce.all.int@meta.data[which(sce.all.int@meta.data$RNA_snn_res.0.1 == celltype$ClusterID[i]),'celltype'] <- celltype$celltype[i]}  Idents(sce.all.int)=sce.all.int$celltype
  sel.clust = "celltype"  sce.all.int <- SetIdent(sce.all.int, value = sel.clust)  table(sce.all.int@active.ident) 
  dir.create('check-by-celltype')  setwd('check-by-celltype')  source('../scRNA_scripts/check-all-markers.R')  setwd('../')   getwd()
  # p_all_markers=DotPlot(sce.all.int, group.by = 'celltype',  #                       features = gene_list,  #                       scale = T,assay='RNA' )+    #   theme(axis.text.x=element_text(angle=45,hjust = 1))  # p_all_markers  # ggsave('check_paper_markers_celltype.pdf',  #        height = 8,width = 12)  print(getwd())}
#在meta信息中添加分组信息sce.all.int@meta.data$group = ifelse(  sce.all.int@meta.data$orig.ident %in% c( 'GSM4704931'),'control','ATAA'  )phe=sce.all.int@meta.data;head(phe);print(getwd())
save(phe,file = 'phe.Rdata')
pdf('orig.ident-vs-celltype.pdf',height = 18)gplots::balloonplot(table(sce.all.int$celltype,sce.all.int$orig.ident))dev.off()

会得到如下文件

画圈圈的主要用于细胞命名，这里主要是注释大类亚群，如果你想要个性化注释亚群：

第二步：基于cluster或者celltype的差异分析，绘制火山图


#########################设置我自己的r包加载路径，通常你不需要运行这段代码-----.libPaths(c(  '/home/rootyll/seurat_v5/',  "/usr/local/lib/R/site-library", "/usr/lib/R/site-library", "/usr/lib/R/library"));print(getwd())###################################################################################333


library(Seurat)

#setwd('../')
#1 在主文件夹下-----source('scRNA_scripts/lib.R')print(getwd())print(.libPaths())

#setwd("../")sce.all.int = readRDS('2-harmony/sce.all_int.rds')


sp='human'colnames(sce.all.int@meta.data) table(sce.all.int@meta.data$RNA_snn_res.0.8)table(sce.all.int@meta.data$RNA_snn_res.0.1)
# 分辨率取决于实际情况# 简化，就选择 RNA_snn_res.0.1# gplots::balloonplot(table(sce.all.int@meta.data$RNA_snn_res.0.8,sce.all.int@meta.data$orig.ident))# gplots::balloonplot(table(sce.all.int@meta.data$RNA_snn_res.0.1,sce.all.int@meta.data$orig.ident))#1差异分析时，使用group分组信息，去文献里面看--------table(sce.all.int$orig.ident)sce.all.int@meta.data$group = ifelse(  sce.all.int@meta.data$orig.ident %in% c( 'GSM4704931'),'control','ATAA'  )

table(sce.all.int@meta.data$group) #差异分析时，使用group分组信息--------getwd()
load('phe.Rdata');head(phe)
sce.all.int@meta.data$celltype = phe$celltypesce.all = sce.all.intgetwd()dir.create("./4_deg",recursive = T)setwd("./4_deg/")getwd()


#获取每个细胞类型的barcodes------cell_list = split(colnames(sce.all),sce.all@meta.data$celltype)cell_listnames(cell_list)
table(sce.all$group,sce.all$celltype)print(getwd())
dir.create("./by_celltype")setwd("./by_celltype/")getwd()library(ggplot2)
degs_list=list()for ( pro in names(cell_list) ) {  # pro =  names(cell_list)[10]  print(pro)
  sce= sce.all[, colnames(sce.all) %in%  cell_list[[pro]]]  sce  sce <- CreateSeuratObject(counts = sce@assays$RNA$counts,                             meta.data = sce@meta.data,                             min.cells = 3,                             min.features = 200)    sce <- NormalizeData(sce)    sce = FindVariableFeatures(sce)  sce = ScaleData(sce,                   vars.to.regress = c("nFeature_RNA",                                      "percent_mito"))
  Idents(sce)=sce$group  table(Idents(sce))



if (    length(  table(Idents(sce))  ) ==1 )  next  {    # 如果细胞 种类数量不为1     print(paste0("注意，你的组别有 ",length(  table(Idents(sce))  ) ,' 个。我们下面进行差异分析'))  #-       print(paste0('您的组别是：' ,unique(sce$group),"   "))    # 提示用户输入     # control_name <- readline("请输入您的control组名称(如不知道名称，请按ESC，退出) : ")    # exp_name <- readline("请输入您exp组名称(如不知道名称，请按ESC，退出) :")    #     control_name=unique(sce$group)[1]    exp_name=unique(sce$group)[2]    degs=FindMarkers(sce,ident.1 = control_name,ident.2 = exp_name)    degs$gene=rownames(degs);degs$symbol=rownames(degs)    degs$comparation=paste0(control_name,'_vs_',exp_name)    degs$celltype=pro    degs$cluster=degs$celltype    print(head(degs))
    res=degs       p=EnhancedVolcano::  EnhancedVolcano(res,                       lab = res$symbol,                       x = 'avg_log2FC',                       y = 'p_val',pCutoff = 0.05,                       FCcutoff =1) #######这里默认使用|fc|为1---------       ggsave(file = paste0(pro,'deg_volcano.pdf' ),plot = p,width = 8,height = 6)

       degs_list[[paste(pro)]]=degs
       library(dplyr)        sce.markers =degs

       # 筛选出p_val_adj小于0.05的数据       filtered_data <- sce.markers[sce.markers$p_val_adj < 0.05, ] 
       # 提取avg_log2FC列的前10行和后10行       top_rows <-  filtered_data %>% group_by(cluster ) %>% slice_max(avg_log2FC,n = 10)         bottom_rows <-  filtered_data %>% group_by(cluster ) %>% slice_min(avg_log2FC,n = 10)  
       # 合并前10行和后10行       top10 <- rbind(top_rows, bottom_rows)




        DoHeatmap(sce,top10$gene,size=3,draw.lines=F)       ggsave(filename=paste0(pro,'_sce.markers_heatmap.pdf'),              height = 3,width = 5)
       p <- DotPlot(sce , features = unique(top10$gene),                    assay='RNA'  )  + coord_flip()
       p       ggsave(plot=p,               filename=paste0("check_top10-degs_marker_by_",                              pro,"_cluster.pdf"), height = 4,width = 6 );print(getwd())
  }
}
degs_allcelltype_df=do.call(rbind,degs_list)degs_allcelltype_df %>%head()

write.csv(degs_allcelltype_df,file = "./degs_allcelltype_df.csv")saveRDS(degs_allcelltype_df,file = "./allcelltype_degs_sce.markers.Rdata")



head(degs_allcelltype_df);print(getwd())


library(scRNAtoolVis)#library(colourpicker)degs_allcelltype_df$cluster=degs_allcelltype_df$celltypep2=scRNAtoolVis::jjVolcano(diffData = degs_allcelltype_df,                        log2FC.cutoff =  2,                        adjustP.cutoff = 0.05,          legend.position = c(0.93, 0.99),           topGeneN=0,#top genes to be labeled in plot, default 5.        #  cluster.order=seq(0,23,1),          pSize=0.4,          tile.col = c("#EE3B3B", "#FF7F00", "#CD6600", "#8B2323", "#DEB887", "#76EEC6",                        "#F0FFFF", "#008B8B", "#FFB90F", "#F5F5DC", "#1F1F1F", "#66CD00",                        "#0000FF", "#97FFFF", "#528B8B", "#9400D3", "#EE1289", "#00BFFF",                        "#00FF00", "#191970", "#FFFF00", "#4A708B", "#00FF7F", "#8B8B00",                        "#FF1493", "#FFA500", "#8B4513")) 


ggsave(plot = p2,filename = "allcelltypes_degs_volcano.pdf",width = 9,height = 6)print(getwd())
setwd('../../')
#

最后还贴心的对差异基因进行了汇总。同时你可以根据保存的rdata文件进行更个性化的绘图，如下图

第三步：对差异基因进行批量富集分析：GO、KEGG


#########################设置我自己的r包加载路径，通常你不需要运行这段代码-----.libPaths(c(  '/home/rootyll/seurat_v5/',  "/usr/local/lib/R/site-library", "/usr/lib/R/site-library", "/usr/lib/R/library"));print(getwd())###################################################################################333




# 1.加载R包 ----
rm(list = ls())options(stringsAsFactors = F)library(clusterProfiler)# 这里没有判断物种选择，默认人类# 如果是其他，需要修改不少地方library(org.Hs.eg.db) library(dplyr) library(Seurat)
# 2.设置路径 ----#setwd("../")getwd() #在主目录-------
dir.create("5_GO_KEGG")setwd("5_GO_KEGG/")
# 3.all 读取数据富集分析----## 3.1 kegg by cluster 可视化 ----
f = '../check-by-0.1/qc-_marker_cosg.Rdata'
if(   file.exists(f)   ){
  load(file = f);print(getwd())  lapply(marker_cosg,head)  # sce.markers=sce.markers[sce.markers$avg_log2FC > 0,]  # top100 <- sce.markers %>% group_by(cluster) %>% top_n(1000, avg_log2FC)  top100 <- unique(as.character(   apply(marker_cosg$names,2,head,100 )    ))  top100 <- as.list(marker_cosg$names)
  top100 <-  data.frame( cluster=rep(names(top100),                                     each=unlist(lapply(top100, length))  ),              gene=unlist(top100)      )
  table(top100$cluster)   head(top100)   library(ggplot2)  ids= bitr(top100$gene,'SYMBOL','ENTREZID','org.Hs.eg.db')  top100=merge(top100,ids,by.x='gene',by.y='SYMBOL');print(head(top100))
  gcSample=split(top100$ENTREZID, top100$cluster)  gcSample # entrez id , compareCluster  xx <- compareCluster(gcSample, fun="enrichGO",                       OrgDb='org.Hs.eg.db');go=xx

  p=dotplot(xx)   p+ theme(axis.text.x = element_text(angle = 45,                                       vjust = 0.5, hjust=0.5))
  ggsave('compareCluster_enrichgo_-top100-cluster.pdf',width = 18,height = 8,plot = p)


  xx <- compareCluster(gcSample, fun="enrichKEGG",                       organism="hsa");kegg=xx  p=dotplot(xx)   p+ theme(axis.text.x = element_text(angle = 45,                                       vjust = 0.5, hjust=0.5))  ggsave('compareCluster_enrichKEGG_-top100-cluster.pdf',width = 18,height = 8,plot = p)

  save(kegg,go,file = "kegg_go_cluster.Rdata")
}
print(getwd())## 3.2 kegg by celltype 可视化 ----
f = '../check-by-celltype/qc-_marker_cosg.Rdata'
if(  file.exists(f)  ){  print(getwd())
  load(file = f )  # sce.markers=sce.markers[sce.markers$avg_log2FC > 0,]  # top100 <- sce.markers %>% group_by(cluster) %>% top_n(1000, avg_log2FC)  top100 <- unique(as.character(apply(marker_cosg$names,2,head,100)))  top100 <- as.list(marker_cosg$names)  top100 <-  data.frame(cluster=rep(names(top100),each=unlist(lapply(top100, length))),                        gene=unlist(top100))  table(top100$cluster)   head(top100)   library(ggplot2)  ids=bitr(top100$gene,'SYMBOL','ENTREZID','org.Hs.eg.db')  top100=merge(top100,ids,by.x='gene',by.y='SYMBOL')  gcSample=split(top100$ENTREZID, top100$cluster)  gcSample # entrez id , compareCluster   xx <- compareCluster(gcSample, fun="enrichKEGG",                       organism="hsa")  kegg=xx  p=dotplot(xx)   p+ theme(axis.text.x = element_text(angle = 90,                                       vjust = 0, hjust=1))  ggsave('compareCluster_enrichKEGG-top100-celltype.pdf',width = 12,height = 12)  ## 3.3 Go by cluster 可视化 ----  gcSample  xx <- compareCluster(gcSample, fun="enrichGO",                       OrgDb='org.Hs.eg.db');go=xx  summary(xx)  p=dotplot(xx)   p+ theme(axis.text.x = element_text(angle = 90,                                       vjust = 0, hjust=1))  ggsave('compareCluster-enrichGO-top100-celltype.pdf',width = 15,height = 12)
 save(kegg,go,file = "kegg_go_celltype.Rdata")
}


## 4.2 GO_KEGG富集分析 -------------------------------------------------------------library(clusterProfiler)library(org.Hs.eg.db)library(EnhancedVolcano)library(ggplot2)library(stringr) print(getwd())
# rich_list <- list()# dir.create("EnhancedVolcano")# dir.create("KEGG_enrichment")# dir.create("GO_enrichment")# dir.create("DEG")
fs = list.files(path = '../4_deg/by_celltype/',                recursive = T,full.names = T,                pattern = 'Rdata')fs
# lapply(fs, function(x){#   #x=fs[1]#   load(x)#   table(sce.markers$cluster)#   write.csv(sce.markers,file = gsub('.Rdata','.csv',x))# })



# 如果是两个分组，我们前面的deg就可以进行火山图等绘制 fs = list.files(path = '../4_deg/',                recursive = T,full.names = T,                pattern = 'sce.markers.Rdata')fs
if (F) {
  lapply(fs, function(x){    # x=fs[1]    load(x)    table(sce.markers$cluster)     need_DEG= sce.markers     need_DEG$avg_log2FC=need_DEG$avg_log2FC* ifelse(as.numeric( sce.markers$cluster )-1,1,-1)     colnames(need_DEG)
    logFC_cutoff=0.25    need_DEG$change = as.factor(ifelse(need_DEG$p_val< 0.05 & abs(need_DEG$avg_log2FC) > logFC_cutoff,                                       ifelse(need_DEG$avg_log2FC > logFC_cutoff ,'UP','DOWN'),'NOT')    )    table(  need_DEG$change )    need_DEG$symbol=row.names(need_DEG)

    # 差异基因火山图    res=need_DEG    head(res)    EnhancedVolcano(res,                    lab = res$symbol,                    x = 'avg_log2FC',                    y = 'p_val',pCutoff = 0.05,                    FCcutoff =logFC_cutoff )    ggsave(file = gsub('.Rdata','_Volcano.pdf',x))
  })

}print(getwd())setwd("../")print(getwd())

这一步的代码也保留了富集分析的所有数据，如果你想更个性化或者挑选富集分析结果自行展示：手动选择富集分析结果，真的不合理吗？

甚至你还可以选择目的term中的所有基因进行热图展示：单细胞|提取富集分析通路基因画个性化热图

第四步：展示不同组别中的细胞比例


#########################设置我自己的r包加载路径，通常你不需要运行这段代码-----.libPaths(c(  '/home/rootyll/seurat_v5/',  "/usr/local/lib/R/site-library", "/usr/lib/R/site-library", "/usr/lib/R/library"))###################################################################################333

# 1.加载R包 ----rm(list=ls()) library(ggplot2) library(cowplot) library(paletteer)  library(gplots)library(ggpubr)    library(ggsci) library(stringr)

##1 在主目录 ------------#setwd('../')getwd() 
dir.create("./4_group")setwd("./4_group/")
getwd() 
# 3.读取数据 ----load("../phe.Rdata");print(getwd())#sce.all.int = readRDS('../2-harmony/sce.all_int.rds')head(phe)table(phe$orig.ident)


#在meta信息中添加分组信息# sce.all.int@meta.data$group = ifelse(#   sce.all.int@meta.data$orig.ident %in% c( 'GSM4704931'),'control','ATAA'  )# phe=sce.all.int@meta.data;head(phe);print(getwd())phe$group=ifelse(  phe$orig.ident %in% c( 'GSM4704931'),'control','ATAA'  )
table(phe$group) 

# 4.可视化 ----## 4.1 每种细胞类型中，分组所占比例 ----library(tidyr)# 使用的gather & spreadlibrary(reshape2) # 使用的函数 melt & dcast library(dplyr)library(ggplot2)tb=table(phe$celltype,         phe$group)
tbstr(tb)balloonplot(tb)bar_data <- as.data.frame(tb);bar_data
bar_per <- bar_data %>%   group_by(Var1) %>%  mutate(sum(Freq)) %>% #分组求和  mutate(percent = Freq / `sum(Freq)`) #计算每种细胞类型在各种分组里所占比例head(bar_per) ;bar_perwrite.csv(bar_per,file = "celltype_by_group_percent.csv");print(getwd())

ggplot(bar_per, aes(x = Var1, y = percent)) +  geom_bar(aes(fill = Var2) , stat = "identity") + coord_flip() +  theme(axis.ticks = element_line(linetype = "blank"),        legend.position = "top",        panel.grid.minor = element_line(colour = NA,linetype = "blank"),         panel.background = element_rect(fill = NA),        plot.background = element_rect(colour = NA)) +  labs(y = "% Relative cell source", fill = NULL)+labs(x = NULL)+  scale_fill_d3()print(getwd())ggsave("celltype_by_group_percent.pdf",       units = "cm",width = 20,height = 12)
## 4.2 每种细胞类型中，各个样本所占比例 ----bar_data <- as.data.frame(table(phe$celltype,phe$orig.ident));bar_data
bar_per <- bar_data %>%   group_by(Var1) %>%  mutate(sum(Freq)) %>%  mutate(percent = Freq / `sum(Freq)`);bar_per
write.csv(bar_per,file = "celltype_by_orig.ident_percent.csv")ggplot(bar_per, aes(x = Var1, y = percent)) +  geom_bar(aes(fill = Var2) , stat = "identity") + coord_flip() +  theme(axis.ticks = element_line(linetype = "blank"),        legend.position = "top",        panel.grid.minor = element_line(colour = NA,linetype = "blank"),         panel.background = element_rect(fill = NA),        plot.background = element_rect(colour = NA)) +  labs(y = "% Relative cell source", fill = NULL)+labs(x = NULL) 
ggsave("celltype_by_orig.ident_percent.pdf",units = "cm",       width = 20,height = 12)
## 绘图函数，运行即可plot_percent <- function(x,y){  # x <- "group"  # y <- "celltype"  plot_data <- data.frame(table(phe[, x ],                                phe[, y ]))
  plot_data$Total <- apply(plot_data,1,function(x) sum( plot_data[plot_data$Var1 == x[1],3] ) )  plot_data <- plot_data %>% mutate(Percentage = round(Freq/Total,3) * 100)
  pro <- x  write.table(plot_data,paste0(pro,"_celltype_proportion.txt"),quote = F)  th=theme(axis.text.x = element_text(angle = 45,                                       vjust = 0.5, hjust=0.5))   library(paletteer)   color <- c(paletteer_d("awtools::bpalette"),paletteer_d("awtools::a_palette"),paletteer_d("awtools::mpalette"))  ratio1 <- ggplot(plot_data,aes(x = Var1,y = Percentage,fill = Var2)) +    geom_bar(stat = "identity",position = "stack") +    scale_fill_manual(values = color)+    theme_classic() +     theme(axis.title.x = element_blank()) + labs(fill = "Cluster") +th   ratio1  f=paste0('ratio_by_',x,'_VS_',y)  h=floor(5+length(unique(plot_data[,1]))/2)  w=floor(3+length(unique(plot_data[,2]))/2)
  ggsave(paste0('bar1_',f,'.pdf'),ratio1,height = h ,width = w )   pdf(paste0('balloonplot_',f,'.pdf'),height = 12 ,width = 20)  balloonplot(table(phe[, x ],                    phe[, y ]))  dev.off()
  plot_data$Total <- apply(plot_data,1,function(x)sum(plot_data[plot_data$Var1 == x[1],3]))  plot_data<- plot_data %>% mutate(Percentage = round(Freq/Total,3) * 100)  bar_Celltype=ggplot(plot_data,aes(x = Var1,y = Percentage,fill = Var2)) +    geom_bar(stat = "identity",position = "stack") +    theme_classic() +     theme(axis.text.x=element_text(angle=45,hjust = 1)) + labs(fill = "Cluster")+    facet_grid(~Var2,scales = "free")  bar_Celltype  ggsave(paste0('bar2_',f,'.pdf'),bar_Celltype,height = 8 ,width =  40) }
## 4.3 每个分组中，不同细胞类型所占比例 ----
plot_percent("group","celltype")
## 4.4 每个分组中，不同细胞类型所占比例 ----plot_percent("orig.ident","celltype")
## 4.5 每个细胞类型中，不同分组所占比例箱型图 ----colnames(phe)
lapply(c("group"), function(pro){
  tb <- data.frame(table(phe$celltype,phe$orig.ident,                         phe[,pro]))
  tb=tb[which(tb$Freq != 0),]  tb=tb[,c(1,3,4)]  head(tb) 
  tb$Total <- apply(tb,1,function(x)sum(tb[tb$Var1 == x[1],3]))  tb<- tb %>% mutate(Percentage = round(Freq/Total,3) * 100)  tb=tb[,c(1,2,5)]  tb$Var1=as.factor(tb$Var1)  tb$Var3=as.factor(tb$Var3)
  tb  write.table(tb,paste0(pro,"_celltype_proportion.txt"),quote = F)
  library(paletteer)   color <- c(paletteer_d("awtools::bpalette"),paletteer_d("awtools::a_palette"),paletteer_d("awtools::mpalette"))
  p <- ggplot(tb,aes(x=Var3,y=Percentage,fill=Var1,color=Var1))+    geom_boxplot(outlier.alpha=0 )+facet_grid(~Var1)+    geom_jitter(aes(x = Var3, y = Percentage ))+    stat_compare_means(aes(group = Var3),                        # label = "p.signif"                       label = "p.format",label.y = 50)+     scale_fill_igv(alpha = 0.6)+scale_color_igv(alpha = 0.8)+    theme_bw()+    theme(axis.text.x = element_text(face = "bold",angle = 45, hjust=1, vjust=0.5,size = 14),           legend.position= "none",          strip.background = element_rect(color="black",fill = "steelblue3", linetype = "solid"),          strip.text = element_text(face = "bold", color = "black",hjust = 0.5, size = 12,),          plot.title=element_text(face="bold.italic",size="20", color="brown",hjust = 0.5),          axis.text.y = element_text(face = "bold",size = 14) ,          axis.title = element_text(size = 14),          panel.grid.major = element_blank(),          panel.grid.minor = element_blank())+    labs(title = paste0("The proportion of  cell types in ",pro),         x = "")  # scale_color_d3(palette= "category20") #色卡 ，使用?scale_color_d3()查看  p    ggsave(paste0(pro,"_boxplot.tiff"),width = 24,height = 10)  ggsave(paste0(pro,"_boxplot.pdf"),width = 24,height = 10)})

## 4.6 每个cluster中，不同分组所占比例箱型图 ----colnames(phe)lapply(c("group"), function(pro){
  # phe$RNA_snn_res.0.8，注意要和分群注释保持一致  tb <- data.frame(table(phe$RNA_snn_res.0.8,phe$orig.ident,                         phe[,pro]))
  tb=tb[which(tb$Freq != 0),]  tb=tb[,c(1,3,4)]  head(tb) 
  tb$Total <- apply(tb,1,function(x)sum(tb[tb$Var1 == x[1],3]))  tb<- tb %>% mutate(Percentage = round(Freq/Total,3) * 100)  tb=tb[,c(1,2,5)]  tb$Var1=as.factor(tb$Var1)  tb$Var3=as.factor(tb$Var3)
  tb  write.table(tb,paste0(pro,"_celltype_proportion.txt"),quote = F)  p <- ggplot(tb,aes(x=Var3,y=Percentage,fill=Var1,color=Var1))+    geom_boxplot(outlier.alpha=0 )+facet_grid(~Var1)+    geom_jitter(aes(x = Var3, y = Percentage ))+    stat_compare_means(aes(group = Var3),                        # label = "p.signif"                       label = "p.format",label.y = 50)+     scale_fill_igv(alpha = 0.6)+scale_color_igv(alpha = 0.8)+    theme_bw()+    theme(axis.text.x = element_text(face = "bold",angle = 45, hjust=1, vjust=0.5,size = 14),           legend.position= "none",          strip.background = element_rect(color="black",fill = "steelblue3", linetype = "solid"),          strip.text = element_text(face = "bold", color = "black",hjust = 0.5, size = 12,),          plot.title=element_text(face="bold.italic",size="20", color="brown",hjust = 0.5),          axis.text.y = element_text(face = "bold",size = 14) ,          axis.title = element_text(size = 14),          panel.grid.major = element_blank(),          panel.grid.minor = element_blank())+    labs(title = paste0("The proportion of  cell types in ",pro),         x = "")  # scale_color_d3(palette= "category20") #色卡 ，使用?scale_color_d3()查看  p    ggsave(paste0(pro,"_CCA_snn_res.0.8-boxplot.tiff"),width = 24,height = 10)  ggsave(paste0(pro,"_CCA_snn_res.0.8-boxplot.pdf"),width = 24,height = 10)})print(getwd())
## 4.7 每个样本中各个细胞类型所占比例 ----tb <- as.data.frame(table(phe$celltype,                          phe$orig.ident,phe$group));tb
pro = 'group'dat<-tb %>%  group_by(Var2) %>%  mutate(sum(Freq)) %>%  mutate(Percentage = Freq / `sum(Freq)`);dathead(dat)dat <- dat[dat$Percentage != 0,]
tb=dat[,c(1,3,6)]head(tb) 
tb$Var1=as.factor(tb$Var1)tb$Var3=as.factor(tb$Var3)colnames(tb)tbwrite.csv(tb,paste0(pro,"_celltype_proportion.csv"),quote = F)
p <- ggplot(tb,aes(x=Var3,y=Percentage,fill=Var1,color=Var1))+  geom_boxplot(outlier.alpha=0 )+facet_grid(~Var1)+  geom_jitter(aes(x = Var3, y = Percentage ))+  stat_compare_means(aes(group = Var3),                      # label = "p.signif"                     label = "p.format" )+   scale_fill_igv(alpha = 0.6)+scale_color_igv(alpha = 0.8)+  theme_bw()+  theme(axis.text.x = element_text(face = "bold",angle = 45, hjust=1, vjust=0.5,size = 14),         legend.position= "none",        strip.background = element_rect(color="black",fill = "steelblue3", linetype = "solid"),        strip.text = element_text(face = "bold", color = "black",hjust = 0.5, size = 12,),        plot.title=element_text(face="bold.italic",size="20", color="brown",hjust = 0.5),        axis.text.y = element_text(face = "bold",size = 14) ,        axis.title = element_text(size = 14),        panel.grid.major = element_blank(),        panel.grid.minor = element_blank())+  labs(title = paste0("The proportion of  cell types in ",pro),       x = "")# scale_color_d3(palette= "category20") #色卡 ，使用?scale_color_d3()查看p  ggsave("celltype_orig_group.pdf",width = 24,height = 6,plot = p)
# 5.返回工作路径 ----setwd("../../")

注：这里的代码适合刚接触项目的熟手运用，会让你事半功倍。但是对于纯小白而言，更重要的是理解代码的每一步骤，不要盲目去运行这些代码。不然报错你都不知道咋回事。另外，目前此套代码的后半部分只适用于人，如果你的数据是其他物种需要自己调整一下富集分析所需要的数据库。

如有不妥当之处，欢迎批评指正。

——生信小博士

ixxmu / mp_duty