多分组单细胞测序数据第一层次未整合和整合分析对B细胞细分的分群有何影响？ by 单细胞天地

今年暑假一起学单细胞吧（附上游数据下载tips）

这个新专辑有以下几点希冀：
带着像我一样的单细胞小白，一步步利用我们生信技能树、生信菜鸟团、单细胞天地的资源，掌握基本的scRNAseq流程
在学习的过程中，探索出合适的学习路径，帮助大家更好地利用已有资源
对过往推文中出现的错误、更新的软件进行审查，推陈出新
在过去的基本内容上深入挖掘影响小白学习的障碍，提炼总结，拓宽深度宽度
和大家讨论我在从零开始学习过程中遇到的问题，老师们在评论区指出我的不足提出建议
而我在将自己的学习笔记排版成推文时也会遵循以下行文特点：
务必详实逐步复现，如展示原推文中没展示的过程结果，添加参考资料帮助理解
重点推陈出新，如果原推文足够详细且我没遇到其他问题，可能会直接带过这篇学习推文，只在推文中展示结果，但是仍会告诉大家我看了啥，以便梳理小白学习路径

这期学习这篇推文：多分组单细胞转录组测序样本第一层次未整合和整合数据的B细胞细分对比

一开始这篇推文的学习我是很有想法的，因为曾老师给我分享了一篇投稿，投稿中使用根据病人进行批次拆分单独处理后通过anchor进行integrate（CCA, 区别于直接merge）达到去除批次效应的目的，而不是像我们之前那样直接harmony

我打算拿这篇推文数据来进行研究：拆分批次单独处理后通过anchor进行integrate(CCA)和harmony的效果有什么区别

但随着研究的进行，我发现其实这个数据集其实并不需要去除批次效应，所以我们还是像原推文那样研究“多分组单细胞转录组测序样本第一层次未整合和整合数据的B细胞细分对比”，学习一下这个拆分、merge的操作

某个数据集需不需要去批次，什么时候去批次，去批次的影响，可以参考上一期推文：harmony、不harmony，这是个问题

不同sampletype看似存在差异，有免疫细胞、非免疫细胞、外周血白细胞，但实验设计批次上还是根据病人来的，几乎每个病人都有这三种sampletype，病人没批次效应，sampletype之间的差异应该就是生物学差异

所以这里我们并不根据病人批次走harmony，以免抹除真正的差异

原推文使用的是整理好的数据，我们这里从头开始

intro

多分组单细胞转录组测序样本第一层次未整合和整合数据的B细胞细分对比

GSE164690 GEO Accession viewer

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE164690

头颈部鳞状细胞癌（HNSCC）

免疫细胞（CD45+ ）；非免疫细胞（CD45-）；外周血白细胞（PBL）

51个样本，18例 treatment-naive patients （6 HPV+ and 12HPV– HNSCC 癌症病人），其中15例 CD45+, CD45-, PBL 为配对数据。

头颈部鳞状细胞癌（HNSCC）的特征是肿瘤微环境（TME）中基质细胞、上皮细胞和免疫细胞之间的复杂关系。为了开发更有效的治疗方法，我们旨在使用单细胞RNA测序（scRNAseq）研究6例人乳头瘤病毒（HPV）+和12例HPV-HNSCC患者肿瘤和匹配的外周血样本中抑制性非免疫和免疫细胞群的异质性、独特细胞群的特征以及细胞间相互作用。使用134606个细胞的数据集，我们显示了与炎症和HPV状态相关的细胞类型特异性特征，描述了在HPV+TME中具有弹性分化的成纤维细胞的负面预后价值，预测了治疗靶向的检查点受体-配体相互作用，并显示肿瘤相关巨噬细胞是TME中PD-L1和其他免疫检查点配体的主要贡献者。我们对形成HNSCC微环境的细胞内在机制和细胞间通讯提出了全面的单细胞观点。

对GSE164690数据集分别进行未整合和整合数据分析。
多分组未整合数据：CD45+ ，CD45-，PBL三组数据未整合分别进行降维分群，等进行B细胞细分的时候再merge到一块（第一层次分析数据由曾老师提供，在此感谢）。
多分组整合数据：CD45+ ，CD45-，PBL三组数据从一开始整合进行第一层次降维分群，再进行B细胞细分（该数据由齐兵提供，在此感谢）。
对曾老师的数据进行处理：首先进行了第一次B细胞细分，去除干扰亚群，而后又进行第二次B细胞细分（分辨率选用的0.8）。
齐兵的数据选用的分辨率也是0.8，其去除干扰亚群后没有再进行细分。
之后就直接用两组已对细胞亚群进行生物学命名的数据来进行对比。

数据下载

拿到ftp下载地址

ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE164nnn/GSE164690/suppl/GSE164690_RAW.tar

整理数据

参考：

初探单细胞下游
可以记住这个10X技术输出文件目录和格式，以后使用 Read10X函数读取Seurat对象时可以检查一下

使用Read10X函数读取，整理文件路径:

#!/bin/bash

# 遍历当前目录下的所有.gz文件
for file in *.gz
do
    # 提取文件名中的编号部分
    filename=$(basename "$file" .gz)
    gsm_number=$(echo "$filename" | awk -F "_" '{print $1}')
  
    # 创建对应的文件夹（如果不存在）
    mkdir -p "$gsm_number"
  
    # 移动文件到对应的文件夹中
    mv "$file" "$gsm_number"
done

将所有文件夹下这三个文件前缀全部去除:

#!/bin/bash

# 遍历文件夹
for folder in GSM*/
do
    # 进入文件夹
    cd "$folder"
  
    # 对三个文件进行重命名
    for file in *_barcodes.tsv.gz
    do
        mv "$file" "barcodes.tsv.gz"
    done

    for file in *_features.tsv.gz
    do
        mv "$file" "features.tsv.gz"
    done

    for file in *_matrix.mtx.gz
    do
        mv "$file" "matrix.mtx.gz"
    done
  
    # 返回上一级目录
    cd ..
done

获取sampletype：

$less GSE164690_series_matrix.txt.gz |grep "Sample_title"| awk -F "\"" '{ for (i=2; i<=NF; i+=2) printf("%s,", $i) }' | sed 's/,$/\n/' > sample_types.txt

library(Seurat)
library(tidyverse)

dir="./GSE164690_RAW/"
samples=list.files(dir)
samples %>% head()

sceList <- lapply(samples, function(pro){
  print(pro)
  sce <- CreateSeuratObject(counts = Read10X(file.path(dir,pro)),
                            project = gsub("^GSM[0-9]*_","",pro),
                            min.cells = 5,
                            min.features = 500)
  return(sce)
})
names(sceList)

发现GEO提供的一个样本的barcodes文件受损

去除该受损样本文件，继续：

rm(list = ls())
library(Seurat)
library(tidyverse)

dir="./GSE164690_RAW/"
samples=list.files(dir)
samples %>% head()

GSM5017045样本barcode文件受损，剔除HN10_PBL cells

# > which(samples=="GSM5017045")
# [1] 25
samples <- samples[-which(samples=="GSM5017045")]

18例 treatment-naive patients （6 HPV+ and 12HPV– HNSCC 癌症病人）免疫细胞（CD45+ ）；非免疫细胞（CD45-）；外周血白细胞（PBL）

sceList <- lapply(samples, function(pro){
  print(pro)
  sce <- CreateSeuratObject(counts = Read10X(file.path(dir,pro)),
                            project = gsub("^GSM[0-9]*_","",pro),
                            min.cells = 5,
                            min.features = 500)
  return(sce)
})
names(sceList)

一开始就merge的

sce.all=merge(x=sceList[[1]],
              y=sceList[ -1 ],
              add.cell.ids =  gsub('^GSM[0-9]*_','',samples))
sce.all.bp <- sce.all

本文根据CD45+ ，CD45-，PBL三组sample_type数据拆分而不是参考文（拆分批次单独处理后通过anchor进行integrate）中的每个病人批次

按照sampletype拆分：

sample_types <- read.table("sample_types.txt",header = FALSE,sep = ",",quote = "TRUE")
sampletype <- t(sample_types[1,][-25])
sampletype <- ifelse(grepl("CD45\\+",sampletype),"CD45+",
                     ifelse(grepl("CD45\\-",sampletype),"CD45-","PBL"))
sampletype
table(sampletype)
# CD45- CD45+   PBL 
#    15    18    17 
sampletype_df <- data.frame(sample_GSM = levels(sce.all@active.ident),
                         sampletype = sampletype)

# sce.all <- sce.all.bp
sce.all@meta.data$sampletype <- "NA"
for (i in 1:nrow(sampletype_df)){
  sce.all@meta.data[which(sce.all@meta.data$orig.ident==sampletype_df$sample_GSM[i]), "sampletype"] <- sampletype_df$sampletype[i]
}
table(sce.all$sampletype)

sce.list <- SplitObject(sce.all, split.by = "sampletype")

若根据病人拆分：

patient_df <- data.frame(sample_GSM = levels(sce.all@active.ident),
                         patient = c(rep("01",3),rep(c("02","03","04"),each=2),
                                     rep(c("05","06","07","08","09","10","11","12","13","14","15","16","17","18"),each=3))[-25]
                         )

sce.all@meta.data$patientid <- "NA"
for (i in 1:nrow(patient_df)){
  sce.all@meta.data[which(sce.all@meta.data$orig.ident==patient_df$sample_GSM[i]), "patientid"] <- patient_df$patient[i]
}
table(sce.all$patientid)

sce.patient_list <- SplitObject(sce.all, split.by = "patientid")

整合pipeline

数据质控

sce.all@meta.data$project <- "HNSCC"
sce.all <- SetIdent(sce.all, value = "project")
## 计算细胞中线粒体基因比例
sce.all[["percent.mt"]] <- PercentageFeatureSet(sce.all, pattern = "^MT-")
## 使用小提琴图可视化QC指标
VlnPlot(sce.all, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size = 0)
# 点太多挡住小提琴图 pt.size 0

## FeatureScatter于可视化特征-特征关系
plot1 <- FeatureScatter(sce.all, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(sce.all, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

## 过滤
## 官方推荐过滤掉独特特征计数超过2500或小于200的细胞，或者过滤掉超过5%线粒体基因比例的细胞
sce.all.fit <- subset(sce.all, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
sce.all.fit
sce.all

数据标准化

sce.all.fit <- NormalizeData(sce.all.fit, normalization.method = "LogNormalize", scale.factor = 1e4)

识别高变基因

sce.all.fit <- FindVariableFeatures(sce.all.fit, selection.method = "vst", nfeatures = 2000) #返回两千个高变基因

## 提取前10的高变基因
top10 <- head(VariableFeatures(sce.all.fit), 10)
top10

## 展示高变基因
plot1 <- VariableFeaturePlot(sce.all.fit)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2

数据归一化

#######数据归一化##########
sce.all.fit <- ScaleData(sce.all.fit, vars.to.regress = "percent.mt")

这一步比较消耗计算资源，几次在共享服务器上做喜提warning邮件

降维

sce.all.fit <- RunPCA(sce.all.fit, features = VariableFeatures(object = sce.all.fit)) ##默认会输出5个主成分
sce.all.fit <- FindNeighbors(sce.all.fit, dims = 1:10)
sce.all.fit <- FindClusters(sce.all.fit, resolution = 0.8)  # 和原推文一致
## 查看前5分析细胞聚类数ID
head(Idents(sce.all.fit), 5)


## 查看每个类别多少个细胞
table(sce.all.fit@meta.data$seurat_clusters)

UMAP/tSNE可视化

sce.all.fit <- RunUMAP(sce.all.fit, dims = 1:10)  # Which dimensions to use as input features, used only if features is NULL
p1 <- DimPlot(sce.all.fit, reduction = "umap", label = T, label.size = 5)
sce.all.fit <- RunTSNE(sce.all.fit, dims = 1:10)
p2 <- DimPlot(sce.all.fit, reduction = "tsne", label = T, label.size = 5)
p1+p2

初步marker鉴定细胞

查看marker表达情况

参考：

多分组单细胞转录组测序样本第一层次未整合和整合数据的B细胞细分对比

B细胞细分亚群

仅仅是区分了 B细胞和 Plasma细胞。
现在我们就从文献《Single-cell landscape and clinical outcomes of infiltrating B cells in colorectal cancer》来说一下B细胞的细分亚群。它是一个单细胞数据挖掘文章，主要是关心结直肠癌的肿瘤免疫微环境里面的B细胞，两个单细胞数据集：
scRNA profiles of 63,689 cells from 25 CRC tumor and 10 peritumor tissue samples
..........
首先 B cells (form marker gene: CD79A) 可以细分成为：
CD20+ B cells
CD138+ plasma cells
然后第一个CD20+ B cells 又是可以细分成为：
naïve B cells (CD20+, CD27−, and CD38−), 主要的基因是 IGHD, FCER2, TCL1A, and IL4R,
memory B cells (CD20+, CD27+, and CD38–), 主要的基因是 CD27, AIM2, TNFRSF13B
germinal center (GC) B cells (CD20+, CD27+, CD38+, and CD138−),主要的基因是S1PI2, LRMP, SUGCT, MME, MKI67, and AICDA
而第二个 CD138+ plasma cells 可以细分成为：
IgA plasma
IgG plasma cells

DotPlot(sce.all.fit, features = c("CD3D","CD3E","CD8A",  # Tcells
  "CD19","CD79A","MS4A1",  # Bcells
  "IGHG1","MZB1","SDC1",  # Plasma cells
  "CD68","CD163","CD14",  # Monocytes and macrophages
  "FGFBP2","FCG3RA","CX3CR1",  # NK cells
  "EPCAM"  # epi or tumor
  ))+RotatedAxis()

T 0,1,2,3,5,7,8,10,19

B 6, 22

Plasma 12, 15, 20,21,23

Mono 4,11,17,18

NK 9

Unknown 13, 14, 16, 24

celltype=data.frame(ClusterID=0:24 ,
                      celltype= 0:24) 
#定义细胞亚群  
celltype[celltype$ClusterID %in% c( 0,1,2,3,5,7,8,10,19 ),2]='Tcells'   
celltype[celltype$ClusterID %in% c( 6, 22 ),2]='Bcells' 
celltype[celltype$ClusterID %in% c( 12, 15, 20,21,23 ),2]='plasma'   
celltype[celltype$ClusterID %in% c( 4,11,17,18 ),2]='Mono'   
celltype[celltype$ClusterID %in% c( 9 ),2]='NK'  
celltype[celltype$ClusterID %in% c( 13, 14, 16, 24 ),2]='Unknown' 
 
  
head(celltype)
celltype
table(celltype$celltype)

sce.all.fit@meta.data$celltype = "NA"
  
for(i in 1:nrow(celltype)){
  sce.all.fit@meta.data[which(sce.all.fit@meta.data$RNA_snn_res.0.8 == celltype$ClusterID[i]),'celltype'] <- celltype$celltype[i]}
Idents(sce.all.fit)=sce.all.fit$celltype
  
sel.clust = "celltype"
sce.all.fit <- SetIdent(sce.all.fit, value = sel.clust)
table(sce.all.fit@active.ident)

gplots::balloonplot(table(sce.all.fit$RNA_snn_res.0.8,sce.all.fit$celltype))

DimPlot(sce.all.fit, reduction = "umap", label = TRUE, pt.size = 0.5)+NoLegend()

重新降维和细分亚群Bcells

# 提取指定Bcells
sce.all.fit
sce.Bcells=sce.all.fit[,sce.all.fit$celltype=='Bcells']

# 重新降维分组
sce.Bcells=CreateSeuratObject(counts = sce.Bcells@assays$RNA@counts,
                       meta.data = sce.Bcells@meta.data[,1:7])
sce.Bcells <- NormalizeData(sce.Bcells, normalization.method =  "LogNormalize",  
                     scale.factor = 1e4)
GetAssay(sce.Bcells,assay = "RNA")
sce.Bcells <- FindVariableFeatures(sce.Bcells, 
                            selection.method = "vst", nfeatures = 2000)  
sce.Bcells <- ScaleData(sce.Bcells) 
sce.Bcells <- RunPCA(object = sce.Bcells, pc.genes = VariableFeatures(sce.Bcells)) 
DimHeatmap(sce.Bcells, dims = 1:12, cells = 100, balanced = TRUE)
ElbowPlot(sce.Bcells)

sce.Bcells <- FindNeighbors(sce.Bcells, dims = 1:15)  # 输入维度
sce.Bcells <- FindClusters(sce.Bcells, resolution = 0.8)
table(sce.Bcells@meta.data$RNA_snn_res.0.8)  

set.seed(123)
sce.Bcells <- RunUMAP(object = sce.Bcells, dims = 1:15, do.fast = TRUE)
DimPlot(sce.Bcells,reduction = "umap",label=T)

sce.Bcells <- SetIdent(sce.Bcells, value = "seurat_clusters")
DotPlot(sce.Bcells, features = c("IGHD","FCER2","TCL1A","IL4R",  # naive
                                 "CD27",
                                 "AIM2","TNFRSF13B",  # memory
                                 "S1PI2","LRMP","SUGCT","MME","MKI67","AICDA",  # GC
                                 "IGHG1",  # IgG plasma cells
                                 "IGHA1"  # IgA plasma
                                 ))

不光看marker表达情况，还看降维可视化（plasma和memory分不开）：

naive 1,2,9

memory 10,11,14,16

GC 7,13,15,17

IgG plasma、IgA plasma 感觉分不开，先看plasma，再往下看

plasma CD27除memory 0,3,4,5,6,8,12

IgG plasma、IgA plasma联系降维可视化也分不开算了

sce.Bcells.bp <- sce.Bcells
celltype=data.frame(ClusterID=0:17 ,
                      celltype= 0:17) 
#定义细胞亚群  
celltype[celltype$ClusterID %in% c( 1,2,9 ),2]='naive'   
celltype[celltype$ClusterID %in% c( 10,11,14,16 ),2]='memory' 
celltype[celltype$ClusterID %in% c( 0,3,4,5,6,8,12 ),2]='plasma'   
celltype[celltype$ClusterID %in% c( 7,13,15,17 ),2]='GC'   
 
  
head(celltype)
celltype
table(celltype$celltype)

sce.Bcells@meta.data$celltype = "NA"
  
for(i in 1:nrow(celltype)){
  sce.Bcells@meta.data[which(sce.Bcells@meta.data$RNA_snn_res.0.8 == celltype$ClusterID[i]),'celltype'] <- celltype$celltype[i]}
Idents(sce.Bcells)=sce.Bcells$celltype
  
sel.clust = "celltype"
sce.Bcells <- SetIdent(sce.Bcells, value = sel.clust)
table(sce.Bcells@active.ident)

gplots::balloonplot(table(sce.Bcells$RNA_snn_res.0.8,sce.Bcells$celltype))

DimPlot(sce.Bcells, reduction = "umap", label = TRUE, pt.size = 0.5)+NoLegend()

先拆分后在细分B细胞时merge合并pipeline

前面的流程整合到函数里：

pipeline <- function(sce.obj){
  sce.obj
  sce.obj@meta.data$project <- "HNSCC"
  sce.obj <- SetIdent(sce.obj, value = "project")
  ## 计算细胞中线粒体基因比例
  sce.obj[["percent.mt"]] <- PercentageFeatureSet(sce.obj, pattern = "^MT-")
  ## 过滤
  ## 官方推荐过滤掉独特特征计数超过2500或小于200的细胞，或者过滤掉超过5%线粒体基因比例的细胞
  sce.obj.fit <- subset(sce.obj, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
  ## 数据标准化
  sce.obj.fit <- NormalizeData(sce.obj.fit, normalization.method = "LogNormalize", scale.factor = 1e4)
  ## 识别高变基因
  sce.obj.fit <- FindVariableFeatures(sce.obj.fit, selection.method = "vst", nfeatures = 2000) #返回两千个高变基因
  #######数据归一化##########
  sce.obj.fit <- ScaleData(sce.obj.fit, vars.to.regress = "percent.mt")
  ## 降维
  sce.obj.fit <- RunPCA(sce.obj.fit, features = VariableFeatures(object = sce.obj.fit)) ##默认会输出5个主成分
  sce.obj.fit <- FindNeighbors(sce.obj.fit, dims = 1:10)
  sce.obj.fit <- FindClusters(sce.obj.fit, resolution = 0.8)  # 和原推文一致
  ## UMAP/tSNE可视化
  sce.obj.fit <- RunUMAP(sce.obj.fit, dims = 1:10)  # Which dimensions to use as input features, used only if features is NULL
  p1 <- DimPlot(sce.obj.fit, reduction = "umap", label = T, label.size = 5)
  sce.obj.fit <- RunTSNE(sce.obj.fit, dims = 1:10)
  p2 <- DimPlot(sce.obj.fit, reduction = "tsne", label = T, label.size = 5)
  p1+p2
  return(sce.obj.fit)
}

clustered.sce.list <- lapply(sce.list, pipeline)

初步分型

定义函数：

dimplot <- function(sce.obj.fit){
  sce.obj.fit
  ## UMAP/tSNE可视化
  # sce.obj.fit <- RunUMAP(sce.obj.fit, dims = 1:10)  # Which dimensions to use as input features, used only if features is NULL
  p1 <- DimPlot(sce.obj.fit, reduction = "umap", label = T, label.size = 5)
  # sce.obj.fit <- RunTSNE(sce.obj.fit, dims = 1:10)
  p2 <- DimPlot(sce.obj.fit, reduction = "tsne", label = T, label.size = 5)
  return(p1+p2)
}

lapply(clustered.sce.list, dimplot)

dotplot <- function(sce.obj){
  dp <- DotPlot(sce.obj, features = c("CD3D","CD3E","CD8A",  # Tcells
  "CD19","CD79A","MS4A1",  # Bcells
  "IGHG1","MZB1","SDC1",  # Plasma cells
  "CD68","CD163","CD14",  # Monocytes and macrophages
  "FGFBP2","FCG3RA","CX3CR1",  # NK cells
  "EPCAM"  # epi or tumor
  ))+RotatedAxis()
  return(dp)
}
lapply(clustered.sce.list, dotplot)

细胞亚型注释函数

celltype_anno <- function(celltype, sce.obj){
  sce.obj@meta.data$celltype = "NA"
    
  for(i in 1:nrow(celltype)){
    sce.obj@meta.data[which(sce.obj@meta.data$RNA_snn_res.0.8 == celltype$ClusterID[i]),'celltype'] <- celltype$celltype[i]}
  Idents(sce.obj)=sce.obj$celltype
    
  sel.clust = "celltype"
  sce.obj <- SetIdent(sce.obj, value = sel.clust)
  table(sce.obj@active.ident) 
  
  balloonPlot <- gplots::balloonplot(table(sce.obj$RNA_snn_res.0.8,sce.obj$celltype))
  
  dimplot <- DimPlot(sce.obj, reduction = "umap", label = TRUE, pt.size = 0.5)+NoLegend()
  
  balloonPlot
  dimplot
}

开始注释：

lapply(clustered.sce.list, dimplot)
lapply(clustered.sce.list, dotplot)

PBL

# sce.PBL <- clustered.sce.list$PBL
celltype=data.frame(ClusterID=0:16 ,
                      celltype= 0:16) 
#定义细胞亚群  
celltype[celltype$ClusterID %in% c( 1,2,4,5,7,10,12 ),2]='Tcells'   
celltype[celltype$ClusterID %in% c( 9 ),2]='Bcells' 
celltype[celltype$ClusterID %in% c( 15,16 ),2]='plasma'   
celltype[celltype$ClusterID %in% c( 0,8,11,13,14 ),2]='Mono'   
celltype[celltype$ClusterID %in% c( 3,6 ),2]='NK'  
  
head(celltype)
celltype
table(celltype$celltype)

celltype_anno(celltype,clustered.sce.list$PBL)

T 1,2,4,5,7,10,12

B 9

Plasma 15,16

Mono 0,8,11,13,14

NK 3,6

Unknown

CD45+

celltype=data.frame(ClusterID=0:19 ,
                      celltype= 0:19) 
#定义细胞亚群  
celltype[celltype$ClusterID %in% c( 0,1,2,3,5,6,8,9,10,11 ),2]='Tcells'   
celltype[celltype$ClusterID %in% c( 4,18),2]='Bcells' 
celltype[celltype$ClusterID %in% c( 14,15 ),2]='plasma'   
celltype[celltype$ClusterID %in% c( 7,12,16,17,19 ),2]='Mono'   
celltype[celltype$ClusterID %in% c( 13 ),2]='NK'  
  
head(celltype)
celltype
table(celltype$celltype)

celltype_anno(celltype,clustered.sce.list$`CD45+`)

T 0,1,2,3,5,6,8,9,10,11

B 4,18

Plasma 14,15

Mono 7,12,16,17,19

NK 13

Unknown

CD45-

celltype=data.frame(ClusterID=0:19 ,
                      celltype= 0:19) 
#定义细胞亚群  
celltype[celltype$ClusterID %in% c(4,5,6,7 ),2]='Tcells'   
celltype[celltype$ClusterID %in% c( 8),2]='Bcells' 
celltype[celltype$ClusterID %in% c( 1,2,10,11,12,14,16 ),2]='plasma'   
celltype[celltype$ClusterID %in% c(9,17,18),2]='Mono'   
celltype[celltype$ClusterID %in% c( 3,15,19 ),2]='Unknown'  
  
head(celltype)
celltype
table(celltype$celltype)

celltype_anno(celltype,clustered.sce.list$`CD45-`)

T 4,5,6,7

B 8

Plasma 1,2,10,11,12,14,16

Mono 9,17,18

Unknown 3,15,19

个人感觉：拆分后初步判断亚型比整合一起好区分，Unknown少

将3sampletype分组各B细胞合并

sce.Bcells=sce.all.fit[,sce.all.fit$celltype=='Bcells']

sce.Bcells1 <- clustered.sce.list$PBL[,clustered.sce.list$PBL$seurat_clusters == 9]
sce.Bcells2 <- clustered.sce.list$`CD45+`[,clustered.sce.list$`CD45+`$seurat_clusters %in% c(4,18)]
sce.Bcells3 <- clustered.sce.list$`CD45-`[,clustered.sce.list$`CD45-`$seurat_clusters == 8]

sce.Bcells.all <- merge(x=sce.Bcells1,
              y=list(sce.Bcells2,sce.Bcells3),
              # add.cell.ids =  gsub('^GSM[0-9]*_','',samples),
              add.cell.ids =  c("sce.Bcells1", "sce.Bcells2", "sce.Bcells3"))

sce.Bcells.all=CreateSeuratObject(counts = sce.Bcells.all@assays$RNA@counts,
                       meta.data = sce.Bcells@meta.data[,1:7])

这一步merge时如果不add.cell.ids会报错：Error in merge.Seurat(x = sce.Bcells1, y = list(sce.Bcells2, sce.Bcells3), :Please provide a cell identifier for each object provided to merge

此外，还需要注意：

参考

Seurat对象内部结构

所以需要把counts提出来再重新创建一个Seurat对象：

sce.Bcells.all=CreateSeuratObject(counts = sce.Bcells.all@assays$RNA@counts,
                       meta.data = sce.Bcells@meta.data[,1:7])

Bcells细分

sce.Bcells.all <- NormalizeData(sce.Bcells.all, normalization.method =  "LogNormalize",  
                     scale.factor = 1e4)
GetAssay(sce.Bcells.all,assay = "RNA")
sce.Bcells.all <- FindVariableFeatures(sce.Bcells.all, 
                            selection.method = "vst", nfeatures = 2000)  
sce.Bcells.all <- ScaleData(sce.Bcells.all) 
sce.Bcells.all <- RunPCA(object = sce.Bcells.all, pc.genes = VariableFeatures(sce.Bcells.all)) 
DimHeatmap(sce.Bcells.all, dims = 1:12, cells = 100, balanced = TRUE)
ElbowPlot(sce.Bcells.all)

sce.Bcells.all <- FindNeighbors(sce.Bcells.all, dims = 1:15)  # 输入维度
sce.Bcells.all <- FindClusters(sce.Bcells.all, resolution = 0.8)
table(sce.Bcells.all@meta.data$RNA_snn_res.0.8)  

set.seed(123)
sce.Bcells.all <- RunUMAP(object = sce.Bcells.all, dims = 1:15, do.fast = TRUE)
DimPlot(sce.Bcells.all,reduction = "umap",label=T)

sce.Bcells.all <- SetIdent(sce.Bcells.all, value = "seurat_clusters")
DotPlot(sce.Bcells.all, features = c("IGHD","FCER2","TCL1A","IL4R",  # naive
                                 "CD27",
                                 "AIM2","TNFRSF13B",  # memory
                                 "S1PI2","LRMP","SUGCT","MME","MKI67","AICDA",  # GC
                                 "IGHG1",  # IgG plasma cells
                                 "IGHA1"  # IgA plasma
                                 ))

不光看marker表达情况，还看降维可视化（plasma和memory分不开）：

naive 1,2,8

memory 0,3,4,5,6,10,

GC 9,12,14,17

plasma CD27除memory 7,11,13,15,16,18

(7因为15很像特别是纠结的CD27所以给plasma)

sce.Bcells.all.bp <- sce.Bcells.all
celltype=data.frame(ClusterID=0:18 ,
                      celltype= 0:18) 
#定义细胞亚群  
celltype[celltype$ClusterID %in% c( 1,2,8 ),2]='naive'   
celltype[celltype$ClusterID %in% c( 0,3,4,5,6,10 ),2]='memory' 
celltype[celltype$ClusterID %in% c( 7,11,13,15,16,18 ),2]='plasma'   
celltype[celltype$ClusterID %in% c( 9,12,14,17 ),2]='GC'   
 
  
head(celltype)
celltype
table(celltype$celltype)

sce.Bcells.all@meta.data$celltype = "NA"
  
for(i in 1:nrow(celltype)){
  sce.Bcells.all@meta.data[which(sce.Bcells.all@meta.data$RNA_snn_res.0.8 == celltype$ClusterID[i]),'celltype'] <- celltype$celltype[i]}
Idents(sce.Bcells.all)=sce.Bcells.all$celltype
  
sel.clust = "celltype"
sce.Bcells.all <- SetIdent(sce.Bcells.all, value = sel.clust)
table(sce.Bcells.all@active.ident)

gplots::balloonplot(table(sce.Bcells.all$RNA_snn_res.0.8,sce.Bcells.all$celltype))

DimPlot(sce.Bcells.all, reduction = "umap", label = TRUE, pt.size = 0.5)+NoLegend()

两套pipelines B细胞划分结果列联分析

sce.Bcells.all$celltype %>% head()
sce.Bcells$celltype %>% head()

修改names

ls <- sce.Bcells.all$celltype %>% names() %>% str_split_fixed("_",3)
tmp <- sce.Bcells.all$celltype
names(tmp) <- paste(ls[,2],ls[,3],sep = "_")

tmp %>% head()
sce.Bcells$celltype %>% head()

keep <- intersect(names(tmp), names(sce.Bcells$celltype))
gplots::balloonplot(table(tmp[keep], sce.Bcells$celltype[keep]))

可以发现我这里plasma和memory在拆分前后存在非常大变化，基本上就是exchange了。。。其它的还行

这跟我选择的marker和自定义分组也有关，这两个在亚型定义的时候就不是很好区分（我的技术也不好，肉眼看这个我目前感觉有点反人类，后面我了解到一些辅助确定亚群名称的方法，如AUcell、MACA、scGate【flag】）

原推文小韩师姐的结果就没这么明显的exchange：

因此，来回答开头提出的问题，从该组数据对比来看，多分组单细胞测序数据第一层次未整合和整合分析对B细胞细分的分群基本无影响。

这里因为原推文相当于就拿两种方法处理好的结果可视化看看列联表，具体参数和使用marker未知。

往期回顾

2013-2023年度技术十年 | 单细胞的群星闪耀时

Libra：单细胞差异分析算法的全家桶

harmony、不harmony，这是个问题

NK细胞的单细胞层面细分亚群

被忽视的真相 | 细胞亚群内互作

如果你对单细胞转录组研究感兴趣，但又不知道如何入门，也许你可以关注一下下面的课程

看完记得顺手点个“在看”哦！

生物 | 单细胞 | 转录组丨资料

每天都精彩

长按扫码可关注

ixxmu / mp_duty

多分组单细胞测序数据第一层次未整合和整合分析对B细胞细分的分群有何影响？ #3783