What to do when a single sample yields nearly 20,000 sequenced single cells

by 单细胞天地

As is well known, 10x recommends recovering 5-8K cells per sample, and the 10x website explains why with the following table:

Multiplet Rate (%)	# of Cells Loaded	# of Cells Recovered
0.40%	800	500
0.80%	1,600	1,000
1.60%	3,200	2,000
2.30%	4,800	3,000
3.10%	6,400	4,000
3.90%	8,000	5,000
4.60%	9,600	6,000
5.40%	11,200	7,000
6.10%	12,800	8,000
6.90%	14,400	9,000
7.60%	16,000	10,000
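The table is close to linear, at roughly 0.8% multiplets per 1,000 cells recovered. As a rough sketch in base R, one can fit a line to these numbers and extrapolate to a sample of about 20,000 cells. This assumes the linear trend keeps holding outside the range of the table, which is my assumption and not something the 10x table states; the names `tenx`, `fit`, and `predicted_pct` are made up for the illustration.

```r
# Multiplet-rate table from above (cells recovered vs. multiplet rate in %)
tenx <- data.frame(
  recovered     = c(500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000),
  multiplet_pct = c(0.4, 0.8, 1.6, 2.3, 3.1, 3.9, 4.6, 5.4, 6.1, 6.9, 7.6)
)

# Simple linear fit: multiplet rate as a function of cells recovered
fit <- lm(multiplet_pct ~ recovered, data = tenx)

# Extrapolate beyond the table (assumption: the trend stays linear)
predicted_pct <- unname(predict(fit, newdata = data.frame(recovered = 19663)))
round(predicted_pct, 1)
```

Under that assumption the fit lands at about 15% multiplets for roughly 20,000 recovered cells, about twice the rate listed for 10,000 cells.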

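In theory, the trouble with capturing too many cells is a higher doublet rate. In practice, the other metrics often all look poor as well, for example: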

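Estimated Number of Cells: the estimated number of high-quality cells detected
Fraction Reads in Cells: the percentage of reads that fall in high-quality cells
Mean Reads per Cell: the mean number of reads per high-quality cell
Median Genes per Cell: the median number of genes per high-quality cell
Total Genes Detected: the total number of genes detected across all cells
Median UMI Counts per Cell: the median number of UMIs per high-quality cell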

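Still, people nowadays like to ask the sequencing company to capture a few more cells, so around 10,000 cells per sample is just about acceptable. The real worry is when something goes wrong at the bench and 20,000 or more single cells end up sequenced; then you have a problem.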

I came across the paper "Single-Cell RNA Sequencing of Peripheral Blood Mononuclear Cells From Pediatric Coeliac Disease Patients Suggests Potential Pre-Seroconversion Markers", which also sequenced nearly 20,000 single cells from a single sample: In total, 19,663 single cells were profiled.

So a strict quality-control step becomes critical, as shown below:

After quality control by filtering based on possible doublets, the number of genes expressed (included cells with >200 and <3,000 genes) and low quality cells (included cells with <15% mitochondrial transcript reads)(Supplementary Figure 1), we retained 9,559 cells for subsequent analyses.

Going from nearly 20,000 single cells to just under 10,000 after filtering is a good outcome. Supplementary Figure 1 is shown below:

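As you can see, the authors' filtering parameters are not strict at all; they are entirely routine. Most of the filtering effect comes from requiring more than 200 detected genes per cell, which is about as ordinary as it gets. Nearly everyone who loads 10x single-cell transcriptome data sets this filter as a matter of course (min.features = 200; this is the value used in the Seurat tutorial, whereas the built-in default of CreateSeuratObject is 0). The code looks like this: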

library(Seurat)
# https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
## Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")

## Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", 
                           min.cells = 3, min.features = 200)
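The call above applies only the 200-gene lower bound. The other thresholds quoted from the paper (fewer than 3,000 genes, less than 15% mitochondrial reads) could be sketched as below. This is a minimal illustration on toy per-cell metadata, not the authors' actual code: the helper `qc_keep` and the toy values are hypothetical, the column names follow Seurat's usual `nFeature_RNA` and `percent.mt` conventions, and the doublet-removal step mentioned in the paper is not covered.

```r
# Toy per-cell metadata. With Seurat this would be pbmc@meta.data after
# pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
# (assumption: human mitochondrial genes carry the "MT-" prefix)
meta <- data.frame(
  nFeature_RNA = c(150, 850, 2500, 3200, 1200),
  percent.mt   = c(3.0, 4.5, 20.0, 5.0, 14.9),
  row.names    = paste0("cell", 1:5)
)

# Thresholds quoted from the paper: >200 and <3,000 genes, <15% mitochondrial reads
qc_keep <- function(meta, min_genes = 200, max_genes = 3000, max_mt = 15) {
  meta$nFeature_RNA > min_genes &
    meta$nFeature_RNA < max_genes &
    meta$percent.mt < max_mt
}

keep <- qc_keep(meta)
kept_cells <- rownames(meta)[keep]
kept_cells

# Seurat equivalent (not run here):
# pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 3000 & percent.mt < 15)
```

Keeping the thresholds as function arguments makes it easy to see how many cells each cutoff removes by varying one at a time.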

So, are you still worried about the quality of your single-cell data?

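In addition, I processed the expression matrix provided in the supplementary materials of this paper, "Single-Cell RNA Sequencing of Peripheral Blood Mononuclear Cells From Pediatric Coeliac Disease Patients Suggests Potential Pre-Seroconversion Markers". The quality does have some minor issues, but dimensionality reduction, clustering, and biological annotation all work without much trouble: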

[Figure: dimensionality reduction, clustering, and biological annotation]

The different immune cell populations are easy to tell apart:

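# Define cell subpopulations
# Assumption: celltype is a lookup table with one row per cluster (0-15);
# it was not defined in the original snippet, so it is created here
celltype <- data.frame(ClusterID = 0:15, celltype = 'unknown', stringsAsFactors = FALSE)
celltype[celltype$ClusterID %in% c(7,8,12,15),2]='Myeloids'
celltype[celltype$ClusterID %in% c(0,1,2,9,10,11),2]='CD4'
celltype[celltype$ClusterID %in% c(4,5),2]='CD8'
celltype[celltype$ClusterID %in% c(3,6,14),2]='Bcells'
celltype[celltype$ClusterID %in% c(13),2]='plasma'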
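To carry these cluster-level labels over to individual cells, one option is a named lookup vector. The sketch below uses the same cluster assignments as above; `cluster_map`, `labels_by_cluster`, and the example vector `cell_clusters` are hypothetical names, and in a Seurat workflow the per-cell cluster calls would typically come from the `seurat_clusters` metadata column.

```r
# Same cluster-to-label assignments as above, written as a named list
cluster_map <- list(
  Myeloids = c(7, 8, 12, 15),
  CD4      = c(0, 1, 2, 9, 10, 11),
  CD8      = c(4, 5),
  Bcells   = c(3, 6, 14),
  plasma   = 13
)

# Flatten into a lookup vector: names are cluster IDs, values are labels
labels_by_cluster <- setNames(
  rep(names(cluster_map), lengths(cluster_map)),
  unlist(cluster_map, use.names = FALSE)
)

# Hypothetical per-cell cluster calls for six cells
cell_clusters <- c(0, 7, 13, 4, 3, 15)

# Look up the label of each cell by its cluster ID
cell_labels <- unname(labels_by_cluster[as.character(cell_clusters)])
table(cell_labels)
```

Any cluster ID missing from the map comes back as NA, which makes unannotated clusters easy to spot.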