Closed ixxmu closed 3 years ago
主要是针对单细胞转录组测序数据开发的,用来找不同细胞类型或者不同细胞状态的差异表达基因。分析起始是表达矩阵,作者推荐用比较老旧的Tophat+Cufflinks流程,或者RSEM, eXpress,Sailfish,等等。需要的是基于转录本的表达矩阵,我一般用subjunc+featureCounts 来获取表达矩阵。
由Cole Trapnell 于2014年在Nature Biotechnology 杂志发表,是一个略微复杂的R包,并给出了一个测试数据 ,下载地址是:
Source code |
---|
HSMM expression data |
安装方法是:
install.packages(c("VGAM", "irlba", "matrixStats", "igraph",
"combinat", "fastICA", "grid", "ggplot2",
"reshape2", "plyr", "parallel", "methods"))
$ R CMD INSTALL HSMMSingleCell_0.99.0.tar.gz
$ R CMD INSTALL monocle_0.99.0.tar.gz
source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite("monocle")
library(monocle)
这一版的教程有点过时了,还用的是tophat+cufflinks组合来计算表达量, 就不过多介绍了。
在nature methods杂志发表的文章,更新为monocle2版本并且更换了主页,功能也不仅仅是差异分析那么简单。还包括pseudotime,clustering分析,而且还可以进行基于转录本的差异分析,其算法是BEAM (used in branch analysis) and Census (the core of relative2abs),也单独发表了文章。
用了4个公共的数据来测试说明其软件的用法和优点。
the HSMM data set, GSE52529 (ref. 1);
the lung data set, GSE52583 (ref. 8);
the Paul et al. data set ;
the Olsson data set9, synapse ID syn4975060.
也是有着非常详细的使用教程 , 读取表达矩阵和分组信息,需要理解其定义好的一些S4对象。
还提出了好几个算法:
dpFeature
: Selecting features from dense cell clusters
Reversed graph embedding
DRTree
: Dimensionality Reduction via Learning a Tree
DDRTree
: discriminative dimensionality reduction via learning a tree
Census
: a normalization method to convert of single-cell mRNA transcript to relative transcript counts.
BEAM
: to test for branch-dependent gene expression by formulating the problem as a contrast between two negative binomial GLMs.
Branch time point detection algorithm :
主要是基于 CellDataSet 对象来进行下游分析,继承自ExpressionSet对象,也是常见的3个组成:
exprs
, a numeric matrix of expression values, where rows are genes, and columns are cells
phenoData
, an AnnotatedDataFrame
object, where rows are cells, and columns are cell attributes (such as cell type, culture condition, day captured, etc.)
featureData
, an AnnotatedDataFrame
object, where rows are features (e.g. genes), and columns are gene attributes, such as biotype, gc content, etc.
可以从头创建这样的对象,代码如下:
#do not run
HSMM_expr_matrix <- read.table("fpkm_matrix.txt")
HSMM_sample_sheet <- read.delim("cell_sample_sheet.txt")
HSMM_gene_annotation <- read.delim("gene_annotations.txt")
pd <- new("AnnotatedDataFrame", data = HSMM_sample_sheet)
fd <- new("AnnotatedDataFrame", data = HSMM_gene_annotation)
HSMM <- newCellDataSet(as.matrix(HSMM_expr_matrix), phenoData = pd, featureData = fd)
创建对象的时候需要指定引入的表达矩阵的方法,monocle2推荐用基于转录本的counts矩阵,同时也是默认的参数 expressionFamily=negbinomial.size()
,如果是其它RPKM/TMP等等,需要找到对应的参数。
monocle在bioconductor官网的主页给出了比较详尽的测试数据的示例代码:
R Script
基本上花上几个小时运行该例子,一步步理解输入输出,就可以学会使用。当然,要看懂算法就比较费劲了,需要仔细读paper。
值得一提的是最新版的monocle(version 2.4.0)依赖于 R version 3.4.0 ,如果R没有升级,即使强行安装了最新版monocle也是无济于事。
install.packages('https://www.bioconductor.org/packages/release/bioc/bin/macosx/el-capitan/contrib/3.4/monocle_2.4.0.tgz',
repos=NULL, type="source")
载入表达矩阵并转化为CellDataSet对象
对表达矩阵进行基于基因和样本的过滤并可视化
无监督的聚类
pseudotime分析
差异分析
下面是实战演练:
monocle在bioconductor官网的主页给出了比较详尽的测试数据的示例代码:
R Script
基本上花上几个小时运行该例子,一步步理解输入输出,就可以学会使用。当然,要看懂算法就比较费劲了,需要仔细读paper。
如果还没安装,就运行:
source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite("monocle")
biocLite("HSMMSingleCell")
如果已经安装,请直接加载
library(Biobase)
library(knitr)
library(reshape2)
library(ggplot2)
library(HSMMSingleCell)
library(monocle)
data(HSMM_expr_matrix) ## RPKM 矩阵,271个细胞,47192个基因
data(HSMM_gene_annotation)
data(HSMM_sample_sheet)
HSMM_expr_matrix[1:10,1:5]
## T0_CT_A01 T0_CT_A03 T0_CT_A05 T0_CT_A06 T0_CT_A07
## ENSG00000000003.10 21.984400 1.280040 43.461800 0.00000 39.807600
## ENSG00000000005.5 0.000000 0.000000 0.000000 0.00000 0.000000
## ENSG00000000419.8 40.059700 77.580800 6.496560 4.90934 1.156520
## ENSG00000000457.8 0.937081 0.729195 0.000000 0.00000 0.000000
## ENSG00000000460.12 0.740922 57.578500 3.935870 0.00000 0.000000
## ENSG00000000938.8 0.000000 0.000000 0.000000 0.00000 0.000000
## ENSG00000000971.11 3.002980 15.302400 50.804800 4.68513 0.000000
## ENSG00000001036.8 128.197000 16.086700 25.320900 10.66480 63.773500
## ENSG00000001084.6 7.619720 0.000000 0.000000 0.00000 0.000000
## ENSG00000001167.10 13.024900 24.777600 0.681409 1.36587 0.399352
head(HSMM_gene_annotation)
## gene_short_name biotype num_cells_expressed
## ENSG00000000003.10 TSPAN6 protein_coding 231
## ENSG00000000005.5 TNMD protein_coding 0
## ENSG00000000419.8 DPM1 protein_coding 275
## ENSG00000000457.8 SCYL3 protein_coding 24
## ENSG00000000460.12 C1orf112 protein_coding 78
## ENSG00000000938.8 FGR protein_coding 0
## use_for_ordering
## ENSG00000000003.10 FALSE
## ENSG00000000005.5 FALSE
## ENSG00000000419.8 FALSE
## ENSG00000000457.8 FALSE
## ENSG00000000460.12 TRUE
## ENSG00000000938.8 FALSE
head(HSMM_sample_sheet)
## Library Well Hours Media Mapped.Fragments Pseudotime State
## T0_CT_A01 SCC10013_A01 A01 0 GM 1958074 23.916673 1
## T0_CT_A03 SCC10013_A03 A03 0 GM 1930722 9.022265 1
## T0_CT_A05 SCC10013_A05 A05 0 GM 1452623 7.546608 1
## T0_CT_A06 SCC10013_A06 A06 0 GM 2566325 21.463948 1
## T0_CT_A07 SCC10013_A07 A07 0 GM 2383438 11.299806 1
## T0_CT_A08 SCC10013_A08 A08 0 GM 1472238 67.436042 2
主要是读取表达矩阵和样本描述信息,这里介绍两种方式,一种是读取基于 subjunc+featureCounts 分析后的reads counts矩阵,一种是读取 tophat+cufflinks 得到的RPKM表达矩阵。
library(monocle)
library(scater, quietly = TRUE)
library(knitr)
options(stringsAsFactors = FALSE)
# 这个文件是表达矩阵,包括线粒体基因和 ERCC spike-ins 的表达量,可以用来做质控
molecules <- read.table("tung/molecules.txt", sep = "\t")
## 这个文件是表达矩阵涉及到的所有样本的描述信息,包括样本来源于哪个细胞,以及哪个批次。
anno <- read.table("tung/annotation.txt", sep = "\t", header = TRUE)
rownames(anno)=colnames(molecules)
library(org.Hs.eg.db)
eg2symbol=toTable(org.Hs.egSYMBOL)
eg2ensembl=toTable(org.Hs.egENSEMBL)
egid=eg2ensembl[ match(rownames(molecules),eg2ensembl$ensembl_id),'gene_id']
symbol=eg2symbol[match( egid ,eg2symbol$gene_id),'symbol']
gene_annotation = data.frame(ensembl=rownames(molecules),
gene_short_name=symbol,
egid=egid)
rownames(gene_annotation)=rownames(molecules)
pd <- new("AnnotatedDataFrame", data = anno)
fd <- new("AnnotatedDataFrame", data = gene_annotation)
#tung <- newCellDataSet(as.matrix(molecules), phenoData = pd, featureData = fd)
tung <- newCellDataSet(as(as.matrix(molecules), "sparseMatrix"),
phenoData = pd,
featureData = fd,
lowerDetectionLimit=0.5,
expressionFamily=negbinomial.size())
tung
## CellDataSet (storageMode: environment)
## assayData: 19027 features, 864 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: NA19098.r1.A01 NA19098.r1.A02 ... NA19239.r3.H12
## (864 total)
## varLabels: individual replicate ... Size_Factor (6 total)
## varMetadata: labelDescription
## featureData
## featureNames: ENSG00000237683 ENSG00000187634 ... ERCC-00171
## (19027 total)
## fvarLabels: ensembl gene_short_name egid
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
可以看到 对象已经构造成功,是一个包含了 19027 features, 864 samples
的表达矩阵,需要进行一系列的过滤之后,拿到高质量的单细胞转录组数据进行下游分析。
这些样本来源于3个不同的人,每个人有3个批次的单细胞,每个批次单细胞都是96个。
pd <- new("AnnotatedDataFrame", data = HSMM_sample_sheet)
fd <- new("AnnotatedDataFrame", data = HSMM_gene_annotation)
# First create a CellDataSet from the relative expression levels
## 这里仅仅是针对rpkm表达矩阵的读取
HSMM <- newCellDataSet(as.matrix(HSMM_expr_matrix),
phenoData = pd,
featureData = fd,
lowerDetectionLimit=0.1,
expressionFamily=tobit(Lower=0.1))
# Next, use it to estimate RNA counts
rpc_matrix <- relative2abs(HSMM)
rpc_matrix[1:10,1:5]
## T0_CT_A01 T0_CT_A03 T0_CT_A05 T0_CT_A06 T0_CT_A07
## ENSG00000000003.10 1.60309506 0.09929705 2.93679928 0.00000000 2.18692386
## ENSG00000000005.5 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## ENSG00000000419.8 2.92113986 6.01820615 0.43898533 0.34343867 0.06353614
## ENSG00000000457.8 0.06833163 0.05656613 0.00000000 0.00000000 0.00000000
## ENSG00000000460.12 0.05402778 4.46655980 0.26595447 0.00000000 0.00000000
## ENSG00000000938.8 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## ENSG00000000971.11 0.21897629 1.18705914 3.43298023 0.32775379 0.00000000
## ENSG00000001036.8 9.34808217 1.24789995 1.71098300 0.74606865 3.50354678
## ENSG00000001084.6 0.55562742 0.00000000 0.00000000 0.00000000 0.00000000
## ENSG00000001167.10 0.94977133 1.92208258 0.04604415 0.09555105 0.02193934
## rpkm格式的表达值需要转换成reads counts之后才可以进行下游分析!
# Now, make a new CellDataSet using the RNA counts
HSMM <- newCellDataSet(as(as.matrix(rpc_matrix), "sparseMatrix"),
phenoData = pd,
featureData = fd,
lowerDetectionLimit=0.5,
expressionFamily=negbinomial.size())
下面的分析,都基于内置数据构建的S4对象,HSMM
这里只是把 基因挑选出来,并没有对S4对象进行过滤操作。 这个 detectGenes 函数还计算了 每个细胞里面表达的基因数量。
HSMM <- estimateSizeFactors(HSMM)
HSMM <- estimateDispersions(HSMM)
## Warning: Deprecated, use tibble::rownames_to_column() instead.
## Removing 139 outliers
HSMM <- detectGenes(HSMM, min_expr = 0.1)
print(head(fData(HSMM)))
## gene_short_name biotype num_cells_expressed
## ENSG00000000003.10 TSPAN6 protein_coding 184
## ENSG00000000005.5 TNMD protein_coding 0
## ENSG00000000419.8 DPM1 protein_coding 211
## ENSG00000000457.8 SCYL3 protein_coding 18
## ENSG00000000460.12 C1orf112 protein_coding 47
## ENSG00000000938.8 FGR protein_coding 0
## use_for_ordering
## ENSG00000000003.10 FALSE
## ENSG00000000005.5 FALSE
## ENSG00000000419.8 FALSE
## ENSG00000000457.8 FALSE
## ENSG00000000460.12 TRUE
## ENSG00000000938.8 FALSE
## 对每个基因都检查一下在多少个细胞里面是有表达量的。
## 只留下至少在10个细胞里面有表达量的那些基因,做后续分析
expressed_genes <- row.names(subset(fData(HSMM), num_cells_expressed >= 10))
length(expressed_genes) ## 只剩下了14224个基因
## [1] 14224
print(head(pData(HSMM)))
## Library Well Hours Media Mapped.Fragments Pseudotime State
## T0_CT_A01 SCC10013_A01 A01 0 GM 1958074 23.916673 1
## T0_CT_A03 SCC10013_A03 A03 0 GM 1930722 9.022265 1
## T0_CT_A05 SCC10013_A05 A05 0 GM 1452623 7.546608 1
## T0_CT_A06 SCC10013_A06 A06 0 GM 2566325 21.463948 1
## T0_CT_A07 SCC10013_A07 A07 0 GM 2383438 11.299806 1
## T0_CT_A08 SCC10013_A08 A08 0 GM 1472238 67.436042 2
## Size_Factor num_genes_expressed
## T0_CT_A01 1.392811 6850
## T0_CT_A03 1.311607 6947
## T0_CT_A05 1.218922 7019
## T0_CT_A06 1.013981 5560
## T0_CT_A07 1.085580 5998
## T0_CT_A08 1.099878 6055
这里选择的是通过不同时间点取样的细胞来进行分组查看,把 超过2个sd 的那些样本的临界值挑选出来,下一步过滤的时候使用。
pData(HSMM)$Total_mRNAs <- Matrix::colSums(exprs(HSMM))
HSMM <- HSMM[,pData(HSMM)$Total_mRNAs < 1e6]
upper_bound <- 10^(mean(log10(pData(HSMM)$Total_mRNAs)) +
2*sd(log10(pData(HSMM)$Total_mRNAs)))
lower_bound <- 10^(mean(log10(pData(HSMM)$Total_mRNAs)) -
2*sd(log10(pData(HSMM)$Total_mRNAs)))
table(pData(HSMM)$Hours)
##
## 0 24 48 72
## 69 74 79 49
qplot(Total_mRNAs, data = pData(HSMM), color = Hours, geom = "density") +
geom_vline(xintercept = lower_bound) +
geom_vline(xintercept = upper_bound)
上面已经根据基因表达情况以及样本的总测序数据选择好了阈值,下面就可以可视化并且对比检验一下执行过滤与否的区别。
HSMM <- HSMM[,pData(HSMM)$Total_mRNAs > lower_bound &
pData(HSMM)$Total_mRNAs < upper_bound]
HSMM <- detectGenes(HSMM, min_expr = 0.1)
L <- log(exprs(HSMM[expressed_genes,]))
melted_dens_df <- melt(Matrix::t(scale(Matrix::t(L))))
qplot(value, geom="density", data=melted_dens_df) + stat_function(fun = dnorm, size=0.5, color='red') +
xlab("Standardized log(FPKM)") +
ylab("Density")
下面这个代码只适用于这个测试数据, 主要是生物学背景知识,用MYF5基因和ANPEP基因来对细胞进行分类,可以区分Myoblast和Fibroblast。如果是自己的数据,建议多读读paper看看如何选取合适的基因,或者干脆跳过这个代码。
## 根据基因名字找到其在表达矩阵的ID,这里是ENSEMBL数据库的ID
MYF5_id <- row.names(subset(fData(HSMM), gene_short_name == "MYF5"))
ANPEP_id <- row.names(subset(fData(HSMM), gene_short_name == "ANPEP"))
## 这里选取的基因取决于自己的单细胞实验设计
cth <- newCellTypeHierarchy()
cth <- addCellType(cth, "Myoblast", classify_func = function(x) { x[MYF5_id,] >= 1 })
cth <- addCellType(cth, "Fibroblast", classify_func = function(x)
{ x[MYF5_id,] < 1 & x[ANPEP_id,] > 1 })
HSMM <- classifyCells(HSMM, cth, 0.1)
## Warning: Deprecated, use tibble::rownames_to_column() instead.
## Warning: Deprecated, use tibble::rownames_to_column() instead.
## 这个时候的HSMM已经被改变了,增加了属性。
table(pData(HSMM)$CellType)
##
## Fibroblast Myoblast Unknown
## 60 87 124
pie <- ggplot(pData(HSMM), aes(x = factor(1), fill = factor(CellType))) +
geom_bar(width = 1)
pie + coord_polar(theta = "y") +
theme(axis.title.x = element_blank(), axis.title.y = element_blank())
可以看到还有很大一部分细胞仅仅是根据这两个基因的表达量是无法成功的归类的。这个是很正常的,因为单细胞转录组测序里面的mRNA捕获率不够好。 通过这个步骤成功的给HSMM这个S4对象增加了一个属性,就是CellType,在下面的分析中会用得着。
这里需要安装最新版R包才可以使用里面的一些函数,因为上面的步骤基于指定基因的表达量进行细胞分组会漏掉很多信息,所以需要更好的聚类方式。
disp_table <- dispersionTable(HSMM)
head(disp_table)
## gene_id mean_expression dispersion_fit dispersion_empirical
## 1 ENSG00000000003.10 1.80534418 1.249323 1.215666
## 2 ENSG00000000419.8 2.17342979 1.099130 1.008759
## 3 ENSG00000000457.8 0.02518587 63.932303 23.177101
## 4 ENSG00000000460.12 0.15331486 10.805439 17.941440
## 5 ENSG00000000971.11 2.45231977 1.015354 1.287973
## 6 ENSG00000001036.8 1.04484075 1.894827 1.540376
## 只有满足 条件的10198个基因才能进入聚类分析
unsup_clustering_genes <- subset(disp_table, mean_expression >= 0.1)
HSMM <- setOrderingFilter(HSMM, unsup_clustering_genes$gene_id)
plot_ordering_genes(HSMM)
## 这里看看基因的表达量和基因的变异度之间的关系
## 处在灰色阴影区域的基因会被抛弃掉,不进入聚类分析。
plot_pc_variance_explained(HSMM, return_all = F) # norm_method = 'log',
HSMM <- reduceDimension(HSMM, max_components=2, num_dim = 6,
reduction_method = 'tSNE', verbose = T)
HSMM <- clusterCells(HSMM, num_clusters=2)
## Distance cutoff calculated to 1.072748
## 这里先用tSNE的聚类方法处理HSMM数据集,并可视化展示
plot_cell_clusters(HSMM, 1, 2, color="CellType", markers=c("MYF5", "ANPEP"))
## 可以看到并不能把细胞类型完全区分开,这个是完全有可能的,因为虽然是同一种细胞,但是有着不同的培养条件。
head(pData(HSMM))
## Library Well Hours Media Mapped.Fragments Pseudotime State
## T0_CT_A01 SCC10013_A01 A01 0 GM 1958074 23.916673 1
## T0_CT_A03 SCC10013_A03 A03 0 GM 1930722 9.022265 1
## T0_CT_A05 SCC10013_A05 A05 0 GM 1452623 7.546608 1
## T0_CT_A06 SCC10013_A06 A06 0 GM 2566325 21.463948 1
## T0_CT_A07 SCC10013_A07 A07 0 GM 2383438 11.299806 1
## T0_CT_A08 SCC10013_A08 A08 0 GM 1472238 67.436042 2
## Size_Factor num_genes_expressed Total_mRNAs CellType Cluster
## T0_CT_A01 1.392811 6850 39080 Myoblast 2
## T0_CT_A03 1.311607 6947 36720 Myoblast 1
## T0_CT_A05 1.218922 7019 34112 Myoblast 1
## T0_CT_A06 1.013981 5560 28384 Myoblast 2
## T0_CT_A07 1.085580 5998 30360 Myoblast 1
## T0_CT_A08 1.099878 6055 30808 Unknown 2
## peaks halo delta rho
## T0_CT_A01 FALSE FALSE 1.0694920 1.146961
## T0_CT_A03 FALSE FALSE 0.5544267 2.744092
## T0_CT_A05 FALSE FALSE 0.3270436 4.479191
## T0_CT_A06 FALSE FALSE 0.4767768 2.416054
## T0_CT_A07 FALSE FALSE 0.6011590 2.593689
## T0_CT_A08 FALSE FALSE 1.2702897 2.395104
head(fData(HSMM))
## gene_short_name biotype num_cells_expressed
## ENSG00000000003.10 TSPAN6 protein_coding 184
## ENSG00000000005.5 TNMD protein_coding 0
## ENSG00000000419.8 DPM1 protein_coding 211
## ENSG00000000457.8 SCYL3 protein_coding 18
## ENSG00000000460.12 C1orf112 protein_coding 47
## ENSG00000000938.8 FGR protein_coding 0
## use_for_ordering
## ENSG00000000003.10 TRUE
## ENSG00000000005.5 FALSE
## ENSG00000000419.8 TRUE
## ENSG00000000457.8 FALSE
## ENSG00000000460.12 TRUE
## ENSG00000000938.8 FALSE
## 所以这里也区分一下 培养基, a high-mitogen growth medium (GM) to a low-mitogen differentiation medium (DM).
plot_cell_clusters(HSMM, 1, 2, color="Media")
## 因为我们假设就2种细胞类型,所以在做聚类的时候可以把这个参数添加进去,这样可以去除无关变量的干扰。
HSMM <- reduceDimension(HSMM, max_components=2, num_dim = 2, reduction_method = 'tSNE',
residualModelFormulaStr="~Media + num_genes_expressed", verbose = T) #
HSMM <- clusterCells(HSMM, num_clusters=2)
## Distance cutoff calculated to 1.284778
plot_cell_clusters(HSMM, 1, 2, color="CellType")
plot_cell_clusters(HSMM, 1, 2, color="Cluster") + facet_wrap(~CellType)
## 这里的差异分析非常耗时
marker_diff <- markerDiffTable(HSMM[expressed_genes,],
cth,
residualModelFormulaStr="~Media + num_genes_expressed",
cores=1)
head(marker_diff)
## status family pval qval
## ENSG00000000003.10 OK negbinomial.size 0.8548230 1.0000000
## ENSG00000000419.8 OK negbinomial.size 0.9329316 1.0000000
## ENSG00000000457.8 OK negbinomial.size 0.7176166 0.9954975
## ENSG00000000460.12 OK negbinomial.size 0.2700496 0.8250088
## ENSG00000000971.11 OK negbinomial.size 0.4489895 0.9171190
## ENSG00000001036.8 OK negbinomial.size 0.5731998 0.9524046
## gene_short_name biotype num_cells_expressed
## ENSG00000000003.10 TSPAN6 protein_coding 184
## ENSG00000000419.8 DPM1 protein_coding 211
## ENSG00000000457.8 SCYL3 protein_coding 18
## ENSG00000000460.12 C1orf112 protein_coding 47
## ENSG00000000971.11 CFH protein_coding 198
## ENSG00000001036.8 FUCA2 protein_coding 171
## use_for_ordering
## ENSG00000000003.10 TRUE
## ENSG00000000419.8 TRUE
## ENSG00000000457.8 FALSE
## ENSG00000000460.12 TRUE
## ENSG00000000971.11 TRUE
## ENSG00000001036.8 TRUE
## 就是对每个基因增加了pval和qval两列信息,挑选出那些在不同media培养条件下显著差异表达的基因,310个,
candidate_clustering_genes <- row.names(subset(marker_diff, qval < 0.01))
## 计算这310个基因在不同的celltype的specificity值
marker_spec <- calculateMarkerSpecificity(HSMM[candidate_clustering_genes,], cth)
head(selectTopMarkers(marker_spec, 3))
## gene_id CellType specificity
## 1 ENSG00000019991.11 Fibroblast 0.9892130
## 2 ENSG00000128340.10 Fibroblast 0.9999602
## 3 ENSG00000163710.3 Fibroblast 0.9729971
## 4 ENSG00000111049.3 Myoblast 0.9743099
## 5 ENSG00000239922.1 Myoblast 0.9719681
## 6 ENSG00000270123.1 Myoblast 1.0000000
semisup_clustering_genes <- unique(selectTopMarkers(marker_spec, 500)$gene_id)
HSMM <- setOrderingFilter(HSMM, semisup_clustering_genes)
plot_ordering_genes(HSMM)
## 重新挑选基因,只用黑色高亮的基因来进行聚类。
plot_pc_variance_explained(HSMM, return_all = F) # norm_method = 'log',
HSMM <- reduceDimension(HSMM, max_components=2, num_dim = 2, reduction_method = 'tSNE',
residualModelFormulaStr="~Media + num_genes_expressed", verbose = T)
HSMM <- clusterCells(HSMM, num_clusters=2)
## Distance cutoff calculated to 1.02776
plot_cell_clusters(HSMM, 1, 2, color="CellType")
HSMM <- clusterCells(HSMM,
num_clusters=2,
frequency_thresh=0.1,
cell_type_hierarchy=cth)
## Distance cutoff calculated to 1.02776
plot_cell_clusters(HSMM, 1, 2, color="CellType", markers = c("MYF5", "ANPEP"))
pie <- ggplot(pData(HSMM), aes(x = factor(1), fill = factor(CellType))) +
geom_bar(width = 1)
pie + coord_polar(theta = "y") +
theme(axis.title.x=element_blank(), axis.title.y=element_blank())
主要目的是:Constructing Single Cell Trajectories
发育过程中细胞状态是不断变化的,monocle包利用算法学习所有基因的表达模式来把每个细胞安排到各各自的发展轨迹。 在大多数生物学过程中,参与的细胞通常不是同步发展的,只有单细胞转录组技术才能把处于该过程中各个中间状态的细胞分离开来,而monocle包里面的pseudotime分析方法正是要探究这些。
choose genes that define a cell’s progress
reduce data dimensionality
order cells along the trajectory
其中第一个步骤挑选合适的基因有3种策略,分别是:
Ordering based on genes that differ between clusters
Selecting genes with high dispersion across cells
Ordering cells using known marker genes
HSMM_myo <- HSMM[,pData(HSMM)$CellType == "Myoblast"]
HSMM_myo <- estimateDispersions(HSMM_myo)
## Warning: Deprecated, use tibble::rownames_to_column() instead.
## Removing 143 outliers
## 策略1: Ordering based on genes that differ between clusters
if(F){
diff_test_res <- differentialGeneTest(HSMM_myo[expressed_genes,],
fullModelFormulaStr="~Media")
ordering_genes <- row.names (subset(diff_test_res, qval < 0.01))
}
## 策略2:Selecting genes with high dispersion across cells
disp_table <- dispersionTable(HSMM_myo)
ordering_genes <- subset(disp_table,
mean_expression >= 0.5 &
dispersion_empirical >= 1 * dispersion_fit)$gene_id
HSMM_myo <- setOrderingFilter(HSMM_myo, ordering_genes)
plot_ordering_genes(HSMM_myo)
## Warning: Transformation introduced infinite values in continuous y-axis
## 挑选变异度大的基因,如图所示
HSMM_myo <- reduceDimension(HSMM_myo, max_components=2)
HSMM_myo <- orderCells(HSMM_myo)
## 排序好的细胞可以直接按照发育顺序可视化
plot_cell_trajectory(HSMM_myo, color_by="State")
前面的聚类分析和Pseudotime分析都需要取基因子集,就已经利用过差异分析方法来挑选那些有着显著表达差异的基因。如果对所有的基因来检验,非常耗时。
marker_genes <- row.names(subset(fData(HSMM_myo),
gene_short_name %in% c("MEF2C", "MEF2D", "MYF5",
"ANPEP", "PDGFRA","MYOG",
"TPM1", "TPM2", "MYH2",
"MYH3", "NCAM1", "TNNT1",
"TNNT2", "TNNC1", "CDK1",
"CDK2", "CCNB1", "CCNB2",
"CCND1", "CCNA1", "ID1")))
diff_test_res <- differentialGeneTest(HSMM_myo[marker_genes,],
fullModelFormulaStr="~Media")
# Select genes that are significant at an FDR < 10%
sig_genes <- subset(diff_test_res, qval < 0.1)
sig_genes[,c("gene_short_name", "pval", "qval")]
## gene_short_name pval qval
## ENSG00000081189.9 MEF2C 8.463396e-20 4.443283e-19
## ENSG00000105048.12 TNNT1 3.017738e-12 7.921562e-12
## ENSG00000109063.9 MYH3 4.105825e-33 4.311116e-32
## ENSG00000111049.3 MYF5 1.300906e-30 9.106344e-30
## ENSG00000114854.3 TNNC1 1.721612e-18 7.230769e-18
## ENSG00000118194.14 TNNT2 2.232213e-37 4.687647e-36
## ENSG00000122180.4 MYOG 2.532610e-12 7.597830e-12
## ENSG00000123374.6 CDK2 3.017043e-02 3.959868e-02
## ENSG00000125414.13 MYH2 6.221763e-06 1.005054e-05
## ENSG00000125968.7 ID1 1.734006e-05 2.601009e-05
## ENSG00000134057.10 CCNB1 4.502654e-11 1.050619e-10
## ENSG00000140416.15 TPM1 9.914869e-08 1.892839e-07
## ENSG00000149294.12 NCAM1 2.473279e-18 8.656478e-18
## ENSG00000157456.3 CCNB2 1.529020e-07 2.675785e-07
## ENSG00000170312.11 CDK1 5.316306e-08 1.116424e-07
## ENSG00000198467.8 TPM2 9.205156e-04 1.288722e-03
## 可以看到挑选的都是显著差异表达的基因。
还可以挑选其中几个基因来可视化看看它们是如何在不同组差异表达的。这个画图函数自己都可以写。
MYOG_ID1 <- HSMM_myo[row.names(subset(fData(HSMM_myo),
gene_short_name %in% c("MYOG", "CCNB2"))),]
plot_genes_jitter(MYOG_ID1, grouping="Media", ncol=2)
这样就可以测试某些基因,是否能区分细胞群体的不同类型及状态
to_be_tested <- row.names(subset(fData(HSMM),
gene_short_name %in% c("UBC", "NCAM1", "ANPEP")))
cds_subset <- HSMM[to_be_tested,]
diff_test_res <- differentialGeneTest(cds_subset, fullModelFormulaStr="~CellType")
diff_test_res[,c("gene_short_name", "pval", "qval")]
## gene_short_name pval qval
## ENSG00000149294.12 NCAM1 2.853848e-92 8.561545e-92
## ENSG00000150991.10 UBC 2.852264e-01 2.852264e-01
## ENSG00000166825.9 ANPEP 4.723193e-15 7.084790e-15
plot_genes_jitter(cds_subset, grouping="CellType", color_by="CellType",
nrow=1, ncol=NULL, plot_trend=TRUE)
## Warning: Computation failed in `stat_summary()`:
## Hmisc package required for this function
full_model_fits <- fitModel(cds_subset, modelFormulaStr="~CellType")
reduced_model_fits <- fitModel(cds_subset, modelFormulaStr="~1")
diff_test_res <- compareModels(full_model_fits, reduced_model_fits)
diff_test_res
## status family pval qval
## ENSG00000149294.12 OK negbinomial.size 2.853848e-92 8.561545e-92
## ENSG00000150991.10 OK negbinomial.size 2.852264e-01 2.852264e-01
## ENSG00000166825.9 OK negbinomial.size 4.723193e-15 7.084790e-15
plot_genes_in_pseudotime(cds_subset, color_by="Hours")
dpFeature: Selecting features from dense cell clusters
Reversed graph embedding
DRTree: Dimensionality Reduction via Learning a Tree
DDRTree: discriminative dimensionality reduction via learning a tree
Census: a normalization method to convert of single-cell mRNA transcript to relative transcript counts.
BEAM : to test for branch-dependent gene expression by formulating the problem as a contrast between two negative binomial GLMs.
Branch time point detection algorithm :
算法讲起来,就复杂了,略过。
鉴于这个单细胞转录组系列阅读量不高,为避免耽误公众号发展,后续会替换蹭其它内容。但是会组建单细胞数据处理交流小组,明天公布
https://mp.weixin.qq.com/s/zCfDkxbVTxjFQ5QAIULYjA