生信益站: a little something useful with every click.
Wishing you all a happy day, and a CNS paper every month~
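Single-cell transcriptomes are difficult to annotate without knowledge of the underlying biology of the system being studied. Even with that knowledge, accurate identification remains challenging, because common marker genes defined by bulk RNA-seq, flow cytometry, or other single-cell RNA-seq platforms often lack detectable expression.
clustifyr addresses this problem by automatically annotating single cells or clusters using bulk RNA-seq data or marker gene lists (ranked or unranked). It also allows exploratory analysis of the calculated similarities between single-cell RNA-seq datasets and reference data.
To install clustifyr, BiocManager must be installed first.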
install.packages("BiocManager")
BiocManager::install("clustifyr")
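In this example, we take a 10x Genomics 3' scRNA-seq dataset from peripheral blood mononuclear cells (PBMCs) and annotate its cell clusters (identified with Seurat) using scRNA-seq cell clusters assigned from a CITE-seq experiment.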
library(clustifyr)
library(ggplot2)
library(cowplot)
# Matrix of normalized single-cell RNA-seq counts
pbmc_matrix <- clustifyr::pbmc_matrix_small
# meta.data table containing cluster assignments for each cell
# The table that we are using also contains the known cell identities in the "classified" column
pbmc_meta <- clustifyr::pbmc_meta
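To identify cell types, the clustify() function requires several inputs, each of which is annotated in the call below.
When given a scRNA-seq count matrix, clustify() returns a matrix of correlation coefficients for each cell type and cluster, with row names corresponding to the cluster numbers.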
# Calculate correlation coefficients for each cluster (spearman by default)
vargenes <- pbmc_vargenes[1:500]
res <- clustify(
input = pbmc_matrix, # matrix of normalized scRNA-seq counts (or SCE/Seurat object)
metadata = pbmc_meta, # meta.data table containing cell clusters
cluster_col = "seurat_clusters", # name of column in meta.data containing cell clusters
ref_mat = cbmc_ref, # matrix of RNA-seq expression data for each cell type
query_genes = vargenes # list of highly variable genes identified with Seurat
)
# Peek at correlation matrix
res[1:5, 1:5]
#> B CD14+ Mono CD16+ Mono CD34+ CD4 T
#> 0 0.6563466 0.6454029 0.6485863 0.7089861 0.8804508
#> 1 0.6394363 0.6388404 0.6569401 0.7027430 0.8488750
#> 2 0.5524081 0.9372089 0.8930158 0.5879264 0.5347312
#> 3 0.8945380 0.5801453 0.6146857 0.6955897 0.6566739
#> 4 0.5711643 0.5623870 0.5826233 0.6280913 0.7467347
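To make the correlation matrix above more concrete, here is a minimal base-R sketch of the underlying idea: average expression per cluster, then rank-correlate each cluster profile against each reference profile. This is an illustration under assumptions, not clustifyr's actual implementation; the toy matrix, the reference profiles `typeA` and `typeB`, and all variable names are made up.

```r
# Toy data: 6 genes x 6 cells, two clusters of three cells each (values are made up)
genes <- paste0("gene", 1:6)
expr <- matrix(
  c(9, 8, 9, 1, 2, 1,
    7, 8, 7, 2, 1, 2,
    5, 6, 5, 3, 3, 4,
    3, 3, 4, 5, 6, 5,
    2, 1, 2, 7, 8, 7,
    1, 2, 1, 9, 8, 9),
  nrow = 6, byrow = TRUE,
  dimnames = list(genes, paste0("cell", 1:6))
)
clusters <- c("0", "0", "0", "1", "1", "1")

# Hypothetical reference: one expression profile per cell type
ref <- cbind(
  typeA = c(10, 8, 6, 4, 2, 0),
  typeB = c(0, 2, 4, 6, 8, 10)
)
rownames(ref) <- genes

# Step 1: average expression per cluster (genes in rows, clusters in columns)
cluster_avg <- sapply(
  split(seq_along(clusters), clusters),
  function(idx) rowMeans(expr[, idx, drop = FALSE])
)

# Step 2: keep only genes present in both query and reference
shared <- intersect(rownames(cluster_avg), rownames(ref))

# Step 3: Spearman correlation; clusters end up in rows, cell types in columns
toy_cor <- cor(cluster_avg[shared, ], ref[shared, ], method = "spearman")
print(toy_cor)
```

Because Spearman correlation works on ranks, it is insensitive to differences in scale between the query and the reference, which is presumably why it is a convenient default when comparing across platforms.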
# Call cell types
res2 <- cor_to_call(
cor_mat = res, # matrix of correlation coefficients
cluster_col = "seurat_clusters" # name of column in meta.data containing cell clusters
)
res2[1:5, ]
#> # A tibble: 5 × 3
#> # Groups: seurat_clusters [5]
#> seurat_clusters type r
#> <chr> <chr> <dbl>
#> 1 3 B 0.895
#> 2 2 CD14+ Mono 0.937
#> 3 5 CD16+ Mono 0.929
#> 4 0 CD4 T 0.880
#> 5 1 CD4 T 0.849
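Conceptually, calling a cell type amounts to picking, for each cluster, the reference type with the highest coefficient. The base-R sketch below illustrates that step on a made-up correlation matrix; the 0.5 cutoff and the "unassigned" label are assumptions for illustration, and cor_to_call() itself may handle ties and thresholds differently.

```r
# Hypothetical correlation matrix: clusters in rows, cell types in columns
toy_cor <- matrix(
  c(0.66, 0.65, 0.88,
    0.55, 0.94, 0.53,
    0.89, 0.58, 0.66),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("0", "2", "3"), c("B", "CD14+ Mono", "CD4 T"))
)
threshold <- 0.5 # assumed cutoff below which a cluster is left unassigned

# Column index of the best-matching cell type for each cluster
best_idx <- apply(toy_cor, 1, which.max)

calls <- data.frame(
  cluster = rownames(toy_cor),
  type = colnames(toy_cor)[best_idx],
  r = toy_cor[cbind(seq_len(nrow(toy_cor)), best_idx)],
  stringsAsFactors = FALSE
)

# Clusters whose best correlation is too weak get no label
calls$type[calls$r < threshold] <- "unassigned"
print(calls)
```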
# Insert into original metadata as "type" column
pbmc_meta2 <- call_to_metadata(
res = res2, # data.frame of called cell type for each cluster
metadata = pbmc_meta, # original meta.data table containing cell clusters
cluster_col = "seurat_clusters" # name of column in meta.data containing cell clusters
)
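The step above maps one call per cluster back onto every cell in that cluster. As a rough base-R illustration of that join (toy data, hypothetical column values; not how call_to_metadata() is necessarily written), a lookup with match() is enough:

```r
# Toy per-cell metadata with cluster assignments
meta <- data.frame(
  cell = paste0("cell", 1:5),
  seurat_clusters = c("0", "2", "0", "3", "2"),
  stringsAsFactors = FALSE
)

# Toy per-cluster calls (one row per cluster)
calls <- data.frame(
  seurat_clusters = c("0", "2", "3"),
  type = c("CD4 T", "CD14+ Mono", "B"),
  r = c(0.88, 0.94, 0.89),
  stringsAsFactors = FALSE
)

# Look up each cell's cluster in the call table and copy the label across
idx <- match(meta$seurat_clusters, calls$seurat_clusters)
meta$type <- calls$type[idx]
meta$r <- calls$r[idx]
print(meta)
```

Using match() rather than merge() keeps the original row order of the metadata, which matters when the table is later aligned with the columns of an expression matrix.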
To visualize the clustify() results, we can use the plot_cor_heatmap() function to plot the correlation coefficients for each cluster and each cell type.
# Create heatmap of correlation coefficients using clustify() output
plot_cor_heatmap(cor_mat = res)
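clustifyr also provides functions to overlay correlation coefficients on pre-calculated tSNE embeddings (or embeddings from any other dimensionality reduction method).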
# Overlay correlation coefficients on UMAPs for the first two cell types
corr_umaps <- plot_cor(
cor_mat = res, # matrix of correlation coefficients from clustify()
metadata = pbmc_meta, # meta.data table containing UMAP or tSNE data
data_to_plot = colnames(res)[1:2], # name of cell type(s) to plot correlation coefficients
cluster_col = "seurat_clusters" # name of column in meta.data containing cell clusters
)
plot_grid(
plotlist = corr_umaps,
rel_widths = c(0.47, 0.53)
)
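The plot_best_call() function can be used to label each cluster with the cell type that has the highest correlation coefficient. Using the plot_dims() function, we can also plot the known identity of each cluster, which is stored in the "classified" column of the meta.data table. The resulting plots show that the highest correlations between the reference RNA-seq data and the 10x Genomics scRNA-seq dataset are restricted to the correct cell clusters.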
# Label clusters with clustifyr cell identities
clustifyr_types <- plot_best_call(
cor_mat = res, # matrix of correlation coefficients from clustify()
metadata = pbmc_meta, # meta.data table containing UMAP or tSNE data
do_label = TRUE, # should the feature label be shown on each cluster?
do_legend = FALSE, # should the legend be shown?
do_repel = FALSE, # use ggrepel to avoid overlapping labels
cluster_col = "seurat_clusters"
) +
ggtitle("clustifyr cell types")
# Compare clustifyr results with known cell identities
known_types <- plot_dims(
data = pbmc_meta, # meta.data table containing UMAP or tSNE data
feature = "classified", # name of column in meta.data to color clusters by
do_label = TRUE, # should the feature label be shown on each cluster?
do_legend = FALSE, # should the legend be shown?
do_repel = FALSE
) +
ggtitle("Known cell types")
plot_grid(known_types, clustifyr_types)
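Besides comparing the two plots by eye, agreement between called and known identities can be checked numerically. The sketch below uses a made-up metadata table with a "classified" column and a "type" column. Because the two label sets may use different names for the same population (for example "FCGR3A+ Mono" versus "CD16+ Mono"), it cross-tabulates rather than testing for string equality. The purity measure is an illustrative choice, not a clustifyr function.

```r
# Toy metadata: known identity and called type for six cells (values are made up)
meta <- data.frame(
  classified = c("B", "B", "FCGR3A+ Mono", "FCGR3A+ Mono", "NK", "NK"),
  type = c("B", "B", "CD16+ Mono", "CD16+ Mono", "NK", "CD4 T"),
  stringsAsFactors = FALSE
)

# Confusion table: known identities in rows, called types in columns
confusion <- table(known = meta$classified, called = meta$type)
print(confusion)

# For each known identity, the fraction of its cells in the most common called type
purity <- apply(confusion, 1, max) / rowSums(confusion)
print(purity)
```

A purity close to 1 for every row means each known population maps almost entirely onto a single called type, even when the names differ.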
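The clustify_lists() function assigns cell types based on known marker genes. It requires a table containing the markers for each cell type of interest. Cell types can be assigned using several statistical tests, including hypergeometric, Jaccard, Spearman, and GSEA.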
# Take a peek at marker gene table
cbmc_m
#> CD4 T CD8 T Memory CD4 T CD14+ Mono Naive CD4 T NK B CD16+ Mono
#> 1 ITM2A CD8B ADCY2 S100A8 CDHR3 GNLY IGHM CDKN1C
#> 2 TXNIP CD8A PTGDR2 S100A9 DICER1-AS1 NKG7 CD79A HES4
#> 3 AES S100B CD200R1 LYZ RAD9A CST7 MS4A1 CKB
#> CD34+ Eryth Mk DC pDCs
#> 1 SPINK2 HBM PF4 ENHO LILRA4
#> 2 C1QTNF4 AHSP SDPR CD1E TPM2
#> 3 KIAA0125 CA1 TUBB1 NDRG2 SCT
# Available metrics include: "hyper", "jaccard", "spearman", "gsea"; this example uses "pct"
list_res <- clustify_lists(
input = pbmc_matrix, # matrix of normalized single-cell RNA-seq counts
metadata = pbmc_meta, # meta.data table containing cell clusters
cluster_col = "seurat_clusters", # name of column in meta.data containing cell clusters
marker = cbmc_m, # list of known marker genes
metric = "pct" # test to use for assigning cell types
)
# View as heatmap, or plot_best_call
plot_cor_heatmap(
cor_mat = list_res, # matrix of correlation coefficients from clustify_lists()
cluster_rows = FALSE, # cluster by row?
cluster_columns = FALSE, # cluster by column?
legend_title = "% expressed" # title of heatmap legend
)
# Downstream functions same as clustify()
# Call cell types
list_res2 <- cor_to_call(
cor_mat = list_res, # matrix of scores from clustify_lists()
cluster_col = "seurat_clusters" # name of column in meta.data containing cell clusters
)
# Insert into original metadata as "list_type" column
pbmc_meta3 <- call_to_metadata(
res = list_res2, # data.frame of called cell type for each cluster
metadata = pbmc_meta, # original meta.data table containing cell clusters
cluster_col = "seurat_clusters", # name of column in meta.data containing cell clusters
rename_prefix = "list_" # set a prefix for the new column
)
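To give a feel for what the marker-list metrics mentioned above measure, the base-R sketch below scores one hypothetical cluster against three marker lists taken from the marker table shown earlier, using the Jaccard index and a hypergeometric test. The cluster gene set and the background size of 2000 genes are assumptions, and clustify_lists() may binarize and score the data differently.

```r
# Hypothetical set of genes found to be enriched in one cluster
cluster_genes <- c("CD79A", "MS4A1", "IGHM", "LYZ")

# Marker lists per cell type (first three rows of the marker table above)
markers <- list(
  B = c("IGHM", "CD79A", "MS4A1"),
  NK = c("GNLY", "NKG7", "CST7"),
  `CD14+ Mono` = c("S100A8", "S100A9", "LYZ")
)
n_background <- 2000 # assumed number of genes in the background universe

# Jaccard index: size of the overlap divided by size of the union
jaccard <- sapply(markers, function(mk) {
  length(intersect(cluster_genes, mk)) / length(union(cluster_genes, mk))
})

# Hypergeometric test: probability of an overlap at least this large by chance
hyper_p <- sapply(markers, function(mk) {
  overlap <- length(intersect(cluster_genes, mk))
  phyper(overlap - 1,
    m = length(mk), n = n_background - length(mk),
    k = length(cluster_genes), lower.tail = FALSE
  )
})

print(jaccard)
print(hyper_p)
```

Both metrics point to the same cell type here: B has the largest Jaccard index and the smallest hypergeometric p-value.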
clustifyr can also take a SingleCellExperiment object as input and return a new SingleCellExperiment object with the cell types added as a column in colData.
library(SingleCellExperiment)
sce <- sce_pbmc()
res <- clustify(
input = sce, # an SCE object
ref_mat = cbmc_ref, # matrix of RNA-seq expression data for each cell type
cluster_col = "clusters", # name of column in colData containing cell clusters
obj_out = TRUE # output SCE object with cell type inserted as "type" column
)
colData(res)[1:10, c("type", "r")]
#> DataFrame with 10 rows and 2 columns
#> type r
#> <character> <numeric>
#> AAACATACAACCAC CD4 T 0.861083
#> AAACATTGAGCTAC B 0.909358
#> AAACATTGATCAGC CD4 T 0.861083
#> AAACCGTGCTTCCG CD14+ Mono 0.914543
#> AAACCGTGTATGCG NK 0.894090
#> AAACGCACTGGTAC CD4 T 0.861083
#> AAACGCTGACCAGT NK 0.825784
#> AAACGCTGGTTCTT NK 0.825784
#> AAACGCTGTAGCCA CD4 T 0.889149
#> AAACGCTGTTTCTG CD16+ Mono 0.929491
clustifyr can also take a Seurat object as input and return a new Seurat object with the cell types added as a column in meta.data.
so <- so_pbmc()
res <- clustify(
input = so, # a Seurat object
ref_mat = cbmc_ref, # matrix of RNA-seq expression data for each cell type
cluster_col = "seurat_clusters", # name of column in meta.data containing cell clusters
obj_out = TRUE # output Seurat object with cell type inserted as "type" column
)
res@meta.data[1:10, c("type", "r")]
#> type r
#> AAACATACAACCAC CD4 T 0.8424452
#> AAACATTGAGCTAC B 0.8984684
#> AAACATTGATCAGC CD4 T 0.8424452
#> AAACCGTGCTTCCG CD14+ Mono 0.9319558
#> AAACCGTGTATGCG NK 0.8816119
#> AAACGCACTGGTAC CD4 T 0.8424452
#> AAACGCTGACCAGT NK 0.8147040
#> AAACGCTGGTTCTT NK 0.8147040
#> AAACGCTGTAGCCA CD4 T 0.8736163
#> AAACGCTGTTTCTG CD16+ Mono 0.9321784
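In its simplest form, a reference matrix is built by averaging the expression of a single-cell RNA-seq expression matrix by cluster (an option to take the median instead is also available). Both log-transformed and raw count matrices are supported.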
new_ref_matrix <- average_clusters(
mat = pbmc_matrix,
metadata = pbmc_meta$classified, # or use metadata = pbmc_meta, cluster_col = "classified"
if_log = TRUE # whether the expression matrix is already log transformed
)
head(new_ref_matrix)
#> B CD14+ Mono CD8 T DC FCGR3A+ Mono
#> PPBP 0.09375021 0.28763857 0.35662599 0.06527347 0.2442300
#> LYZ 1.42699419 5.21550849 1.35146753 4.84714962 3.4034309
#> S100A9 0.62123058 4.91453355 0.58823794 2.53310734 2.6277996
#> IGLL5 2.44576997 0.02434753 0.03284986 0.10986617 0.2581198
#> GNLY 0.37877736 0.53592906 2.53161887 0.46959958 0.2903092
#> FTL 3.66698837 5.86217774 3.37056910 4.21848878 5.9518479
#> Memory CD4 T Naive CD4 T NK Platelet
#> PPBP 0.06494743 0.04883837 0.00000000 6.0941782
#> LYZ 1.39466552 1.40165143 1.32701580 2.5303912
#> S100A9 0.58080250 0.55679700 0.52098541 1.6775692
#> IGLL5 0.04826212 0.03116080 0.05247669 0.2501642
#> GNLY 0.41001072 0.46041901 4.70481754 0.3845813
#> FTL 3.31062958 3.35611600 3.38471536 4.5508242
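The averaging step can be sketched in base R as follows. This is an illustration under assumptions, not the package's code: it assumes the log transform is a natural-log `log1p`, and it handles log-transformed input by averaging on the original scale and re-applying the transform. The helper name `average_by_cluster` and the toy values are made up.

```r
# Toy log-transformed matrix: 2 genes x 4 cells (underlying counts are made up)
expr_log <- matrix(
  log1p(c(0, 2, 4, 10,
          8, 8, 0, 0)),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), paste0("cell", 1:4))
)
labels <- c("B", "B", "NK", "NK")

# Hypothetical helper: average expression per label
average_by_cluster <- function(mat, labels, if_log = TRUE) {
  groups <- split(seq_along(labels), labels)
  sapply(groups, function(idx) {
    sub <- mat[, idx, drop = FALSE]
    if (if_log) {
      # Undo the log, average on the original scale, then log again
      log1p(rowMeans(expm1(sub)))
    } else {
      rowMeans(sub)
    }
  })
}

toy_ref <- average_by_cluster(expr_log, labels, if_log = TRUE)
print(toy_ref)
```

Averaging on the original scale matters because the mean of log values is not the log of the mean; telling the function whether the input is already log-transformed avoids silently mixing the two.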
# For convenience, shortcut functions generate a reference matrix directly from a `SingleCellExperiment` or `Seurat` object.
new_ref_matrix_sce <- object_ref(
input = sce, # SCE object
cluster_col = "clusters" # name of column in colData containing cell identities
)
new_ref_matrix_so <- seurat_ref(
seurat_object = so, # Seurat object
cluster_col = "seurat_clusters" # name of column in meta.data containing cell identities
)
tail(new_ref_matrix_so)
#> 0 1 2 3 4
#> RHOC 0.245754269 0.40431050 0.590053057 0.37702525 0.86466156
#> CISH 0.492444272 0.54773003 0.079843557 0.08962348 0.66024943
#> CD27 1.195370020 1.28719850 0.100562312 0.54487892 1.28322681
#> LILRA3 0.004576215 0.03686387 0.544743180 0.00000000 0.03409087
#> CLIC2 0.007570624 0.00000000 0.021200958 0.00000000 0.00000000
#> HEMGN 0.034099324 0.04619359 0.006467157 0.00000000 0.00000000
#> 5 6 7 8
#> RHOC 2.20162898 1.6518448 0.72661911 1.465067
#> CISH 0.10588034 0.1339815 0.18231747 0.000000
#> CD27 0.09640885 0.1600911 0.04639912 0.000000
#> LILRA3 1.35418074 0.0000000 0.00000000 0.000000
#> CLIC2 0.00000000 0.0000000 0.46542832 0.000000
#> HEMGN 0.00000000 0.0000000 0.00000000 1.079083
For more tutorials, see: https://rnabioco.github.io/clustifyrdata/articles/otherformats.html
#> R version 4.3.3 (2024-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SingleCellExperiment_1.24.0 SummarizedExperiment_1.32.0
#> [3] Biobase_2.62.0 GenomicRanges_1.54.1
#> [5] GenomeInfoDb_1.38.8 IRanges_2.36.0
#> [7] S4Vectors_0.40.2 BiocGenerics_0.48.1
#> [9] MatrixGenerics_1.14.0 matrixStats_1.3.0
#> [11] cowplot_1.1.3 ggplot2_3.5.0
#> [13] clustifyr_1.15.4 BiocStyle_2.30.0
#>
#> loaded via a namespace (and not attached):
#> [1] bitops_1.0-7 rlang_1.1.3
#> [3] magrittr_2.0.3 clue_0.3-65
#> [5] GetoptLong_1.0.5 compiler_4.3.3
#> [7] png_0.1-8 systemfonts_1.0.6
#> [9] vctrs_0.6.5 shape_1.4.6.1
#> [11] pkgconfig_2.0.3 crayon_1.5.2
#> [13] fastmap_1.1.1 XVector_0.42.0
#> [15] labeling_0.4.3 utf8_1.2.4
#> [17] rmarkdown_2.26 ragg_1.3.0
#> [19] purrr_1.0.2 xfun_0.43
#> [21] zlibbioc_1.48.2 cachem_1.0.8
#> [23] jsonlite_1.8.8 highr_0.10
#> [25] DelayedArray_0.28.0 BiocParallel_1.36.0
#> [27] cluster_2.1.6 parallel_4.3.3
#> [29] R6_2.5.1 bslib_0.7.0
#> [31] RColorBrewer_1.1-3 parallelly_1.37.1
#> [33] jquerylib_0.1.4 Rcpp_1.0.12
#> [35] bookdown_0.39 iterators_1.0.14
#> [37] knitr_1.46 future.apply_1.11.2
#> [39] Matrix_1.6-5 tidyselect_1.2.1
#> [41] abind_1.4-5 yaml_2.3.8
#> [43] doParallel_1.0.17 codetools_0.2-20
#> [45] listenv_0.9.1 lattice_0.22-6
#> [47] tibble_3.2.1 withr_3.0.0
#> [49] evaluate_0.23 future_1.33.2
#> [51] desc_1.4.3 circlize_0.4.16
#> [53] pillar_1.9.0 BiocManager_1.30.22
#> [55] foreach_1.5.2 generics_0.1.3
#> [57] sp_2.1-3 RCurl_1.98-1.14
#> [59] munsell_0.5.1 scales_1.3.0
#> [61] globals_0.16.3 glue_1.7.0
#> [63] tools_4.3.3 data.table_1.15.4
#> [65] fgsea_1.28.0 fs_1.6.3
#> [67] dotCall64_1.1-1 fastmatch_1.1-4
#> [69] grid_4.3.3 tidyr_1.3.1
#> [71] colorspace_2.1-0 GenomeInfoDbData_1.2.11
#> [73] cli_3.6.2 textshaping_0.3.7
#> [75] spam_2.10-0 fansi_1.0.6
#> [77] S4Arrays_1.2.1 ComplexHeatmap_2.18.0
#> [79] dplyr_1.1.4 gtable_0.3.4
#> [81] sass_0.4.9 digest_0.6.35
#> [83] progressr_0.14.0 SparseArray_1.2.4
#> [85] farver_2.1.1 rjson_0.2.21
#> [87] htmlwidgets_1.6.4 SeuratObject_5.0.1
#> [89] memoise_2.0.1 entropy_1.3.1
#> [91] htmltools_0.5.8.1 pkgdown_2.0.9
#> [93] lifecycle_1.0.4 httr_1.4.7
#> [95] GlobalOptions_0.1.2
OK, that's all for today. See you tomorrow~
https://mp.weixin.qq.com/s/GQiRQIbr1rx5xKYzC1My3Q