Fully Automated Single-Cell Annotation (Part 1): clustifyr, by 生信益站

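生信益站: one click and you benefit! Wishing all our friends happiness every day and a CNS (Cell, Nature, Science) paper every month~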

1 Why use clustifyr?

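Single-cell transcriptomes are difficult to annotate without knowledge of the underlying biology of the system in question. Even with this knowledge, accurate identification can be challenging because common marker genes defined by bulk RNA-seq, flow cytometry, or other single-cell RNA-seq platforms often lack detectable expression.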

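clustifyr addresses this problem by providing functions that automatically annotate single cells or clusters using bulk RNA-seq data or marker gene lists (ranked or unranked). It also supports exploratory analysis of the calculated similarities between single-cell RNA-seq datasets and reference data.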

2 Installation

clustifyr is distributed through Bioconductor, so BiocManager must be installed first.

install.packages("BiocManager")

BiocManager::install("clustifyr")

3 A simple example: 10x Genomics PBMCs

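In this example, we take a 10x Genomics 3' scRNA-seq dataset from peripheral blood mononuclear cells (PBMCs) and annotate its cell clusters (identified using Seurat) using scRNA-seq cell clusters assigned from a CITE-seq experiment.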

library(clustifyr)
library(ggplot2)
library(cowplot)

# Matrix of normalized single-cell RNA-seq counts
pbmc_matrix <- clustifyr::pbmc_matrix_small

# meta.data table containing cluster assignments for each cell
# The table that we are using also contains the known cell identities in the "classified" column
pbmc_meta <- clustifyr::pbmc_meta
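Before running the annotation, it can help to confirm that the expression matrix, the metadata table, and the reference agree with each other. The following is a minimal base-R sketch on made-up toy objects. It assumes that metadata rows are named by cell barcode and that genes are in the rows of both matrices. The object names (expr, meta, ref) are illustrative only and are not part of clustifyr.

```r
# Toy stand-ins for the real objects (all values and names here are made up)
expr <- matrix(
  1:12, nrow = 3,
  dimnames = list(c("LYZ", "GNLY", "FTL"), paste0("cell", 1:4))
)
meta <- data.frame(
  seurat_clusters = c("0", "0", "1", "1"),
  row.names = paste0("cell", 1:4)
)
ref <- matrix(
  1:6, nrow = 3,
  dimnames = list(c("LYZ", "GNLY", "PPBP"), c("TypeA", "TypeB"))
)

# Every cell in the matrix should have a row in the metadata table
cells_ok <- all(colnames(expr) %in% rownames(meta))

# Only genes present in both the query and the reference can be compared
shared_genes <- intersect(rownames(expr), rownames(ref))

cells_ok
shared_genes
```

If very few genes are shared, the gene identifiers in the query and the reference probably follow different naming schemes, and any similarity scores would rest on too little data.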

4 Calculating correlation coefficients

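To identify cell types, the clustify() function requires several inputs:

input: a SingleCellExperiment or Seurat object, or a matrix of normalized single-cell RNA-seq counts
metadata: a meta.data table containing the cluster assignment for each cell (not required if a Seurat object is given)
ref_mat: a reference matrix containing RNA-seq expression data for each cell type of interest
query_genes: a list of genes to use for the comparison (optional but recommended)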

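When given a matrix of scRNA-seq counts, clustify() returns a matrix of correlation coefficients for each cell type and cluster, with row names corresponding to the cluster numbers.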
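To make the idea of a cluster-by-cell-type correlation matrix concrete, here is a minimal base-R sketch on toy data. It averages expression per cluster and then computes Spearman correlations against a reference. This only illustrates the general approach and is not clustifyr's actual implementation. All values and names (expr, clusters, ref, TypeA, TypeB) are made up.

```r
# Toy data: 4 genes x 6 cells, with one cluster label per cell (values are made up)
expr <- matrix(
  c(5, 6, 5, 0, 1, 0,
    4, 5, 4, 1, 0, 1,
    0, 1, 0, 6, 5, 6,
    1, 0, 1, 4, 5, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:6))
)
clusters <- c("0", "0", "0", "1", "1", "1")

# Hypothetical reference: the same 4 genes measured in two cell types
ref <- matrix(
  c(9, 1,
    7, 2,
    1, 8,
    2, 6),
  nrow = 4, byrow = TRUE,
  dimnames = list(rownames(expr), c("TypeA", "TypeB"))
)

# Step 1: average expression per cluster (result: genes x clusters)
cluster_avg <- sapply(
  split(seq_along(clusters), clusters),
  function(idx) rowMeans(expr[, idx, drop = FALSE])
)

# Step 2: Spearman correlation of every cluster profile against every
# reference profile (result: clusters x cell types)
cor_mat <- cor(cluster_avg, ref, method = "spearman")
cor_mat
```

Because Spearman correlation works on ranks, the query and reference do not need to be on the same scale, which is convenient when comparing single-cell data with bulk RNA-seq.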

# Calculate correlation coefficients for each cluster (spearman by default)
vargenes <- pbmc_vargenes[1:500]

res <- clustify(
  input = pbmc_matrix, # matrix of normalized scRNA-seq counts (or SCE/Seurat object)
  metadata = pbmc_meta, # meta.data table containing cell clusters
  cluster_col = "seurat_clusters", # name of column in meta.data containing cell clusters
  ref_mat = cbmc_ref, # matrix of RNA-seq expression data for each cell type
  query_genes = vargenes # list of highly variable genes identified with Seurat
)

# Peek at correlation matrix
res[1:5, 1:5]
#>           B CD14+ Mono CD16+ Mono     CD34+     CD4 T
#> 0 0.6563466  0.6454029  0.6485863 0.7089861 0.8804508
#> 1 0.6394363  0.6388404  0.6569401 0.7027430 0.8488750
#> 2 0.5524081  0.9372089  0.8930158 0.5879264 0.5347312
#> 3 0.8945380  0.5801453  0.6146857 0.6955897 0.6566739
#> 4 0.5711643  0.5623870  0.5826233 0.6280913 0.7467347

# Call cell types
res2 <- cor_to_call(
  cor_mat = res,                  # matrix correlation coefficients
  cluster_col = "seurat_clusters" # name of column in meta.data containing cell clusters
)
res2[1:5, ]
#> # A tibble: 5 × 3
#> # Groups:   seurat_clusters [5]
#>   seurat_clusters type           r
#>   <chr>           <chr>      <dbl>
#> 1 3               B          0.895
#> 2 2               CD14+ Mono 0.937
#> 3 5               CD16+ Mono 0.929
#> 4 0               CD4 T      0.880
#> 5 1               CD4 T      0.849

# Insert into original metadata as "type" column
pbmc_meta2 <- call_to_metadata(
  res = res2,                     # data.frame of called cell type for each cluster
  metadata = pbmc_meta,           # original meta.data table containing cell clusters
  cluster_col = "seurat_clusters" # name of column in meta.data containing cell clusters
)
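The calling step above boils down to picking, for each cluster, the reference cell type with the highest correlation. Here is a minimal base-R sketch of that logic. It uses the first four rows of the correlation matrix printed above, restricted to the five columns shown, so a call made on the full matrix could differ. The helper call_types() and its threshold argument are hypothetical names for this illustration and are not clustifyr functions.

```r
# Correlation values copied from the printed res[1:5, 1:5] output (rows 0 to 3)
cor_mat <- matrix(
  c(0.6563466, 0.6454029, 0.6485863, 0.7089861, 0.8804508,
    0.6394363, 0.6388404, 0.6569401, 0.7027430, 0.8488750,
    0.5524081, 0.9372089, 0.8930158, 0.5879264, 0.5347312,
    0.8945380, 0.5801453, 0.6146857, 0.6955897, 0.6566739),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("0", "1", "2", "3"),
    c("B", "CD14+ Mono", "CD16+ Mono", "CD34+", "CD4 T")
  )
)

# Hypothetical helper: take the best-scoring cell type per cluster and label
# clusters whose best score is below `threshold` as "unassigned"
call_types <- function(cor_mat, threshold = 0) {
  best <- apply(cor_mat, 1, which.max)
  r <- cor_mat[cbind(seq_len(nrow(cor_mat)), best)]
  type <- colnames(cor_mat)[best]
  type[r < threshold] <- "unassigned"
  data.frame(cluster = rownames(cor_mat), type = type, r = r,
             stringsAsFactors = FALSE)
}

calls <- call_types(cor_mat)
calls

# Map cluster-level calls back onto individual cells (toy cell table)
cell_meta <- data.frame(seurat_clusters = c("3", "0", "2", "0"))
cell_meta$type <- calls$type[match(cell_meta$seurat_clusters, calls$cluster)]
cell_meta
```

A cutoff such as the hypothetical threshold here is useful when the dataset may contain cell types missing from the reference, since the top-scoring type is otherwise always assigned.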

To visualize the clustify() results, we can use the plot_cor_heatmap() function to plot the correlation coefficients for each cluster and each cell type.

# Create heatmap of correlation coefficients using clustify() output
plot_cor_heatmap(cor_mat = res)

5 Plotting cluster identities and correlation coefficients

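clustifyr also provides functions to overlay correlation coefficients on pre-calculated tSNE embeddings (or embeddings from any other dimensionality reduction method, such as UMAP).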

# Overlay correlation coefficients on UMAPs for the first two cell types
corr_umaps <- plot_cor(
  cor_mat = res,                     # matrix of correlation coefficients from clustify()
  metadata = pbmc_meta,              # meta.data table containing UMAP or tSNE data
  data_to_plot = colnames(res)[1:2], # name of cell type(s) to plot correlation coefficients
  cluster_col = "seurat_clusters"    # name of column in meta.data containing cell clusters
)

plot_grid(
  plotlist = corr_umaps,
  rel_widths = c(0.47, 0.53)
)

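The plot_best_call() function can be used to label each cluster with the cell type that gives the highest correlation coefficient. Using the plot_dims() function, we can also plot the known identities of each cluster, which are stored in the "classified" column of the meta.data table. The plots below show that the highest correlations between the reference RNA-seq data and the 10x Genomics scRNA-seq dataset are restricted to the correct cell clusters.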

# Label clusters with clustifyr cell identities
clustifyr_types <- plot_best_call(
  cor_mat = res,          # matrix of correlation coefficients from clustify()
  metadata = pbmc_meta,   # meta.data table containing UMAP or tSNE data
  do_label = TRUE,        # should the feature label be shown on each cluster?
  do_legend = FALSE,      # should the legend be shown?
  do_repel = FALSE,       # use ggrepel to avoid overlapping labels
  cluster_col = "seurat_clusters"
) +
  ggtitle("clustifyr cell types")

# Compare clustifyr results with known cell identities
known_types <- plot_dims(
  data = pbmc_meta,       # meta.data table containing UMAP or tSNE data
  feature = "classified", # name of column in meta.data to color clusters by
  do_label = TRUE,        # should the feature label be shown on each cluster?
  do_legend = FALSE,      # should the legend be shown?
  do_repel = FALSE
) +
  ggtitle("Known cell types")

plot_grid(known_types, clustifyr_types)
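Besides comparing the two plots by eye, the agreement between known and predicted labels can be summarized numerically. Here is a minimal base-R sketch on a made-up six-cell table. The column names classified and type mirror the ones used in this post, but the values are invented for illustration.

```r
# Toy per-cell table: known identity vs. predicted identity (values are made up)
meta <- data.frame(
  classified = c("B", "B", "CD14+ Mono", "CD14+ Mono", "NK", "NK"),
  type       = c("B", "B", "CD14+ Mono", "CD16+ Mono", "NK", "NK"),
  stringsAsFactors = FALSE
)

# Cross-tabulate known labels (rows) against predicted labels (columns)
confusion <- table(known = meta$classified, predicted = meta$type)
confusion

# Fraction of cells whose predicted label exactly matches the known label
agreement <- mean(meta$classified == meta$type)
agreement
```

Exact string matching undercounts agreement when the reference and the known annotation name the same population differently. In this post, for example, the reference uses "CD16+ Mono" while the known labels use "FCGR3A+ Mono", so the labels should be harmonized before computing such a score.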

6 Classifying cells using known marker genes

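The clustify_lists() function assigns cell types based on known marker genes. It requires a table containing markers for each cell type of interest. Cell types can be assigned using several statistical tests, including hypergeometric, Jaccard, Spearman, and GSEA.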
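To give a feel for two of these tests, here is a minimal base-R sketch of a Jaccard index and a hypergeometric overlap test between one cluster's genes and two marker lists. The B and NK markers are taken from the marker table in this post. The cluster gene list, the gene universe size (n_universe), and the helper names are assumptions made for illustration. This is not how clustify_lists() is implemented internally.

```r
# Hypothetical top genes for one cluster
cluster_genes <- c("CD79A", "MS4A1", "IGHM", "LYZ", "FTL")

# Marker lists for two cell types (from the marker table in this post)
markers <- list(
  B  = c("IGHM", "CD79A", "MS4A1"),
  NK = c("GNLY", "NKG7", "CST7")
)

# Assumed total number of genes considered (the "universe")
n_universe <- 2000

# Jaccard index: size of the intersection divided by size of the union
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Hypergeometric test: probability of seeing at least the observed overlap
# when drawing length(a) genes at random from the universe
hyper_p <- function(a, b, n_universe) {
  overlap <- length(intersect(a, b))
  phyper(overlap - 1, m = length(b), n = n_universe - length(b),
         k = length(a), lower.tail = FALSE)
}

jaccard_scores <- sapply(markers, jaccard, a = cluster_genes)
hyper_pvalues  <- sapply(markers, hyper_p, a = cluster_genes,
                         n_universe = n_universe)

jaccard_scores
hyper_pvalues
```

A higher Jaccard index or a smaller hypergeometric p-value indicates a better match, so in this toy case the cluster would be called a B cell cluster.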

# Take a peek at marker gene table
cbmc_m
#>   CD4 T CD8 T Memory CD4 T CD14+ Mono Naive CD4 T   NK     B CD16+ Mono
#> 1 ITM2A  CD8B        ADCY2     S100A8       CDHR3 GNLY  IGHM     CDKN1C
#> 2 TXNIP  CD8A       PTGDR2     S100A9  DICER1-AS1 NKG7 CD79A       HES4
#> 3   AES S100B      CD200R1        LYZ       RAD9A CST7 MS4A1        CKB
#>      CD34+ Eryth    Mk    DC   pDCs
#> 1   SPINK2   HBM   PF4  ENHO LILRA4
#> 2  C1QTNF4  AHSP  SDPR  CD1E   TPM2
#> 3 KIAA0125   CA1 TUBB1 NDRG2    SCT

# Available metrics include: "hyper", "jaccard", "spearman", "gsea", and "pct" (used here)
list_res <- clustify_lists(
  input = pbmc_matrix,             # matrix of normalized single-cell RNA-seq counts
  metadata = pbmc_meta,            # meta.data table containing cell clusters
  cluster_col = "seurat_clusters", # name of column in meta.data containing cell clusters
  marker = cbmc_m,                 # list of known marker genes
  metric = "pct"                   # test to use for assigning cell types
)

# View as heatmap, or plot_best_call
plot_cor_heatmap(
  cor_mat = list_res,              # matrix of correlation coefficients from clustify_lists()
  cluster_rows = FALSE,            # cluster by row?
  cluster_columns = FALSE,         # cluster by column?
  legend_title = "% expressed"     # title of heatmap legend
)

# Downstream functions same as clustify()
# Call cell types
list_res2 <- cor_to_call(
  cor_mat = list_res,              # matrix correlation coefficients
  cluster_col = "seurat_clusters"  # name of column in meta.data containing cell clusters
)

# Insert into original metadata as "list_type" column
pbmc_meta3 <- call_to_metadata(
  res = list_res2,                 # data.frame of called cell type for each cluster
  metadata = pbmc_meta,            # original meta.data table containing cell clusters
  cluster_col = "seurat_clusters", # name of column in meta.data containing cell clusters
  rename_prefix = "list_"          # set a prefix for the new column
)

7 Working directly with SingleCellExperiment objects

clustifyr can also take a SingleCellExperiment object as input and return a new SingleCellExperiment object with the cell types added as a column in colData.

library(SingleCellExperiment)
sce <- sce_pbmc()
res <- clustify(
  input = sce,                # an SCE object
  ref_mat = cbmc_ref,         # matrix of RNA-seq expression data for each cell type
  cluster_col = "clusters",   # name of column in meta.data containing cell clusters
  obj_out = TRUE              # output SCE object with cell type inserted as "type" column
)

colData(res)[1:10, c("type", "r")]
#> DataFrame with 10 rows and 2 columns
#>                       type         r
#>                <character> <numeric>
#> AAACATACAACCAC       CD4 T  0.861083
#> AAACATTGAGCTAC           B  0.909358
#> AAACATTGATCAGC       CD4 T  0.861083
#> AAACCGTGCTTCCG  CD14+ Mono  0.914543
#> AAACCGTGTATGCG          NK  0.894090
#> AAACGCACTGGTAC       CD4 T  0.861083
#> AAACGCTGACCAGT          NK  0.825784
#> AAACGCTGGTTCTT          NK  0.825784
#> AAACGCTGTAGCCA       CD4 T  0.889149
#> AAACGCTGTTTCTG  CD16+ Mono  0.929491

8 Working directly with Seurat objects

clustifyr can also take a Seurat object as input and return a new Seurat object with the cell types added as a column in the meta.data.

so <- so_pbmc()
res <- clustify(
  input = so,                      # a Seurat object
  ref_mat = cbmc_ref,              # matrix of RNA-seq expression data for each cell type
  cluster_col = "seurat_clusters", # name of column in meta.data containing cell clusters
  obj_out = TRUE                   # output Seurat object with cell type inserted as "type" column
)

res@meta.data[1:10, c("type", "r")]
#>                      type         r
#> AAACATACAACCAC      CD4 T 0.8424452
#> AAACATTGAGCTAC          B 0.8984684
#> AAACATTGATCAGC      CD4 T 0.8424452
#> AAACCGTGCTTCCG CD14+ Mono 0.9319558
#> AAACCGTGTATGCG         NK 0.8816119
#> AAACGCACTGGTAC      CD4 T 0.8424452
#> AAACGCTGACCAGT         NK 0.8147040
#> AAACGCTGGTTCTT         NK 0.8147040
#> AAACGCTGTAGCCA      CD4 T 0.8736163
#> AAACGCTGTTTCTG CD16+ Mono 0.9321784

9 Building a reference matrix from single-cell expression matrices

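In its simplest form, a reference matrix is built by averaging the expression of a single-cell RNA-seq expression matrix by cluster (taking the median is also an option). Both log-transformed and raw count matrices are supported.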
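The averaging step can be sketched in a few lines of base R. The hypothetical helper below, average_by_cluster(), assumes that log-transformed input was produced with log1p (natural log of count plus one). It undoes the transform before averaging and reapplies it afterwards. This illustrates the general idea and may not match every detail of average_clusters(). All values are made up.

```r
# Hypothetical helper: average expression per cluster.
# If `if_log` is TRUE, values are assumed to be log1p-transformed, so they are
# converted back with expm1() before averaging and re-logged afterwards.
average_by_cluster <- function(mat, clusters, if_log = TRUE) {
  groups <- split(seq_len(ncol(mat)), clusters)
  sapply(groups, function(idx) {
    sub <- mat[, idx, drop = FALSE]
    if (if_log) log1p(rowMeans(expm1(sub))) else rowMeans(sub)
  })
}

# Toy counts: 2 genes x 4 cells, two clusters (values are made up)
counts <- matrix(
  c(1, 3, 10, 20,
    0, 2,  4,  8),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("g1", "g2"), paste0("c", 1:4))
)
clusters <- c("A", "A", "B", "B")

# Log-transformed input, as produced by many normalization workflows
log_mat <- log1p(counts)

toy_ref <- average_by_cluster(log_mat, clusters, if_log = TRUE)
toy_ref
```

Averaging on the original scale rather than the log scale keeps the result comparable to a bulk-style mean. Averaging the log values directly would instead give the log of a geometric-like mean.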

new_ref_matrix <- average_clusters(
  mat = pbmc_matrix,
  metadata = pbmc_meta$classified, # or use metadata = pbmc_meta, cluster_col = "classified"
  if_log = TRUE                    # whether the expression matrix is already log transformed
)

head(new_ref_matrix)
#>                 B CD14+ Mono      CD8 T         DC FCGR3A+ Mono
#> PPBP   0.09375021 0.28763857 0.35662599 0.06527347    0.2442300
#> LYZ    1.42699419 5.21550849 1.35146753 4.84714962    3.4034309
#> S100A9 0.62123058 4.91453355 0.58823794 2.53310734    2.6277996
#> IGLL5  2.44576997 0.02434753 0.03284986 0.10986617    0.2581198
#> GNLY   0.37877736 0.53592906 2.53161887 0.46959958    0.2903092
#> FTL    3.66698837 5.86217774 3.37056910 4.21848878    5.9518479
#>        Memory CD4 T Naive CD4 T         NK  Platelet
#> PPBP     0.06494743  0.04883837 0.00000000 6.0941782
#> LYZ      1.39466552  1.40165143 1.32701580 2.5303912
#> S100A9   0.58080250  0.55679700 0.52098541 1.6775692
#> IGLL5    0.04826212  0.03116080 0.05247669 0.2501642
#> GNLY     0.41001072  0.46041901 4.70481754 0.3845813
#> FTL      3.31062958  3.35611600 3.38471536 4.5508242

# For further convenience, shortcut functions are provided for generating a reference matrix from a `SingleCellExperiment` or `Seurat` object.
new_ref_matrix_sce <- object_ref(
  input = sce,                     # SCE object
  cluster_col = "clusters"         # name of column in colData containing cell identities
)

new_ref_matrix_so <- seurat_ref(
  seurat_object = so,              # Seurat object
  cluster_col = "seurat_clusters"  # name of column in meta.data containing cell identities
)

tail(new_ref_matrix_so)
#>                  0          1           2          3          4
#> RHOC   0.245754269 0.40431050 0.590053057 0.37702525 0.86466156
#> CISH   0.492444272 0.54773003 0.079843557 0.08962348 0.66024943
#> CD27   1.195370020 1.28719850 0.100562312 0.54487892 1.28322681
#> LILRA3 0.004576215 0.03686387 0.544743180 0.00000000 0.03409087
#> CLIC2  0.007570624 0.00000000 0.021200958 0.00000000 0.00000000
#> HEMGN  0.034099324 0.04619359 0.006467157 0.00000000 0.00000000
#>                 5         6          7        8
#> RHOC   2.20162898 1.6518448 0.72661911 1.465067
#> CISH   0.10588034 0.1339815 0.18231747 0.000000
#> CD27   0.09640885 0.1600911 0.04639912 0.000000
#> LILRA3 1.35418074 0.0000000 0.00000000 0.000000
#> CLIC2  0.00000000 0.0000000 0.46542832 0.000000
#> HEMGN  0.00000000 0.0000000 0.00000000 1.079083

For more tutorials, see: https://rnabioco.github.io/clustifyrdata/articles/otherformats.html

10 sessionInfo()

#> R version 4.3.3 (2024-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SingleCellExperiment_1.24.0 SummarizedExperiment_1.32.0
#>  [3] Biobase_2.62.0              GenomicRanges_1.54.1       
#>  [5] GenomeInfoDb_1.38.8         IRanges_2.36.0             
#>  [7] S4Vectors_0.40.2            BiocGenerics_0.48.1        
#>  [9] MatrixGenerics_1.14.0       matrixStats_1.3.0          
#> [11] cowplot_1.1.3               ggplot2_3.5.0              
#> [13] clustifyr_1.15.4            BiocStyle_2.30.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] bitops_1.0-7            rlang_1.1.3            
#>  [3] magrittr_2.0.3          clue_0.3-65            
#>  [5] GetoptLong_1.0.5        compiler_4.3.3         
#>  [7] png_0.1-8               systemfonts_1.0.6      
#>  [9] vctrs_0.6.5             shape_1.4.6.1          
#> [11] pkgconfig_2.0.3         crayon_1.5.2           
#> [13] fastmap_1.1.1           XVector_0.42.0         
#> [15] labeling_0.4.3          utf8_1.2.4             
#> [17] rmarkdown_2.26          ragg_1.3.0             
#> [19] purrr_1.0.2             xfun_0.43              
#> [21] zlibbioc_1.48.2         cachem_1.0.8           
#> [23] jsonlite_1.8.8          highr_0.10             
#> [25] DelayedArray_0.28.0     BiocParallel_1.36.0    
#> [27] cluster_2.1.6           parallel_4.3.3         
#> [29] R6_2.5.1                bslib_0.7.0            
#> [31] RColorBrewer_1.1-3      parallelly_1.37.1      
#> [33] jquerylib_0.1.4         Rcpp_1.0.12            
#> [35] bookdown_0.39           iterators_1.0.14       
#> [37] knitr_1.46              future.apply_1.11.2    
#> [39] Matrix_1.6-5            tidyselect_1.2.1       
#> [41] abind_1.4-5             yaml_2.3.8             
#> [43] doParallel_1.0.17       codetools_0.2-20       
#> [45] listenv_0.9.1           lattice_0.22-6         
#> [47] tibble_3.2.1            withr_3.0.0            
#> [49] evaluate_0.23           future_1.33.2          
#> [51] desc_1.4.3              circlize_0.4.16        
#> [53] pillar_1.9.0            BiocManager_1.30.22    
#> [55] foreach_1.5.2           generics_0.1.3         
#> [57] sp_2.1-3                RCurl_1.98-1.14        
#> [59] munsell_0.5.1           scales_1.3.0           
#> [61] globals_0.16.3          glue_1.7.0             
#> [63] tools_4.3.3             data.table_1.15.4      
#> [65] fgsea_1.28.0            fs_1.6.3               
#> [67] dotCall64_1.1-1         fastmatch_1.1-4        
#> [69] grid_4.3.3              tidyr_1.3.1            
#> [71] colorspace_2.1-0        GenomeInfoDbData_1.2.11
#> [73] cli_3.6.2               textshaping_0.3.7      
#> [75] spam_2.10-0             fansi_1.0.6            
#> [77] S4Arrays_1.2.1          ComplexHeatmap_2.18.0  
#> [79] dplyr_1.1.4             gtable_0.3.4           
#> [81] sass_0.4.9              digest_0.6.35          
#> [83] progressr_0.14.0        SparseArray_1.2.4      
#> [85] farver_2.1.1            rjson_0.2.21           
#> [87] htmlwidgets_1.6.4       SeuratObject_5.0.1     
#> [89] memoise_2.0.1           entropy_1.3.1          
#> [91] htmltools_0.5.8.1       pkgdown_2.0.9          
#> [93] lifecycle_1.0.4         httr_1.4.7             
#> [95] GlobalOptions_0.1.2

OK, that's all for today. See you tomorrow~
