ncborcherding / scRepertoire

A toolkit for single-cell immune profiling
https://www.borch.dev/uploads/screpertoire/
MIT License
311 stars 54 forks source link

The clonalCluster() function cannot generate clusters. #433

Closed HaiqiaoSun closed 3 weeks ago

HaiqiaoSun commented 4 weeks ago

My combined <- combineBCR(csv1, samples = c("sp_exp"), ID = c(NULL)) successfully created Cluster in the CTstrict column with the default threshold of 0.85, which seems perfectly correct.

combined[["sp_exp"]][["CTstrict"]] [1] "IGH.1376.IGHV3-43_IGLC:Cluster.12.IGKV2-24" "IGH.1376.IGHV3-43_IGLC:Cluster.12.IGKV2-24"
[3] "IGH.1505.IGHV1-18_IGLC:Cluster.12.IGKV2-24" "IGH.1741.IGHV3-64D_IGLC:Cluster.12.IGKV2-24"

However, when using the clonalCluster function as follows:

igraph.object <- clonalCluster(combined, chain = "IGH", sequence = "aa", threshold = 0.85, samples = NULL, exportGraph = TRUE) it returns an error: clonalCluster(combined, chain = "IGH", sequence = "aa", threshold = 0.85, : No clusters detected with current parameters.

Do you have any ideas about this? Thank you.

ncborcherding commented 4 weeks ago

Hey @HaiqiaoSun

Thanks for reaching out.

Please make a reproducible example for this issue so I know what you are experiencing.

Also with the example, please make sure to include the output of sessionInfo().

Thanks a lot, Nick

HaiqiaoSun commented 4 weeks ago

Thank you for your prompt response. Here is the data and example I used.

https://gist.github.com/HaiqiaoSun/698bc2d3e14950feb085fa19a6d56077

> library(scRepertoire)
> csv1 <- read.csv("E:/R/filtered_contig_annotations.csv", stringsAsFactors = F)
> combined <- combineBCR(csv1, samples = c("sp_exp"),
+                        ID = c(NULL))
> head(combined[[1]])
                      barcode sample                          IGH                  cdr3_aa1
1 sp_exp_ATCCACGCCGCAGTCCAGTA sp_exp  IGHV3-33.IGHD3-3.IGHJ6.IGHM CARDRQVLRFLEWLLPYYYYGMDVW
2 sp_exp_AGTATCTAACCACTTGTAAG sp_exp        IGHV3-7.NA.IGHJ4.IGHM               CARDGGDDGYW
3 sp_exp_AACTGGCGAGCAATAACACT sp_exp IGHV3-21.IGHD3-10.IGHJ4.IGHM      CARIPPVTMVRGVIKFFDYW
4 sp_exp_GACAGCGAATGGAGTTAAGG sp_exp       IGHV3-53.NA.IGHJ4.IGHM               CARDSYSSHYW
5 sp_exp_TTGTCGAAGGCTCATACGCT sp_exp       IGHV4-34.NA.IGHJ6.IGHM       CASGIKQQLAYYYYGMDVW
6 sp_exp_TAGCGTTCGTAACGATTCAA sp_exp                         <NA>                      <NA>
                                                                     cdr3_nt1                IGLC    cdr3_aa2
1 TGTGCGAGAGATCGTCAAGTATTACGATTTTTGGAGTGGTTATTACCCTACTACTACTACGGTATGGACGTCTGG IGKV2-24.IGKJ2.IGKC CTQATQSPYTF
2                                           TGTGCGAGAGATGGCGGTGACGACGGGTACTGG IGKV2-24.IGKJ2.IGKC CTQATQFPYTF
3                TGTGCGAGAATACCTCCGGTTACTATGGTTCGGGGAGTTATTAAATTCTTTGACTACTGG IGKV2-24.IGKJ2.IGKC CTQATQFPYTF
4                                           TGTGCGAGAGATTCGTATAGCAGCCATTACTGG IGKV2-24.IGKJ1.IGKC CTQATQFPWTF
5                   TGTGCGAGTGGAATTAAGCAGCAGCTGGCCTACTACTACTACGGTATGGACGTCTGG IGKV2-24.IGKJ4.IGKC CTQATQFPHTF
6                                                                        <NA> IGKV2-24.IGKJ5.IGKC  CTQATQFPHF
                           cdr3_nt2                                           CTgene
1 TGCACGCAAGCTACACAATCCCCGTACACTTTT  IGHV3-33.IGHD3-3.IGHJ6.IGHM_IGKV2-24.IGKJ2.IGKC
2 TGCACGCAAGCTACACAATTTCCGTACACTTTT        IGHV3-7.NA.IGHJ4.IGHM_IGKV2-24.IGKJ2.IGKC
3 TGCACGCAAGCTACACAATTTCCGTACACTTTT IGHV3-21.IGHD3-10.IGHJ4.IGHM_IGKV2-24.IGKJ2.IGKC
4 TGCACGCAAGCTACACAATTTCCGTGGACGTTC       IGHV3-53.NA.IGHJ4.IGHM_IGKV2-24.IGKJ1.IGKC
5 TGCACGCAAGCTACACAATTTCCTCACACTTTC       IGHV4-34.NA.IGHJ6.IGHM_IGKV2-24.IGKJ4.IGKC
6    TGCACGCAAGCTACACAATTTCCTCACTTC                           NA_IGKV2-24.IGKJ5.IGKC
                                                                                                           CTnt
1 TGTGCGAGAGATCGTCAAGTATTACGATTTTTGGAGTGGTTATTACCCTACTACTACTACGGTATGGACGTCTGG_TGCACGCAAGCTACACAATCCCCGTACACTTTT
2                                           TGTGCGAGAGATGGCGGTGACGACGGGTACTGG_TGCACGCAAGCTACACAATTTCCGTACACTTTT
3                TGTGCGAGAATACCTCCGGTTACTATGGTTCGGGGAGTTATTAAATTCTTTGACTACTGG_TGCACGCAAGCTACACAATTTCCGTACACTTTT
4                                           TGTGCGAGAGATTCGTATAGCAGCCATTACTGG_TGCACGCAAGCTACACAATTTCCGTGGACGTTC
5                   TGTGCGAGTGGAATTAAGCAGCAGCTGGCCTACTACTACTACGGTATGGACGTCTGG_TGCACGCAAGCTACACAATTTCCTCACACTTTC
6                                                                             NA_TGCACGCAAGCTACACAATTTCCTCACTTC
                                   CTaa                                         CTstrict
1 CARDRQVLRFLEWLLPYYYYGMDVW_CTQATQSPYTF IGH:Cluster.32.IGHV3-33_IGLC:Cluster.14.IGKV2-24
2               CARDGGDDGYW_CTQATQFPYTF         IGH.403.IGHV3-7_IGLC:Cluster.14.IGKV2-24
3      CARIPPVTMVRGVIKFFDYW_CTQATQFPYTF         IGH.33.IGHV3-21_IGLC:Cluster.14.IGKV2-24
4               CARDSYSSHYW_CTQATQFPWTF       IGH.1202.IGHV3-53_IGLC:Cluster.14.IGKV2-24
5       CASGIKQQLAYYYYGMDVW_CTQATQFPHTF IGH:Cluster.56.IGHV4-34_IGLC:Cluster.14.IGKV2-24
6                         NA_CTQATQFPHF                   NA.NA_IGLC:Cluster.14.IGKV2-24
> #igraph
> igraph.object <- clonalCluster(combined,
+                                chain = "IGH", 
+                                sequence = "aa",
+                                threshold = 0.85,
+                                samples = NULL, 
+                                exportGraph = TRUE)
错误于clonalCluster(combined, chain = "IGH", sequence = "aa", threshold = 0.85, : 
  No clusters detected with current parameters.`
HaiqiaoSun commented 4 weeks ago

sessionInfo() R version 4.4.1 (2024-06-14 ucrt) Platform: x86_64-w64-mingw32/x64 Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale: [1] LC_COLLATE=Chinese (Simplified)_China.utf8 LC_CTYPE=Chinese (Simplified)_China.utf8
[3] LC_MONETARY=Chinese (Simplified)_China.utf8 LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_China.utf8

time zone: Asia/Shanghai tzcode source: internal

attached base packages: [1] grid stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] ggraph_2.2.1 scater_1.32.1 scuttle_1.14.0 ggsci_3.2.0
[5] scRepertoire_2.0.7 plot1cell_0.0.0.9000 ComplexHeatmap_2.20.0 wordcloud_2.6
[9] simplifyEnrichment_1.14.1 GEOquery_2.72.0 rlang_1.1.4 cowplot_1.1.3
[13] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.28.1 AnnotationFilter_1.28.0 GenomicFeatures_1.56.0
[17] AnnotationDbi_1.66.0 loomR_0.2.0 itertools_0.1-3 iterators_1.0.14
[21] R6_2.5.1 hdf5r_1.3.11 data.table_1.16.2 DoubletFinder_2.0.4
[25] ComplexUpset_1.3.3 purrr_1.0.2 ggbeeswarm_0.7.2 reshape2_1.4.4
[29] biomaRt_2.60.1 RColorBrewer_1.1-3 progress_1.2.3 scales_1.3.0
[33] MASS_7.3-61 ggh4x_0.2.8 circlize_0.4.16 plotly_4.10.4
[37] ggrastr_1.0.2 shiny_1.9.1 Matrix_1.7-1 SeuratWrappers_0.3.0
[41] msigdbr_7.5.1 SCPA_1.6.2 Nebulosa_1.14.0 ggplot2_3.5.1
[45] viridis_0.6.5 viridisLite_0.4.2 patchwork_1.3.0 Seurat_5.1.0
[49] SeuratObject_5.0.2 sp_2.1-4 dplyr_1.1.4 monocle3_1.2.9
[53] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0 GenomicRanges_1.56.2 GenomeInfoDb_1.40.1
[57] IRanges_2.38.1 S4Vectors_0.42.1 MatrixGenerics_1.16.0 matrixStats_1.4.1
[61] Biobase_2.64.0 BiocGenerics_0.50.0

loaded via a namespace (and not attached): [1] igraph_2.1.1 hash_2.2.6.3 ica_1.0-3 Formula_1.2-5
[5] zlibbioc_1.50.0 tidyselect_1.2.1 bit_4.5.0 doParallel_1.0.17
[9] clue_0.3-65 tm_0.7-14 lattice_0.22-6 rjson_0.2.23
[13] evmix_2.12 blob_1.2.4 stringr_1.5.1 S4Arrays_1.4.1
[17] parallel_4.4.1 png_0.1-8 cli_3.6.3 ProtGenerics_1.36.0
[21] goftest_1.2-3 textshaping_0.4.0 BiocIO_1.14.0 BiocNeighbors_1.22.0
[25] stringdist_0.9.12 uwot_0.2.2 curl_5.2.3 mime_0.12
[29] evaluate_1.0.1 leiden_0.4.3.1 stringi_1.8.4 backports_1.5.0
[33] gsl_2.1-8 XML_3.99-0.17 httpuv_1.6.15 magrittr_2.0.3
[37] rappdirs_0.3.3 splines_4.4.1 mclust_6.1.1 org.Hs.eg.db_3.19.1
[41] sctransform_0.4.1 DBI_1.2.3 terra_1.7-83 jquerylib_0.1.4
[45] withr_3.0.2 systemfonts_1.1.0 lmtest_0.9-40 tidygraph_1.3.1
[49] rtracklayer_1.64.0 BiocManager_1.30.25 htmlwidgets_1.6.4 fs_1.6.4
[53] ggrepel_0.9.6 labeling_0.4.3 SparseArray_1.4.8 reticulate_1.39.0
[57] zoo_1.8-12 XVector_0.44.0 knitr_1.48 UCSC.utils_1.0.0
[61] foreach_1.5.2 fansi_1.0.6 quantreg_5.99 R.oo_1.26.0
[65] RSpectra_0.16-2 irlba_2.3.5.1 fastDummies_1.7.4 lazyeval_0.2.2
[69] yaml_2.3.10 survival_3.7-0 scattermore_1.2 crayon_1.5.3
[73] RcppAnnoy_0.0.22 tidyr_1.3.1 progressr_0.14.0 tweenr_2.0.3
[77] later_1.3.2 ggridges_0.5.6 codetools_0.2-20 base64enc_0.1-3
[81] GlobalOptions_0.1.2 KEGGREST_1.44.1 Rtsne_0.17 shape_1.4.6.1
[85] limma_3.60.6 Rsamtools_2.20.0 filelock_1.0.3 foreign_0.8-87
[89] pkgconfig_2.0.3 xml2_1.3.6 spatstat.univar_3.0-1 GenomicAlignments_1.40.0 [93] iNEXT_3.0.1 evd_2.3-7.1 spatstat.sparse_3.1-0 xtable_1.8-4
[97] highr_0.11 plyr_1.8.9 httr_1.4.7 tools_4.4.1
[101] globals_0.16.3 beeswarm_0.4.0 htmlTable_2.4.3 checkmate_2.3.2
[105] nlme_3.1-166 dbplyr_2.5.0 MatrixModels_0.5-3 assertthat_0.2.1
[109] lme4_1.1-35.5 digest_0.6.37 farver_2.1.2 tzdb_0.4.0
[113] ks_1.14.3 yulab.utils_0.1.7 rpart_4.1.23 glue_1.8.0
[117] cachem_1.1.0 BiocFileCache_2.12.0 nbpMatching_1.5.6 polyclip_1.10-7
[121] Hmisc_5.1-3 generics_0.1.3 proxyC_0.4.1 Biostrings_2.72.1
[125] ggalluvial_0.12.5 mvtnorm_1.3-1 parallelly_1.38.0 multicross_2.1.0
[129] statmod_1.5.0 RcppHNSW_0.6.0 ragg_1.3.3 crossmatch_1.4-0
[133] ScaledMatrix_1.12.0 minqa_1.2.8 pbapply_1.7-2 httr2_1.0.5
[137] spam_2.11-0 utf8_1.2.4 graphlayouts_1.2.0 gridExtra_2.3
[141] GenomeInfoDbData_1.2.12 R.utils_2.12.3 RCurl_1.98-1.16 memoise_2.0.1
[145] rmarkdown_2.28 R.methodsS3_1.8.2 future_1.34.0 RANN_2.6.2
[149] Cairo_1.6-2 spatstat.data_3.1-2 rstudioapi_0.17.1 cluster_2.1.6
[153] spatstat.utils_3.1-0 hms_1.1.3 fitdistrplus_1.2-1 munsell_0.5.1
[157] colorspace_2.1-1 ggdendro_0.2.0 DelayedMatrixStats_1.26.0 sparseMatrixStats_1.16.0 [161] truncdist_1.0-2 dotCall64_1.2 ggforce_0.4.2 xfun_0.48
[165] remotes_2.5.0 abind_1.4-8 GOSemSim_2.30.2 tibble_3.2.1
[169] readr_2.1.5 bitops_1.0-9 promises_1.3.0 RSQLite_2.3.7
[173] leidenbase_0.1.31 DelayedArray_0.30.1 proxy_0.4-27 GO.db_3.19.1
[177] compiler_4.4.1 prettyunits_1.2.0 boot_1.3-31 beachmat_2.20.0
[181] SparseM_1.84-2 listenv_0.9.1 Rcpp_1.0.13 BiocSingular_1.20.0
[185] tensor_1.5 BiocParallel_1.38.0 cubature_2.1.1 babelgene_22.9
[189] spatstat.random_3.3-2 fastmap_1.2.0 vipor_0.4.7 ROCR_1.0-11
[193] rsvd_1.0.5 nnet_7.3-19 gtable_0.3.6 KernSmooth_2.23-24
[197] miniUI_0.1.1.1 deldir_2.0-4 htmltools_0.5.8.1 bit64_4.5.2
[201] spatstat.explore_3.3-3 lifecycle_1.0.4 NLP_0.3-0 nloptr_2.1.1
[205] restfulr_0.0.15 sass_0.4.9 vctrs_0.6.5 isoband_0.2.7
[209] rsconnect_1.3.2 VGAM_1.1-12 slam_0.1-54 spatstat.geom_3.3-3
[213] future.apply_1.11.3 pracma_2.4.4 bslib_0.8.0 pillar_1.9.0
[217] jsonlite_1.8.9 GetoptLong_1.0.5

HaiqiaoSun commented 3 weeks ago

I found that setting the exportGraph parameter of the clonalCluster function to F allows the function to run and generate the igraph.object, but it does not produce the clusters.

To generate an igraph similar to the one which Nick did in https://www.bioconductor.org/packages/devel/bioc/vignettes/scRepertoire/inst/doc/vignette.html#1_Introduction, I tried using the cluster generated by combineBCR function to draw the clone connection plot, which produces a similar effect, though not as visually appealing as Nick's example. P10 plot git Below is the code I used. Note that my dataset does not include multiple groups, and I did not adjust the node size based on the cluster size.

#igraph BCR connection plot

library(scRepertoire)
library(igraph)
library(ggraph)
library(tidyverse)

# Load data
csv1 <- read.csv("E:/R/filtered_contig_annotations.csv", stringsAsFactors = F)
combined <- combineBCR(csv1, samples = c("sp_exp"), ID = NULL,
                       threshold = 0.95)

# Create clonal graph object
igraph.object <- clonalCluster(combined,
                               chain = "IGH", 
                               sequence = "aa",
                               threshold = 0.95,
                               samples = NULL, 
                               exportGraph = F)

# Convert graph object to a dataframe
df <- as.data.frame(igraph.object)
if (!is.data.frame(df)) {
  df <- as.data.frame(df)
}

# Extract IGH:Cluster part from sp_exp.CTstrict column to define each node's cluster
df <- df %>%
  dplyr::mutate(IGH_cluster = sapply(strsplit(as.character(sp_exp.CTstrict), "_"), function(x) x[1]))

# Filter rows containing IGH cluster information and remove NA values
df_igh <- df %>% dplyr::filter(!is.na(IGH_cluster))

# Keep only clusters with at least 3 nodes
df_filtered <- df_igh %>%
  group_by(IGH_cluster) %>%
  dplyr::filter(n() >= 3) %>%  # Keep only clusters with at least 3 nodes
  ungroup()

# Create an edge list, connecting samples within the same cluster
edge_list <- df_filtered %>% 
  group_by(IGH_cluster) %>% 
  summarise(pairs = list(combn(sp_exp.barcode, 2, simplify = FALSE))) %>% 
  unnest_longer(pairs) %>% 
  mutate(from = sapply(pairs, `[`, 1),
         to = sapply(pairs, `[`, 2)) %>% 
  dplyr::select(from, to)

# Remove possible NA values
edge_list <- edge_list %>% dplyr::filter(!is.na(from) & !is.na(to))

# Create graph object, only keep nodes that are in the filtered edge list
graph <- graph_from_data_frame(edge_list, directed = TRUE)

# Update graph node attributes
V(graph)$cluster <- df_filtered$IGH_cluster[match(V(graph)$name, df_filtered$sp_exp.barcode)]
V(graph)$size <- 1.5  # Node size

# Assign a unique color to each cluster
library(viridis)
unique_clusters <- unique(V(graph)$cluster)
color_map <- setNames(viridis(length(unique_clusters)), unique_clusters)
col_samples <- color_map[V(graph)$cluster]

# Use MDS layout for graph plotting
layout <- layout_with_mds(graph)

plot(
  graph,
  layout = layout,
  vertex.size = V(graph)$size,
  vertex.label = NA,
  edge.arrow.size = 0.1,
  vertex.color = col_samples,
  vertex.frame.width = 0.2
)
ncborcherding commented 3 weeks ago

@HaiqiaoSun

Thank you so much for the excellent followup and rundown of the problem. I was able to find the issue - evaluation statement during exportGraph = TRUE assumed that there would be more than 1 sample - so feeding in a single sequencing run broke it.

We now combine the entire graph before we evaluate: https://github.com/ncborcherding/scRepertoire/blob/f9a4a123dce119158f3073962dedd5daf703b9ed/R/clonalCluster.R#L134

Here is a rundown of the fixed function:

BCR <- read.csv("https://www.borch.dev/uploads/contigs/b_contigs.csv")
combined <- combineBCR(BCR, 
                       samples = "Patient1", 
                       threshold = 0.85)

igraph.object <- clonalCluster(combined,
                               chain = "IGH", 
                               sequence = "aa",
                               threshold = 0.85,
                               samples = NULL, 
                               exportGraph = TRUE)
plot(igraph.object)
Screenshot 2024-10-31 at 2 40 54 PM

Fix is up in the dev branch - @Qile0317 and I are working on some minor changes before we get this version shipped and pushed to bioconductor/main branch of github

Thanks again, Nick

HaiqiaoSun commented 3 weeks ago

@ncborcherding Thank you for your response and for the continuous updates to scRepertoire. The updated clonalCluster function runs perfectly on my single sequencing dataset. Not only does it facilitate research on T cells, but scRepertoire also provides convenience for B cell researchers like me. Thanks again.

Haiqiao