ncborcherding / scRepertoire

A toolkit for single-cell immune profiling
https://www.borch.dev/uploads/screpertoire/
MIT License
301 stars 49 forks source link

Conflicting diversity estimates when splitting using clonalDiversity() #123

Closed leeanapeters closed 2 years ago

leeanapeters commented 2 years ago

Hi Nick, thank you for the awesome package!

I was hoping you could help me understand why I would see differing diversity estimates when using objects subsetted by clinical status versus a comprehensive object and split.by in clonalDiversity().

Is this a function of accounting for total # of clones regardless of the variable in split.by with the whole object, versus using total number of clones/clinical status in the subsetted object? Or am I making a mistake here? The difference in the CTL group is pretty marked with the different methods. (top pic is CTL subset, middle T1D subset, bottom using split.by of total object)

redo for screp control div redo for screp t1d div redo total clones screp

Thanks!

ncborcherding commented 2 years ago

Hey Leeana,

You're right - that looks fishy (especially the low variability of faceted data). Would you mind giving me your sessionInfo() so I can do some investigating?

Thanks, Nick

leeanapeters commented 2 years ago

Sure, thanks for the quick reply!

sessionInfo() R version 4.1.1 (2021-08-10) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages: [1] stats4 parallel stats graphics grDevices [6] utils datasets methods base

other attached packages: [1] scater_1.20.1
[2] scuttle_1.2.1
[3] SingleCellExperiment_1.14.1 [4] SummarizedExperiment_1.22.0 [5] Biobase_2.52.0
[6] GenomicRanges_1.44.0
[7] MatrixGenerics_1.4.3
[8] matrixStats_0.61.0
[9] edgeR_3.34.1
[10] limma_3.48.3
[11] scales_1.1.1
[12] circlize_0.4.13
[13] writexl_1.4.0
[14] purrr_0.3.4
[15] dplyr_1.0.7
[16] plyr_1.8.6
[17] readxl_1.3.1
[18] Biostrings_2.60.2
[19] GenomeInfoDb_1.28.4
[20] XVector_0.32.0
[21] IRanges_2.26.0
[22] S4Vectors_0.30.2
[23] BiocGenerics_0.38.0
[24] scRepertoire_1.3.5
[25] ggplot2_3.3.5
[26] Azimuth_0.4.3
[27] SeuratObject_4.0.2
[28] Seurat_4.0.4
[29] shinyBS_0.61

loaded via a namespace (and not attached): [1] utf8_1.2.2 shinydashboard_0.7.2
[3] reticulate_1.22 tidyselect_1.1.1
[5] htmlwidgets_1.5.4 grid_4.1.1
[7] BiocParallel_1.26.2 Rtsne_0.15
[9] ScaledMatrix_1.0.0 munsell_0.5.0
[11] codetools_0.2-18 ica_1.0-2
[13] statmod_1.4.36 DT_0.19
[15] future_1.22.1 miniUI_0.1.1.1
[17] withr_2.4.2 colorspace_2.0-2
[19] ggalluvial_0.12.3 rstudioapi_0.13
[21] ROCR_1.0-11 tensor_1.5
[23] listenv_0.8.0 labeling_0.4.2
[25] GenomeInfoDbData_1.2.6 polyclip_1.10-0
[27] bit64_4.0.5 farver_2.1.0
[29] pheatmap_1.0.12 parallelly_1.28.1
[31] vctrs_0.3.8 generics_0.1.0
[33] R6_2.5.1 doParallel_1.0.16
[35] ggbeeswarm_0.6.0 rsvd_1.0.5
[37] VGAM_1.1-5 locfit_1.5-9.4
[39] hdf5r_1.3.4 bitops_1.0-7
[41] spatstat.utils_2.2-0 DelayedArray_0.18.0
[43] assertthat_0.2.1 promises_1.2.0.1
[45] googlesheets4_1.0.0 beeswarm_0.4.0
[47] gtable_0.3.0 beachmat_2.8.1
[49] globals_0.14.0 goftest_1.2-3
[51] rlang_0.4.11 GlobalOptions_0.1.2
[53] splines_4.1.1 lazyeval_0.2.2
[55] gargle_1.2.0 spatstat.geom_2.2-2
[57] BiocManager_1.30.16 reshape2_1.4.4
[59] abind_1.4-5 httpuv_1.6.3
[61] SeuratDisk_0.0.0.9019 tools_4.1.1
[63] cubature_2.0.4.2 ellipsis_0.3.2
[65] spatstat.core_2.3-0 RColorBrewer_1.1-2
[67] ggridges_0.5.3 Rcpp_1.0.7
[69] sparseMatrixStats_1.4.2 zlibbioc_1.38.0
[71] RCurl_1.98-1.5 rpart_4.1-15
[73] deldir_1.0-2 viridis_0.6.1
[75] pbapply_1.5-0 cowplot_1.1.1
[77] zoo_1.8-9 ggrepel_0.9.1
[79] cluster_2.1.2 fs_1.5.0
[81] magrittr_2.0.1 data.table_1.14.2
[83] scattermore_0.7 SparseM_1.81
[85] lmtest_0.9-38 RANN_2.6.1
[87] googledrive_2.0.0 truncdist_1.0-2
[89] fitdistrplus_1.1-6 gsl_2.1-7
[91] patchwork_1.1.1 shinyjs_2.0.0
[93] mime_0.12 xtable_1.8-4
[95] gridExtra_2.3 shape_1.4.6
[97] compiler_4.1.1 tibble_3.1.5
[99] KernSmooth_2.23-20 crayon_1.4.1
[101] htmltools_0.5.2 mgcv_1.8-36
[103] later_1.3.0 tidyr_1.1.4
[105] powerTCR_1.12.0 DBI_1.1.1
[107] MASS_7.3-54 Matrix_1.3-4
[109] permute_0.9-5 cli_3.0.1
[111] evd_2.3-3 igraph_1.2.6
[113] pkgconfig_2.0.3 plotly_4.10.0
[115] spatstat.sparse_2.0-0 foreach_1.5.1
[117] vipor_0.4.5 stringdist_0.9.8
[119] stringr_1.4.0 digest_0.6.28
[121] sctransform_0.3.2 RcppAnnoy_0.0.19
[123] vegan_2.5-7 spatstat.data_2.1-0
[125] cellranger_1.1.0 leiden_0.3.9
[127] uwot_0.1.10 DelayedMatrixStats_1.14.3 [129] evmix_2.12 shiny_1.7.1
[131] lifecycle_1.0.1 nlme_3.1-152
[133] jsonlite_1.7.2 BiocNeighbors_1.10.0
[135] viridisLite_0.4.0 fansi_0.5.0
[137] pillar_1.6.3 lattice_0.20-44
[139] fastmap_1.1.0 httr_1.4.2
[141] survival_3.2-11 glue_1.4.2
[143] png_0.1-7 iterators_1.0.13
[145] bit_4.0.4 presto_1.0.0
[147] stringi_1.7.5 BiocSingular_1.8.1
[149] irlba_2.3.3 future.apply_1.8.1

ncborcherding commented 2 years ago

Hey Leeana,

Thanks for the information - I was able to work out what I think is happening (best guess of course)

clonalDiversity() automatically does downsampling to avoid any bias in immune repertoire pools based on size. What this actually does is identify the smallest pool and then randomly samples all groups to that size (this does this 100 times). So in fact we would expect a difference if we are comparing a different set of pools - where pool set 1 is a subset of just T1D and pool 2 is both CTL and T1D cells - the calculations will ultimately based on the smallest TCR pool in each set.

This gets to a second point and why I think the data looks a little weird. I would check the number of clones in each of your groups (CD4 Naive, CD4 proliferating, etc) as I think there are cell populations with very low clonotypes and that is throwing off the estimate (my money is on NK cell clusters based on experience).

Let me know if that helps at all - I will leave the issue open for now. I also just commited a quick update to drop the random outlier black dots for the boxplot and make all the metrics into a single line so it looks better.

Thanks, Nick

ncborcherding commented 2 years ago

Hey Leeana,

I am going to close this, but feel free to reach out below or via email for follow-up.

Thanks, Nick