Closed samgest closed 9 months ago
Hi, are you using the latest version of the software? Please send us the output of sessionInfo().
On Tue, Dec 12, 2023, 04:45 samgest @.***> wrote:
Hi,
I am performing an analysis with several published datasets of kidney tumor (that is, raw UMI counts coming from public repositories) and I wanted to integrate them, but I'm having some issues with overcorrection.
I have 68 tumor samples coming from 68 different patients and, for some of them, I also have a sample coming from the surrounding healthy tissue (normal). In total, 68 tumor + 19 normal = 87 scRNA-seq expression matrices (coming from 9 different datasets). I want to integrate all of this data together and remove the batch effect (sample-wise) and the dataset bias (i.e., the bias that arises from using different datasets), but not the tumor-normal differences.
I tried to integrate with RunHarmony as:
dataMerged.ref <- dataMerged.ref %>% RunHarmony(group.by.vars = c("dataset_id", "sample_id"), plot_convergence = TRUE)
But the results are quite overcorrected (Fig. 1). Although tumor and normal tissues should share some cell types (such as macrophages, lymphocytes, etc.), the bulk of the cells should be different.
Rplot05.png (view on web) https://github.com/immunogenomics/harmony/assets/150608196/58cf9a4b-d95a-4a4d-aee4-a6f6b1bd7ed3
Is there any way I could integrate tumor and normal data to remove the batch-effect without removing the intrinsic differences between sample types? Perhaps I should integrate them separately (tumor vs normal) and then integrate the Harmony embeddings with Seurat v5's IntegrateEmbeddings function?
I'm quite new to scRNA-seq analysis, so any comment or suggestion is more than appreciated. Thanks in advance.
There you go:
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.1.2
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Madrid
tzcode source: internal
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] copykat_1.1.0 cutoff.scATOMIC_0.1.0 agrmt_1.42.12 Rmagic_2.0.3
[5] caret_6.0-94 lattice_0.22-5 randomForest_4.7-1.1 plyr_1.8.9
[9] scATOMIC_2.0.2 clustree_0.5.1 ggraph_2.1.0 data.table_1.14.10
[13] reshape2_1.4.4 DESeq2_1.42.0 GSVA_1.50.0 BaseSet_0.9.0
[17] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.26.0 AnnotationFilter_1.26.0 GenomicFeatures_1.54.1
[21] reticulate_1.34.0 GSEABase_1.64.0 graph_1.80.0 annotate_1.80.0
[25] XML_3.99-0.16 AnnotationDbi_1.64.1 HGNChelper_0.8.1 openxlsx_4.2.5.2
[29] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 purrr_1.0.2
[33] readr_2.1.4 tidyr_1.3.0 ggplot2_3.4.4 tidyverse_2.0.0
[37] tibble_3.2.1 dplyr_1.1.4 patchwork_1.1.3 pheatmap_1.0.12
[41] SingleR_2.4.0 celldex_1.12.0 SummarizedExperiment_1.32.0 Biobase_2.62.0
[45] GenomicRanges_1.54.1 GenomeInfoDb_1.38.1 IRanges_2.36.0 S4Vectors_0.40.2
[49] BiocGenerics_0.48.1 MatrixGenerics_1.14.0 matrixStats_1.1.0 rhdf5_2.46.1
[53] Matrix_1.6-4 harmony_1.2.0 Rcpp_1.0.11 Seurat_5.0.1
[57] SeuratObject_5.0.1 sp_2.1-2
loaded via a namespace (and not attached):
[1] ProtGenerics_1.34.0 spatstat.sparse_3.0-3 bitops_1.0-7
[4] httr_1.4.7 RColorBrewer_1.1-3 tools_4.3.2
[7] sctransform_0.4.1 utf8_1.2.4 R6_2.5.1
[10] HDF5Array_1.30.0 lazyeval_0.2.2 uwot_0.1.16
[13] rhdf5filters_1.14.1 withr_2.5.2 prettyunits_1.2.0
[16] gridExtra_2.3 progressr_0.14.0 cli_3.6.1
[19] spatstat.explore_3.2-5 fastDummies_1.7.3 spatstat.data_3.0-3
[22] ggridges_0.5.4 pbapply_1.7-2 Rsamtools_2.18.0
[25] parallelly_1.36.0 rstudioapi_0.15.0 RSQLite_2.3.4
[28] generics_0.1.3 BiocIO_1.12.0 ica_1.0-3
[31] spatstat.random_3.2-2 zip_2.3.0 fansi_1.0.6
[34] clipr_0.8.0 abind_1.4-5 lifecycle_1.0.4
[37] yaml_2.3.7 recipes_1.0.8 SparseArray_1.2.2
[40] BiocFileCache_2.10.1 Rtsne_0.17 grid_4.3.2
[43] blob_1.2.4 promises_1.2.1 ExperimentHub_2.10.0
[46] crayon_1.5.2 miniUI_0.1.1.1 beachmat_2.18.0
[49] cowplot_1.1.1 KEGGREST_1.42.0 pillar_1.9.0
[52] rjson_0.2.21 future.apply_1.11.0 codetools_0.2-19
[55] leiden_0.4.3.1 glue_1.6.2 vctrs_0.6.5
[58] png_0.1-8 spam_2.10-0 gtable_0.3.4
[61] cachem_1.0.8 gower_1.0.1 prodlim_2023.08.28
[64] S4Arrays_1.2.0 mime_0.12 tidygraph_1.2.3
[67] survival_3.5-7 timeDate_4022.108 SingleCellExperiment_1.24.0
[70] iterators_1.0.14 hardhat_1.3.0 lava_1.7.3
[73] interactiveDisplayBase_1.40.0 ellipsis_0.3.2 fitdistrplus_1.1-11
[76] ipred_0.9-14 ROCR_1.0-11 nlme_3.1-164
[79] bit64_4.0.5 progress_1.2.3 filelock_1.0.3
[82] RcppAnnoy_0.0.21 rprojroot_2.0.4 irlba_2.3.5.1
[85] rpart_4.1.23 KernSmooth_2.23-22 colorspace_2.1-0
[88] DBI_1.1.3 nnet_7.3-19 tidyselect_1.2.0
[91] bit_4.0.5 compiler_4.3.2 curl_5.2.0
[94] xml2_1.3.6 DelayedArray_0.28.0 plotly_4.10.3
[97] rtracklayer_1.62.0 scales_1.3.0 lmtest_0.9-40
[100] rappdirs_0.3.3 digest_0.6.33 goftest_1.2-3
[103] spatstat.utils_3.0-4 XVector_0.42.0 htmltools_0.5.7
[106] pkgconfig_2.0.3 sparseMatrixStats_1.14.0 dbplyr_2.4.0
[109] fastmap_1.1.1 rlang_1.1.2 htmlwidgets_1.6.4
[112] shiny_1.8.0 DelayedMatrixStats_1.24.0 farver_2.1.1
[115] zoo_1.8-12 jsonlite_1.8.8 BiocParallel_1.36.0
[118] ModelMetrics_1.2.2.2 BiocSingular_1.18.0 RCurl_1.98-1.13
[121] magrittr_2.0.3 GenomeInfoDbData_1.2.11 dotCall64_1.1-1
[124] Rhdf5lib_1.24.0 munsell_0.5.0 viridis_0.6.4
[127] pROC_1.18.5 stringi_1.8.2 zlibbioc_1.48.0
[130] MASS_7.3-60 AnnotationHub_3.10.0 listenv_0.9.0
[133] ggrepel_0.9.4 deldir_2.0-2 graphlayouts_1.0.2
[136] Biostrings_2.70.1 splines_4.3.2 tensor_1.5
[139] hms_1.1.3 locfit_1.5-9.8 igraph_1.5.1
[142] spatstat.geom_3.2-7 RcppHNSW_0.5.0 biomaRt_2.58.0
[145] ScaledMatrix_1.10.0 BiocVersion_3.18.1 BiocManager_1.30.22
[148] foreach_1.5.2 tweenr_2.0.2 tzdb_0.4.0
[151] httpuv_1.6.13 RANN_2.6.1 polyclip_1.10-6
[154] future_1.33.0 scattermore_1.2 ggforce_0.4.1
[157] rsvd_1.0.5 xtable_1.8-4 restfulr_0.0.15
[160] RSpectra_0.16-1 later_1.3.2 class_7.3-22
[163] viridisLite_0.4.2 memoise_2.0.1 GenomicAlignments_1.38.0
[166] cluster_2.1.6 timechange_0.2.0 globals_0.16.2
[169] here_1.0.1
Is there any way I could integrate tumor and normal data to remove the batch-effect without removing the intrinsic differences between sample types? Perhaps I should integrate them separately (tumor vs normal) and then integrate the Harmony embeddings with Seurat v5's IntegrateEmbeddings function?
Yes, that is correct. If you correct at the sample level, Harmony essentially corrects the latent dimensions for everything nested within that experimental design, and since each sample is either tumor or normal, that includes the tumor-normal difference. Performing the correction separately, as you suggest, would be the way to go.
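As a rough sketch of what integrating each tissue separately could look like (not tested on your data): the helper name `integrate_one_tissue` and the metadata column `tissue_type` are made up for illustration, and it assumes the standard Seurat preprocessing steps are appropriate for each subset.

```r
library(Seurat)
library(harmony)

# Hypothetical helper: subset one tissue type, redo the standard
# preprocessing on that subset, then run Harmony within it only.
# `tissue_type` is an assumed metadata column ("tumor" / "normal").
integrate_one_tissue <- function(obj, tissue) {
  keep <- colnames(obj)[obj$tissue_type == tissue]
  sub <- subset(obj, cells = keep)

  sub <- NormalizeData(sub)
  sub <- FindVariableFeatures(sub)
  sub <- ScaleData(sub)
  sub <- RunPCA(sub)

  # Correct dataset and sample effects, as in the original call
  RunHarmony(sub, group.by.vars = c("dataset_id", "sample_id"))
}
```

You would call it once per tissue, e.g. `tumor <- integrate_one_tissue(dataMerged.ref, "tumor")`, and the same with `"normal"`.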
You can, however, still investigate cell abundance within the tumor and within the normal kidney; that is fine to do this way.
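For the abundance part, something along these lines may be enough as a starting point. It is a toy example in base R: the `meta` data frame stands in for the per-cell metadata of your object, and the `cell_type` column (cluster or annotation labels) is an assumption.

```r
# Toy per-cell metadata; in practice this would come from the object's
# meta.data, with a cluster or cell-type column (assumed name: cell_type).
meta <- data.frame(
  sample_id   = c("T1", "T1", "T1", "T1", "N1", "N1", "N1", "N1"),
  tissue_type = c(rep("tumor", 4), rep("normal", 4)),
  cell_type   = c("malignant", "malignant", "T cell", "macrophage",
                  "epithelial", "epithelial", "T cell", "T cell")
)

# Number of cells per sample and cell type
counts <- table(meta$sample_id, meta$cell_type)

# Row-wise proportions, so that each sample sums to 1
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```

Proportions computed per sample like this can then be compared across samples within the tumor group and within the normal group.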
A minor comment on your workflow: if you decide to use only latent variables 1:10, run Harmony on just those. I'm not sure how much it will change things, but the extra dimensions may be raising issues with the curse of dimensionality.
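A minimal sketch of restricting Harmony to the first 10 PCs, assuming the object already has a "pca" reduction with at least that many components. The wrapper name `harmony_on_top_pcs` is hypothetical; `dims.use` and `reduction.use` are arguments of the Seurat interface to `RunHarmony` in harmony 1.x.

```r
library(Seurat)
library(harmony)

# Hypothetical wrapper: run Harmony on the first `n_dims` PCs only and
# use the same number of corrected dimensions downstream.
harmony_on_top_pcs <- function(obj, n_dims = 10) {
  obj <- RunHarmony(
    obj,
    group.by.vars = c("dataset_id", "sample_id"),
    reduction.use = "pca",
    dims.use = seq_len(n_dims)
  )
  # Build the UMAP from the corrected embedding rather than from raw PCA
  RunUMAP(obj, reduction = "harmony", dims = seq_len(n_dims))
}
```

For example, `dataMerged.ref <- harmony_on_top_pcs(dataMerged.ref, n_dims = 10)`.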
Hi,
I am performing an analysis with several published datasets of kidney tumor (that is, raw UMI counts coming from public repositories) and I wanted to integrate them, but I'm having some issues with overcorrection.
I have 68 tumor samples coming from 68 different patients and, for some of them, I also have a sample coming from the surrounding healthy tissue (normal). In total, 68 tumor + 19 normal = 87 scRNA-seq expression matrices (coming from 9 different datasets). I want to integrate all of this data together and remove the batch effect (sample-wise) and the dataset bias (i.e., the bias that arises from using different datasets), but not the tumor-normal differences.
I tried to integrate with RunHarmony as:
NOTE: I took the first 10 dimensions based on their standard deviation and where it "plateaus" (Fig. 1).
Fig. 1
But the results are quite overcorrected. Although tumor and normal tissues should share some cell types (such as lymphocytes, endothelial cells, etc.), there should be at least one big cluster of cells in the tumor samples that is not present in the normal ones (the malignant cells themselves). I see very little difference in the UMAP plot (Fig. 2):
Fig. 2
Is there any way I could integrate tumor and normal data to remove the batch-effect without removing the intrinsic differences between sample types? Perhaps I should integrate them separately (tumor vs normal) and then integrate the Harmony embeddings with Seurat v5's IntegrateEmbeddings function?
I'm quite new to scRNA-seq analysis, so any comment or suggestion is more than appreciated. Thanks in advance.