Integration of patient-matched tumor and normal tissue samples

samgest commented 9 months ago

Hi,

I am performing an analysis with several published datasets of kidney tumor (that is, raw UMI counts coming from public repositories) and I wanted to integrate them, but I'm having some issues with overcorrection.

I have 68 tumor samples from 68 different patients and, for some of those patients, I also have a sample from the surrounding healthy tissue (normal). In total, 68 tumor + 19 normal = 87 scRNA-seq expression matrices (coming from 9 different datasets). I want to integrate all of this data and remove the sample-wise batch effect and the dataset bias (i.e., the bias that arises from using different datasets), but not the tumor-normal differences.

I tried to integrate with RunHarmony as:

dataMerged <- dataMerged %>%
  # Note: "sample_type" is deliberately left out of "group.by.vars", since that is the variable I don't want to correct.
  RunHarmony(group.by.vars = c("dataset_id", "sample_id"), plot_convergence = TRUE) %>%
  RunUMAP(reduction = "harmony", dims = 1:10)

NOTE: I took the first 10 dimensions based on their standard deviation and where it "plateaus" (Fig. 1).

Fig. 1 Rplot02

But the results are quite overcorrected. Although tumor and normal tissues should share some cell types (such as lymphocytes, endothelial cells, etc.), there should be at least one big cluster of cells in the tumor samples that is not present in the normal ones (the malignant cells themselves). Instead, I see very little difference in the UMAP plot (Fig. 2):

Fig. 2 Rplot05

Is there any way I could integrate tumor and normal data to remove the batch-effect without removing the intrinsic differences between sample types? Perhaps I should integrate them separately (tumor vs normal) and then integrate the Harmony embeddings with Seurat v5's IntegrateEmbeddings function?

I'm quite new to scRNA-seq analysis, so any comment or suggestion is more than appreciated. Thanks in advance.

pati-ni commented 9 months ago

Hi, are you using the latest version of the software? Please send us the output of sessionInfo().


samgest commented 9 months ago

There you go:

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.1.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] copykat_1.1.0               cutoff.scATOMIC_0.1.0       agrmt_1.42.12               Rmagic_2.0.3               
 [5] caret_6.0-94                lattice_0.22-5              randomForest_4.7-1.1        plyr_1.8.9                 
 [9] scATOMIC_2.0.2              clustree_0.5.1              ggraph_2.1.0                data.table_1.14.10         
[13] reshape2_1.4.4              DESeq2_1.42.0               GSVA_1.50.0                 BaseSet_0.9.0              
[17] EnsDb.Hsapiens.v86_2.99.0   ensembldb_2.26.0            AnnotationFilter_1.26.0     GenomicFeatures_1.54.1     
[21] reticulate_1.34.0           GSEABase_1.64.0             graph_1.80.0                annotate_1.80.0            
[25] XML_3.99-0.16               AnnotationDbi_1.64.1        HGNChelper_0.8.1            openxlsx_4.2.5.2           
[29] lubridate_1.9.3             forcats_1.0.0               stringr_1.5.1               purrr_1.0.2                
[33] readr_2.1.4                 tidyr_1.3.0                 ggplot2_3.4.4               tidyverse_2.0.0            
[37] tibble_3.2.1                dplyr_1.1.4                 patchwork_1.1.3             pheatmap_1.0.12            
[41] SingleR_2.4.0               celldex_1.12.0              SummarizedExperiment_1.32.0 Biobase_2.62.0             
[45] GenomicRanges_1.54.1        GenomeInfoDb_1.38.1         IRanges_2.36.0              S4Vectors_0.40.2           
[49] BiocGenerics_0.48.1         MatrixGenerics_1.14.0       matrixStats_1.1.0           rhdf5_2.46.1               
[53] Matrix_1.6-4                harmony_1.2.0               Rcpp_1.0.11                 Seurat_5.0.1               
[57] SeuratObject_5.0.1          sp_2.1-2                   

loaded via a namespace (and not attached):
  [1] ProtGenerics_1.34.0           spatstat.sparse_3.0-3         bitops_1.0-7                 
  [4] httr_1.4.7                    RColorBrewer_1.1-3            tools_4.3.2                  
  [7] sctransform_0.4.1             utf8_1.2.4                    R6_2.5.1                     
 [10] HDF5Array_1.30.0              lazyeval_0.2.2                uwot_0.1.16                  
 [13] rhdf5filters_1.14.1           withr_2.5.2                   prettyunits_1.2.0            
 [16] gridExtra_2.3                 progressr_0.14.0              cli_3.6.1                    
 [19] spatstat.explore_3.2-5        fastDummies_1.7.3             spatstat.data_3.0-3          
 [22] ggridges_0.5.4                pbapply_1.7-2                 Rsamtools_2.18.0             
 [25] parallelly_1.36.0             rstudioapi_0.15.0             RSQLite_2.3.4                
 [28] generics_0.1.3                BiocIO_1.12.0                 ica_1.0-3                    
 [31] spatstat.random_3.2-2         zip_2.3.0                     fansi_1.0.6                  
 [34] clipr_0.8.0                   abind_1.4-5                   lifecycle_1.0.4              
 [37] yaml_2.3.7                    recipes_1.0.8                 SparseArray_1.2.2            
 [40] BiocFileCache_2.10.1          Rtsne_0.17                    grid_4.3.2                   
 [43] blob_1.2.4                    promises_1.2.1                ExperimentHub_2.10.0         
 [46] crayon_1.5.2                  miniUI_0.1.1.1                beachmat_2.18.0              
 [49] cowplot_1.1.1                 KEGGREST_1.42.0               pillar_1.9.0                 
 [52] rjson_0.2.21                  future.apply_1.11.0           codetools_0.2-19             
 [55] leiden_0.4.3.1                glue_1.6.2                    vctrs_0.6.5                  
 [58] png_0.1-8                     spam_2.10-0                   gtable_0.3.4                 
 [61] cachem_1.0.8                  gower_1.0.1                   prodlim_2023.08.28           
 [64] S4Arrays_1.2.0                mime_0.12                     tidygraph_1.2.3              
 [67] survival_3.5-7                timeDate_4022.108             SingleCellExperiment_1.24.0  
 [70] iterators_1.0.14              hardhat_1.3.0                 lava_1.7.3                   
 [73] interactiveDisplayBase_1.40.0 ellipsis_0.3.2                fitdistrplus_1.1-11          
 [76] ipred_0.9-14                  ROCR_1.0-11                   nlme_3.1-164                 
 [79] bit64_4.0.5                   progress_1.2.3                filelock_1.0.3               
 [82] RcppAnnoy_0.0.21              rprojroot_2.0.4               irlba_2.3.5.1                
 [85] rpart_4.1.23                  KernSmooth_2.23-22            colorspace_2.1-0             
 [88] DBI_1.1.3                     nnet_7.3-19                   tidyselect_1.2.0             
 [91] bit_4.0.5                     compiler_4.3.2                curl_5.2.0                   
 [94] xml2_1.3.6                    DelayedArray_0.28.0           plotly_4.10.3                
 [97] rtracklayer_1.62.0            scales_1.3.0                  lmtest_0.9-40                
[100] rappdirs_0.3.3                digest_0.6.33                 goftest_1.2-3                
[103] spatstat.utils_3.0-4          XVector_0.42.0                htmltools_0.5.7              
[106] pkgconfig_2.0.3               sparseMatrixStats_1.14.0      dbplyr_2.4.0                 
[109] fastmap_1.1.1                 rlang_1.1.2                   htmlwidgets_1.6.4            
[112] shiny_1.8.0                   DelayedMatrixStats_1.24.0     farver_2.1.1                 
[115] zoo_1.8-12                    jsonlite_1.8.8                BiocParallel_1.36.0          
[118] ModelMetrics_1.2.2.2          BiocSingular_1.18.0           RCurl_1.98-1.13              
[121] magrittr_2.0.3                GenomeInfoDbData_1.2.11       dotCall64_1.1-1              
[124] Rhdf5lib_1.24.0               munsell_0.5.0                 viridis_0.6.4                
[127] pROC_1.18.5                   stringi_1.8.2                 zlibbioc_1.48.0              
[130] MASS_7.3-60                   AnnotationHub_3.10.0          listenv_0.9.0                
[133] ggrepel_0.9.4                 deldir_2.0-2                  graphlayouts_1.0.2           
[136] Biostrings_2.70.1             splines_4.3.2                 tensor_1.5                   
[139] hms_1.1.3                     locfit_1.5-9.8                igraph_1.5.1                 
[142] spatstat.geom_3.2-7           RcppHNSW_0.5.0                biomaRt_2.58.0               
[145] ScaledMatrix_1.10.0           BiocVersion_3.18.1            BiocManager_1.30.22          
[148] foreach_1.5.2                 tweenr_2.0.2                  tzdb_0.4.0                   
[151] httpuv_1.6.13                 RANN_2.6.1                    polyclip_1.10-6              
[154] future_1.33.0                 scattermore_1.2               ggforce_0.4.1                
[157] rsvd_1.0.5                    xtable_1.8-4                  restfulr_0.0.15              
[160] RSpectra_0.16-1               later_1.3.2                   class_7.3-22                 
[163] viridisLite_0.4.2             memoise_2.0.1                 GenomicAlignments_1.38.0     
[166] cluster_2.1.6                 timechange_0.2.0              globals_0.16.2               
[169] here_1.0.1

pati-ni commented 9 months ago

> Is there any way I could integrate tumor and normal data to remove the batch-effect without removing the intrinsic differences between sample types? Perhaps I should integrate them separately (tumor vs normal) and then integrate the Harmony embeddings with Seurat v5's IntegrateEmbeddings function?

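Yes, that is correct. If you do sample-level correction, Harmony corrects the latent dimensions for everything that is nested in that experimental design. Each of your samples is either tumor or normal, so sample type is nested within sample_id and the tumor-normal difference gets corrected along with it. Performing the correction separately, as you suggest, would be the way to go.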
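As a rough illustration of the separate correction discussed above, here is a minimal sketch. It assumes the Seurat and harmony packages are installed, that the "sample_type" column holds the labels "tumor" and "normal" (these values are a guess), and that re-running the standard preprocessing within each subset is acceptable. The helper name integrate_one_type is hypothetical, and the later step of combining the two embeddings is not shown.

```r
# Sketch only: requires the Seurat and harmony packages to be installed.
# `integrate_one_type` is a hypothetical helper; the "sample_type" labels are assumed.
integrate_one_type <- function(obj, type, n_dims = 10) {
  # Keep only the cells that belong to one sample type
  cells <- colnames(obj)[obj$sample_type == type]
  sub <- subset(obj, cells = cells)

  # Re-run standard preprocessing on the subset (an assumption, not from the thread)
  sub <- Seurat::NormalizeData(sub)
  sub <- Seurat::FindVariableFeatures(sub)
  sub <- Seurat::ScaleData(sub)
  sub <- Seurat::RunPCA(sub, npcs = n_dims)

  # Correct dataset and sample effects within this sample type only
  sub <- harmony::RunHarmony(sub, group.by.vars = c("dataset_id", "sample_id"))
  Seurat::RunUMAP(sub, reduction = "harmony", dims = 1:n_dims)
}
```

It could then be called once per sample type, for example integrate_one_type(dataMerged, "tumor") and integrate_one_type(dataMerged, "normal").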

You can, however, still investigate cell abundance within the tumor and normal kidney; that is fine to do this way.
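One possible reading of the abundance investigation is a comparison of cluster proportions between sample types. The sketch below uses a toy data frame in place of the per-cell metadata; the column names and cluster labels are made up for illustration and are not taken from the actual data.

```r
# Toy per-cell metadata standing in for the Seurat object's metadata.
# Column names and values are hypothetical.
meta <- data.frame(
  sample_type = c("tumor", "tumor", "tumor", "tumor", "normal", "normal"),
  cluster     = c("T cell", "malignant", "malignant", "endothelial",
                  "T cell", "endothelial")
)

# Number of cells per cluster within each sample type
counts <- table(meta$sample_type, meta$cluster)

# Convert to within-sample-type proportions (each row sums to 1)
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```

With real data, the same two calls could be applied to the metadata columns that hold the sample type and the cluster assignment.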

A minor comment on your workflow: if you decide to use only latent variables 1:10 downstream, run Harmony on just those. I am not sure how much it will change things, but correcting in more dimensions than you use may raise issues with the curse of dimensionality.
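A minimal sketch of that suggestion, assuming the Seurat and harmony packages are installed and that the object already has a "pca" reduction. It relies on the dims.use argument of the Seurat method of RunHarmony; the wrapper name run_harmony_top_dims is hypothetical.

```r
# Sketch only: requires the Seurat and harmony packages to be installed.
# `run_harmony_top_dims` is a hypothetical wrapper name.
run_harmony_top_dims <- function(obj, n_dims = 10) {
  # Restrict Harmony to the same PCs that are used downstream
  obj <- harmony::RunHarmony(
    obj,
    group.by.vars = c("dataset_id", "sample_id"),
    dims.use = 1:n_dims
  )
  # The corrected embedding now has n_dims columns, so use all of them
  Seurat::RunUMAP(obj, reduction = "harmony", dims = 1:n_dims)
}
```

Called as run_harmony_top_dims(dataMerged), this keeps the number of dimensions corrected by Harmony and the number passed to RunUMAP in agreement.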

immunogenomics / harmony

Integration of patient-matched tumor and normal tissue samples #230