Interpretation of differential enrichment + visualizations

satkinson0115 commented 1 week ago

Hello,

I ran runEscape on a single-cell dataset processed in Seurat. I then normalized the scores using performNormalization and used FindMarkers to find pathways specifically enriched in my cell type of interest (see code below):

c <- runEscape(c, method = "ssGSEA", gene.sets = final_mouse_GO,
               groups = 1000, min.size = 5,
               new.assay.name = "escape.ssGSEA",
               BPPARAM = SnowParam(workers = 15))
c_gseaNorm <- performNormalization(c,
                                   assay = "escape.ssGSEA",
                                   gene.sets = final_mouse_GO,
                                   make.positive = TRUE)

ec.markers.fullNorm <- FindMarkers(c_gseaNorm,
                                   assay = "escape.ssGSEA_normalized",
                                   min.pct = 0,
                                   logfc.threshold = 1,
                                   group.by = "cell_types",
                                   ident.1 = "EC")

When I look at the ec.markers.fullNorm table I see that GOBP-VACUOLE-ORGANIZATION has an average L2FC of 292.99. My first question is, does that avg_L2FC make sense or has something gone wrong? That is quite high.

When I look at this pathway with a ridgeEnrichment plot (using normalized values), the median value for my EC cells is lower than for all the other cell types.

ridgeEnrichment(c_gseaNorm, assay = "escape.ssGSEA_normalized", 
                gene.set = "GOBP-VACUOLE-ORGANIZATION")

With an L2FC that high, I expected the enrichment scores of the EC group to be distributed well above those of the other groups. Am I misinterpreting how the differential expression works? Or misinterpreting the plot?

When I use non-normalized values to look at the ridgeEnrichment plot the distribution is more on par with what I expect (with the exception of the top cell type):

A heatmap of the pathway also shows the opposite trend of what I'd expect (if the pathway is 292 L2FC higher than other cells it should be the brightest color in the heatmap):

heatmapEnrichment(c_gseaNorm, 
                  group.by = "cell_types",
                  gene.set.use = "GOBP-VACUOLE-ORGANIZATION",
                  assay = "escape.ssGSEA_normalized") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

When I look at the pathway with the lowest L2FC (thinking that maybe the FindMarkers was backwards, i.e. everything vs EC) I see the same type of pattern as before, which would match with the pathway being down regulated in EC cells:

heatmapEnrichment(c_gseaNorm, 
                  group.by = "cell_types",
                  gene.set.use = "GOBP-RESPONSE-TO-METAL-ION",
                  assay = "escape.ssGSEA_normalized") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Any insight into the best way to interpret this or ideas on what went wrong would be greatly appreciated.

Thank you! Samantha

> sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tidyr_1.3.1                 ggplot2_3.5.1               RColorBrewer_1.1-3         
 [4] Seurat_5.1.0                SeuratObject_5.0.2          sp_2.1-4                   
 [7] scran_1.32.0                scuttle_1.14.0              SingleCellExperiment_1.26.0
[10] SummarizedExperiment_1.34.0 Biobase_2.64.0              GenomicRanges_1.56.1       
[13] GenomeInfoDb_1.40.1         IRanges_2.38.0              S4Vectors_0.42.0           
[16] BiocGenerics_0.50.0         MatrixGenerics_1.16.0       matrixStats_1.3.0          
[19] escape_2.0.0               

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.22          splines_4.4.1             later_1.3.2              
  [4] tibble_3.2.1              R.oo_1.27.0               polyclip_1.10-6          
  [7] graph_1.82.0              XML_3.99-0.17             fastDummies_1.7.3        
 [10] lifecycle_1.0.4           edgeR_4.2.1               globals_0.16.3           
 [13] lattice_0.22-6            MASS_7.3-60.2             ggdist_3.3.2             
 [16] magrittr_2.0.3            limma_3.60.4              plotly_4.10.4            
 [19] metapod_1.12.0            httpuv_1.6.15             sctransform_0.4.1        
 [22] spam_2.10-0               spatstat.sparse_3.1-0     reticulate_1.38.0        
 [25] cowplot_1.1.3             pbapply_1.7-2             DBI_1.2.3                
 [28] pkgload_1.4.0             abind_1.4-5               zlibbioc_1.50.0          
 [31] Rtsne_0.17                presto_1.0.0              purrr_1.0.2              
 [34] R.utils_2.12.3            msigdbr_7.5.1             GenomeInfoDbData_1.2.12  
 [37] ggrepel_0.9.5             irlba_2.3.5.1             listenv_0.9.1            
 [40] spatstat.utils_3.0-5      GSVA_1.52.3               goftest_1.2-3            
 [43] RSpectra_0.16-1           dqrng_0.4.1               spatstat.random_3.2-3    
 [46] annotate_1.82.0           fitdistrplus_1.1-11       parallelly_1.37.1        
 [49] DelayedMatrixStats_1.26.0 leiden_0.4.3.1            codetools_0.2-20         
 [52] DelayedArray_0.30.1       tidyselect_1.2.1          farver_2.1.2             
 [55] UCell_2.8.0               UCSC.utils_1.0.0          ScaledMatrix_1.12.0      
 [58] spatstat.explore_3.2-7    jsonlite_1.8.8            BiocNeighbors_1.22.0     
 [61] progressr_0.14.0          ggridges_0.5.6            survival_3.6-4           
 [64] tools_4.4.1               ica_1.0-3                 Rcpp_1.0.12              
 [67] glue_1.7.0                gridExtra_2.3             SparseArray_1.4.8        
 [70] distributional_0.5.0      AUCell_1.26.0             dplyr_1.1.4              
 [73] HDF5Array_1.32.0          withr_3.0.1               BiocManager_1.30.23      
 [76] fastmap_1.2.0             bluster_1.14.0            ggpointdensity_0.1.0     
 [79] rhdf5filters_1.16.0       fansi_1.0.6               digest_0.6.36            
 [82] rsvd_1.0.5                R6_2.5.1                  mime_0.12                
 [85] colorspace_2.1-0          scattermore_1.2           tensor_1.5               
 [88] spatstat.data_3.1-2       RSQLite_2.3.7             R.methodsS3_1.8.2        
 [91] utf8_1.2.4                generics_0.1.3            data.table_1.15.4        
 [94] httr_1.4.7                htmlwidgets_1.6.4         S4Arrays_1.4.1           
 [97] uwot_0.2.2                pkgconfig_2.0.3           gtable_0.3.5             
[100] blob_1.2.4                lmtest_0.9-40             XVector_0.44.0           
[103] htmltools_0.5.8.1         dotCall64_1.1-1           GSEABase_1.66.0          
[106] scales_1.3.0              png_0.1-8                 SpatialExperiment_1.14.0 
[109] rstudioapi_0.16.0         rjson_0.2.23              reshape2_1.4.4           
[112] nlme_3.1-164              rhdf5_2.48.0              cachem_1.1.0             
[115] zoo_1.8-12                stringr_1.5.1             KernSmooth_2.23-24       
[118] parallel_4.4.1            miniUI_0.1.1.1            AnnotationDbi_1.66.0     
[121] pillar_1.9.0              grid_4.4.1                vctrs_0.6.5              
[124] RANN_2.6.1                promises_1.3.0            BiocSingular_1.20.0      
[127] beachmat_2.20.0           xtable_1.8-4              cluster_2.1.6            
[130] magick_2.8.5              locfit_1.5-9.10           cli_3.6.3                
[133] compiler_4.4.1            rlang_1.1.4               crayon_1.5.3             
[136] future.apply_1.11.2       labeling_0.4.3            plyr_1.8.9               
[139] stringi_1.8.4             viridisLite_0.4.2         deldir_2.0-4             
[142] BiocParallel_1.38.0       babelgene_22.9            munsell_0.5.1            
[145] Biostrings_2.72.1         lazyeval_0.2.2            spatstat.geom_3.2-9      
[148] Matrix_1.7-0              RcppHNSW_0.6.0            patchwork_1.2.0          
[151] sparseMatrixStats_1.16.0  bit64_4.0.5               future_1.33.2            
[154] Rhdf5lib_1.26.0           statmod_1.5.0             KEGGREST_1.44.1          
[157] shiny_1.8.1.1             ROCR_1.0-11               igraph_2.0.3             
[160] memoise_2.0.1             bit_4.0.5

ncborcherding commented 6 days ago

Hey @satkinson0115,

Thanks for reaching out and providing an excellent rundown of what is going on, along with your sessionInfo(). I am going to take this so new users can take a look at it.

When I look at the ec.markers.fullNorm table I see that GOBP-VACUOLE-ORGANIZATION has an average L2FC of 292.99. My first question is, does that avg_L2FC make sense or has something gone wrong? That is quite high.

No, that does not make sense to me; it is far higher than I would expect. You are comparing different cell types, though, so it might happen if you are looking at epithelial cells vs. neutrophils (which will have very low counts). But if the counts are roughly the same, or within the same order of magnitude, that seems very high.

Are there NA cell_types that are not being plotted?
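A quick way to check for unlabeled cells is to tabulate the grouping column with NA values included. This is a minimal base R sketch on a made-up vector; in practice the vector would be the cell_types metadata column of the Seurat object.

```r
# Made-up labels standing in for the cell_types metadata column
cell_types <- c("EC", "EC", "Fibroblast", NA, "Macrophage", "EC", NA)

# useNA = "ifany" adds an <NA> entry only when missing labels exist
type_counts <- table(cell_types, useNA = "ifany")
print(type_counts)

# Number of cells without a label
n_missing <- sum(is.na(cell_types))
print(n_missing)
```

If n_missing is greater than zero, those cells would still fall into the comparison group even though they may not show up in a plot grouped by cell type.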

With a L2FC that high I expected the enrichment scores of the EC group to be distributed well higher than the other groups. Am I interpreting how the differential expression works wrong? Or interpreting the plot wrong?

No, you are interpreting it correctly: with a high log FC, you would expect the EC distribution to be shifted far to the right.

From the normalized data perspective, my guess is that the EC cells have a higher level of features/counts and that is responsible for your difference in normalized vs unnormalized enrichment:

Normalized:

Unnormalized:

The ridge plot and the heatmap of the normalized values appear to show the same trend. What does not make sense is the logFC values you are getting. I am happy to troubleshoot with you, but honestly I do not have an answer. If you want to email me your Seurat object or a sample of it, I can try to debug.

Sorry I do not have a better solution at the moment.

Nick

satkinson0115 commented 6 days ago

Hi @ncborcherding,

Thank you so much for responding! I'm glad that my understanding of what was supposed to be happening was accurate and appreciate any help troubleshooting you're willing to give. I'm emailing you the Seurat object presently.

You're correct that the Epithelium has the majority of the cell counts (7400), but nothing is "super low" in my opinion. My lowest frequency for this dataset is 719 cells in a cell type. While it's 10x lower than epithelium, it's nothing to sneeze at either (in my opinion at least, I could be wrong).

Per your question about NA cell types: no, there aren't any. From my understanding of FindMarkers, when I set ident.1 = "EC" I should be getting differentially enriched pathways of EC compared to all the other cell types combined, yes?
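For reference, a one-vs-rest comparison of this kind can be sanity-checked by hand. The sketch below uses made-up scores for a single pathway and plain base R, not Seurat itself; it pools every non-EC cell into one "rest" group and compares the group means directly.

```r
# Made-up per-cell scores for one pathway, with cell type labels
cell_types <- c("EC", "EC", "Fib", "Fib", "Mac", "Mac")
score      <- c(0.2, 0.4, 0.8, 1.0, 0.6, 0.8)

# ident.1 = "EC" with no ident.2 means: EC vs. every other cell pooled
group <- ifelse(cell_types == "EC", "EC", "rest")

# Mean score per group and their difference (EC minus rest)
group_means <- tapply(score, group, mean)
mean_diff   <- unname(group_means["EC"] - group_means["rest"])
print(group_means)
print(mean_diff)
```

A negative difference here means the pathway scores lower in EC than in the pooled remainder, which is the direction a ridge plot or heatmap of the same values should also show.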

Thanks again, Samantha

ncborcherding commented 6 hours ago

Hey @satkinson0115,

I did make some progress looking at your data. The issue is in fact due to the built-in normalization function when the default scale.factor is used. Here is what happens when I perform a similar analysis on the example from the site. I'm using this because I had to run the GSEA calculation multiple times and your data set is huge.

scRep_example <- performNormalization(scRep_example, 
                                      assay = "escape.ssGSEA", 
                                      gene.sets = GS.hallmark)

all.markers <- FindAllMarkers(scRep_example, 
                              assay = "escape.ssGSEA_normalized", 
                              min.pct = 0,
                              logfc.threshold = 0)

head(all.markers)

                                                p_val   avg_log2FC pct.1 pct.2    p_val_adj cluster
HALLMARK-REACTIVE-OXYGEN-SPECIES-PATHWAY 1.831207e-39  -10.8060011 0.994 0.927 9.156035e-38       1
HALLMARK-APICAL-JUNCTION                 2.772834e-25  -21.6672490 0.983 0.921 1.386417e-23       1
HALLMARK-IL2-STAT5-SIGNALING             3.733223e-21         -Inf 1.000 1.000 1.866612e-19       1
HALLMARK-INFLAMMATORY-RESPONSE           2.230958e-18 -786.0173530 1.000 1.000 1.115479e-16       1
HALLMARK-DNA-REPAIR                      8.666808e-16   -0.5134797 0.985 0.945 4.333404e-14       1
HALLMARK-ALLOGRAFT-REJECTION             6.321843e-15 -203.3025383 1.000 1.000 3.160921e-13       1
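One possible mechanism for values like -786 or -Inf, offered as an assumption rather than something confirmed in this thread: if the fold change is computed by back-transforming the values with expm1() before averaging (which, as far as I know, is how Seurat treats log-normalized data by default), then scores that are large in magnitude explode or overflow. The helper log2fc_expm1 below is hypothetical and the numbers are made up.

```r
# Hypothetical helper: log2 fold change after an expm1 back-transform.
# This mimics the idea only; it is not Seurat's or escape's own code.
log2fc_expm1 <- function(x1, x2, pseudocount = 1) {
  log2(mean(expm1(x1)) + pseudocount) - log2(mean(expm1(x2)) + pseudocount)
}

# Scores on a log-like scale give a modest fold change (about 0.29)
small_a  <- c(1.0, 1.2, 0.9)
small_b  <- c(0.8, 1.0, 0.7)
fc_small <- log2fc_expm1(small_a, small_b)

# The same scores multiplied by 300 give a huge fold change (about 87)
fc_large <- log2fc_expm1(small_a * 300, small_b * 300)

# Past roughly 709, expm1() overflows to Inf, so the result becomes -Inf
fc_overflow <- log2fc_expm1(c(700, 705), c(720, 690))

print(c(small = fc_small, large = fc_large, overflow = fc_overflow))
```

If this is what is happening, bringing the normalized scores down to a smaller range before calling FindAllMarkers would be expected to give finite, interpretable values.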

For now, I think you can get around this by setting your scale.factor equal to the nFeatures:

c_gseaNorm <- performNormalization(c,
                                   assay = "escape.ssGSEA",
                                   gene.sets = final_mouse_GO,
                                   scale.factor = c$nFeature_RNA)
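To illustrate what a per-cell scale factor does in principle, here is a minimal base R sketch on a made-up cells-by-gene-set matrix. It only shows the division step and is not escape's internal code; the names scores and n_features are placeholders.

```r
# Made-up raw enrichment scores: rows are cells, columns are gene sets
scores <- matrix(c(1200, 2400, 600, 900, 1800, 450), nrow = 3,
                 dimnames = list(c("cell1", "cell2", "cell3"),
                                 c("SET-A", "SET-B")))

# Made-up per-cell feature counts, standing in for nFeature_RNA
n_features <- c(cell1 = 2000, cell2 = 4000, cell3 = 1000)

# The scale factor must line up with the cells, one value per row
stopifnot(identical(names(n_features), rownames(scores)))

# Divide each cell's scores by that cell's feature count
scaled <- sweep(scores, 1, n_features, "/")
print(scaled)
```

In this toy case the raw scores differ fourfold between cells purely because of feature count, and the scaled scores come out identical within each gene set.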

I will work on the bug fix, hopefully this week, and post here when it is fixed.

Thanks again for all your help!

Nick

ncborcherding / escape

Interpretation of differential enrichment + visualizations #131