performNormalization - Githubissues

st-tky commented 2 months ago

Hi escape team,

Thank you for developing this wonderful package. I am trying to understand best practices for a GSEA workflow, and I have some questions about normalization. I have an integrated Seurat object (20 datasets, 5 different conditions) and want to evaluate gene set enrichment on this dataset. My questions are as follows:

1. Is the workflow shown in the vignette applicable to integrated datasets?
2. When I compare results across conditions before and after performNormalization(), they look very different because nFeature values differ between conditions. In particular, with normalized scores, all gene sets (I used the Hallmark collection) were highly enriched in one group with lower nFeature/nCount and less enriched in groups with higher nFeature/nCount. Does this make sense? And should I trust the normalized results over the non-normalized ones, even in such an extreme case?
3. Should I re-run the normalization step if I subset some cells from the whole dataset? The results differed depending on whether I re-normalized after subsetting.

I would appreciate any suggestions or advice.

ncborcherding commented 2 months ago

Hey @st-tky,

Thanks for reaching out.

1. All of the gene set enrichment methods use raw count data for the calculation, so although the example data is integrated, the integrated assay is not used for the underlying calculation.
2. The assumption behind normalization is that the enrichment score is affected by the number of features per cell: the fewer features a cell has, the less likely it is that any gene from a particular gene set will be detected, and the lower the enrichment score. As a result, normalizing with performNormalization(), or with the internal normalization in escape.matrix() or runEscape(), may preferentially weight pathways towards cells with fewer features. I think you will need to examine the data and see whether the normalized results make sense and whether the quality of those low-feature/low-count cells is adequate. There is no best practice here, but it is a very good question.
3. The normalization methods provided in escape should be cell-specific, so there is no need to re-run the normalization after subsetting.
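
The points about normalization can be illustrated with a toy example. This is a minimal Python sketch, not escape's actual code: it assumes a scheme in which each cell's raw score is divided by the number of gene-set genes detected in that same cell. The helper names `detected_in_set` and `normalize_scores`, the gene names, and the scores are all hypothetical.

```python
def detected_in_set(cell_counts, gene_set):
    """Count the gene-set genes with a nonzero count in one cell."""
    return sum(1 for gene in gene_set if cell_counts.get(gene, 0) > 0)


def normalize_scores(raw_scores, counts, gene_set):
    """Divide each cell's raw score by its own number of detected gene-set genes.

    Assumed per-cell scheme for illustration only. Cells with no detected
    gene-set genes use a scale factor of 1 to avoid division by zero.
    """
    normalized = {}
    for cell, score in raw_scores.items():
        scale = max(detected_in_set(counts[cell], gene_set), 1)
        normalized[cell] = score / scale
    return normalized


gene_set = ["A", "B", "C", "D"]
counts = {
    "rich_cell": {"A": 5, "B": 3, "C": 2, "D": 1},    # all 4 genes detected
    "sparse_cell": {"A": 4, "B": 0, "C": 0, "D": 0},  # only 1 gene detected
}
raw_scores = {"rich_cell": 0.8, "sparse_cell": 0.4}

# Normalize the whole dataset, then only a subset of it.
full = normalize_scores(raw_scores, counts, gene_set)
subset = normalize_scores({"sparse_cell": 0.4}, counts, gene_set)

print(full)
print(subset)
```

In this toy case the sparse cell starts with the lower raw score but ends up with the higher normalized score, which mirrors the pattern described in question 2. Because each cell is scaled only by its own values, normalizing the subset gives the same number for that cell as normalizing the full dataset. If re-normalizing after subsetting changes the results in practice, that would suggest some step that is not purely per-cell.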

Hope that helps and let me know if you have any questions,

Nick

st-tky commented 2 months ago

Hi Nick,

Thank you so much for your very kind and detailed explanations. I ran all four implemented methods (AUCell, GSVA, ssGSEA, and UCell) with internal normalization in runEscape(). The results varied between methods, so I understand that we need to choose the "better" method(s) depending on our data.

I also ran SCpubr::do_EnrichmentHeatmap, AddModuleScore_UCell, and AddModuleScore from other packages on the same datasets, and these also gave different results. I noticed that the "maxRank" parameter in UCell is probably important to adjust for each dataset, particularly when fewer genes are expressed, because the many genes that are not detected in scRNA-seq data might bias the enrichment analysis. So it might be a good idea to expose this maxRank parameter in runEscape() when method = "UCell" (if it is not already).
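
To show why maxRank matters, here is a simplified Python sketch of a UCell-style score for a single cell. It follows the published description of the method (rank genes by expression, cap ranks above maxRank at maxRank + 1, then rescale the Mann-Whitney U statistic), but it is not the package's code. The function name `ucell_like_score` and the toy data are hypothetical, and ties are broken by gene name for simplicity rather than by average rank.

```python
def ucell_like_score(expression, gene_set, max_rank):
    """Simplified UCell-style score for one cell.

    expression: dict mapping gene name -> expression value.
    Genes are ranked in descending order of expression (rank 1 = highest).
    Ties are broken by gene name here, which is a simplification.
    """
    ordered = sorted(expression, key=lambda g: (-expression[g], g))
    rank = {gene: i + 1 for i, gene in enumerate(ordered)}

    genes = [g for g in gene_set if g in rank]
    n = len(genes)
    if n == 0:
        return 0.0

    # Ranks beyond max_rank are all treated as max_rank + 1.
    capped = [min(rank[g], max_rank + 1) for g in genes]
    u_stat = sum(capped) - n * (n + 1) / 2
    return 1 - u_stat / (n * max_rank)


# Toy cell: 3 detected genes, 7 genes with zero counts.
cell = {"g1": 9, "g2": 8, "g3": 7}
cell.update({f"g{i}": 0 for i in range(4, 11)})
signature = ["g1", "g2", "g9"]  # g9 is not detected in this cell

low_cap = ucell_like_score(cell, signature, max_rank=3)
high_cap = ucell_like_score(cell, signature, max_rank=10)
print(low_cap, high_cap)
```

With `max_rank=3` the undetected gene is capped just past the detected genes, so its arbitrary position among the zero-count genes no longer affects the score. With `max_rank=10` that arbitrary position is counted in full and lowers the score. This is one way the choice of maxRank can change results in cells with few detected genes.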

Anyway, thank you again for your help!

Flu09 commented 1 month ago

May I ask how you chose the better method for your data?

st-tky commented 1 month ago

Hi @Flu09, I decided to use a different workflow, shown here (https://crazyhottommy.github.io/scRNA-seq-workshop-Fall-2019/scRNAseq_workshop_3.html). Because I wanted to compare the same cluster between different conditions, I assumed this "conventional pre-ranked GSEA" would be sufficient and straightforward.

ncborcherding / escape

performNormalization #118