Significance Testing in Cell-Level

cagataysahinn commented 8 months ago

Hi Nick,

I hope you're doing well. Firstly, I want to commend you on the excellent work with the package - it's been incredibly useful!

I'm reaching out with a question regarding significance testing within gene sets at the cell level, rather than between different samples as getsignificance currently operates. To explore this, I generated random gene sets and applied EnrichIt to them. I expected the mean score to be around 0, but instead it averaged around 2000. Do you have any idea what the reason might be?

Additionally, I've plotted the 0.95, 0.5, and 0.05 quantile values on a graph to visually represent the data. I'm considering interpreting values above the 0.95 quantile as significantly positively enriched and those below the 0.05 quantile as negatively enriched. The quantile values are shown as the three blue lines. Would you say this approach is reasonable?

Here's the graph I've generated: [image: escapeissuerandomgenesetenrichemnt] The x-axis shows the clusters, while the y-axis shows the samples and enrichment scores. Red dots mark the points with an enrichment score lower than the 0.05 quantile of the random gene set scores.

Best regards, Çağatay

ncborcherding commented 8 months ago

Hey Çağatay,

Super interesting idea - so you have generated random counts and gene sets? I think the method used to generate the original data will be the major contributor to your observed trends. Is each set of boxplots a different gene set, and are the individual boxes samples or clusters?

GSEA and ssGSEA rely on a walk across the ranked genes and report the point of maximal value (check out more here), so I would not expect a mean value of 0, because the enrichment score is the maximal value of that walk.

For real count data - I think you can analyze using the quantiles system you have set up, but I would do that at the individual gene set levels and not across all gene sets.

cagataysahinn commented 8 months ago

Hello again Nick,

Thank you for your reply!

Data Generation Method: The original data used in this analysis was obtained from the 10x platform, processed using Seurat, and normalized. Each set of boxplots is a different cluster, and the individual boxes are samples. We wanted to see whether the enrichment scores of the cells in the individual clusters are significant or not.

Expectations from Random Gene Sets: What I understand from you is that, because the algorithms focus on identifying the maximal enrichment value, the mean value across all genes within the set is not necessarily expected to be zero. Instead, it's influenced by the maximal enrichment value. I hope I am correct.

In our case, to enhance robustness, we generated multiple "negative control" gene sets: I created 500 random gene sets, each containing 100 random genes. In two-sample GSEA, when there's no real enrichment, the running statistic fluctuates randomly around zero as it moves through the ranked gene list, so that's why we expected to see zero. In the graph below, boxplots are generated separately for each cluster. Data points represent individual cells, and each boxplot shows the overall distribution. Each boxplot along the x-axis was plotted using a different gene set. The hallmark pathways (50 of them) are plotted on the left side of the line, and the negative control gene sets are on the right side. We applied the quantile function to the scores of the randomly generated gene sets to get the quantile values.

[image: Screenshot 2024-03-26 134515]
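
For readers who want to reproduce the negative-control setup described above (500 random gene sets of 100 genes each), here is a minimal sketch in Python. It is not code from this thread or from escape. The function name `make_random_gene_sets` and the placeholder gene universe are assumptions made for illustration. In practice the universe would be the gene names of the count matrix being scored.

```python
import numpy as np

def make_random_gene_sets(gene_names, n_sets=500, set_size=100, seed=0):
    """Draw n_sets gene sets, each holding set_size distinct genes.

    Hypothetical helper for illustration; not part of escape.
    """
    rng = np.random.default_rng(seed)
    genes = np.asarray(gene_names)
    return {
        # Sampling without replacement keeps genes unique within a set
        f"random_{i:03d}": rng.choice(genes, size=set_size, replace=False).tolist()
        for i in range(n_sets)
    }

# Placeholder gene universe (assumption): replace with the real gene names
universe = [f"GENE{i}" for i in range(2000)]
random_sets = make_random_gene_sets(universe)
```

Fixing the seed makes the control sets reproducible, and changing `set_size` is a simple way to try different gene set sizes.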

ncborcherding commented 8 months ago

Very interesting work and thanks for following up. There are a couple thoughts I had:

I think the enrichment score being based on the maximum walk value is likely what explains the higher-than-expected enrichment scores.
The GSEA/GSVA process itself is based on the count-level data - so heads up that the normalized values aren't used, and the results can be prone to issues with integrated/multiple samples.
I would play around with changing the size of the random gene sets. I think there is a sweet spot for gene set size somewhere between 5-10 and 100, although that might be anecdotal.

Nick

cagataysahinn commented 7 months ago

Thank you Nick!

Then, do you think defining the 97.5% and 2.5% quantiles of the random gene set scores (5% of the data when combined, two-tailed) as the significance thresholds makes sense?
Can you also help me understand whether your algorithm uses the same approach as the original GSEA algorithm where the running score declines steadily across the ranked input unless "gene hits" are encountered?

Çağatay

ncborcherding commented 7 months ago

Hey Çağatay,

I think using the 97.5% and 2.5% thresholds makes sense.
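
As a rough illustration of that two-tailed cutoff, the sketch below computes the 2.5% and 97.5% empirical quantiles of null scores and labels observed scores against them. It is written in Python with simulated numbers, not with escape's output. The function names, the simulated mean of 2000, and the spread are all assumptions chosen only to make the example run.

```python
import numpy as np

def null_thresholds(null_scores, lower=0.025, upper=0.975):
    """Return the (lower, upper) empirical quantiles of the pooled null scores."""
    flat = np.asarray(null_scores, dtype=float).ravel()
    lo, hi = np.quantile(flat, [lower, upper])
    return lo, hi

def call_enrichment(observed, lo, hi):
    """Label each score: -1 if below lo, +1 if above hi, 0 otherwise."""
    observed = np.asarray(observed, dtype=float)
    return np.where(observed > hi, 1, np.where(observed < lo, -1, 0))

# Simulated stand-in for scores of random gene sets (cells x random sets)
rng = np.random.default_rng(1)
null_scores = rng.normal(loc=2000.0, scale=300.0, size=(1000, 500))

# Hypothetical observed scores of three cells for one real gene set
observed = np.array([1200.0, 2000.0, 2900.0])

lo, hi = null_thresholds(null_scores)
calls = call_enrichment(observed, lo, hi)
```

To get cluster-specific thresholds, the same functions can be applied to only the rows (cells) belonging to one cluster.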

The GSEA in escape is actually single sample GSEA (ssGSEA) from Barbie et al. The major underlying difference is that the enrichment score is calculated per sample, instead of by phenotype label.
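
For reference, the running-sum walk of the original GSEA that the question describes can be sketched as follows. This is a minimal Python illustration of the classic statistic (step up at gene hits, weighted by the ranking score; step down at misses; enrichment score taken as the maximum deviation from zero). It is not escape's implementation, and ssGSEA differs in its details. The function name and the toy gene list are assumptions.

```python
import numpy as np

def gsea_running_sum(ranked_genes, ranked_scores, gene_set, weight=1.0):
    """Classic GSEA walk over genes already sorted by decreasing score.

    Returns (running_sum, enrichment_score). Illustrative sketch only.
    """
    in_set = np.isin(np.asarray(ranked_genes), list(gene_set))
    n_miss = int((~in_set).sum())
    # Hits contribute |score|^weight; misses contribute nothing here
    hit_weights = np.abs(np.asarray(ranked_scores, dtype=float)) ** weight * in_set
    if n_miss == 0 or hit_weights.sum() == 0:
        raise ValueError("need at least one miss and one hit with non-zero score")
    step_up = hit_weights / hit_weights.sum()   # increments at gene hits
    step_down = (~in_set) / n_miss              # steady decline at misses
    running = np.cumsum(step_up - step_down)
    # Enrichment score: the point of maximum deviation from zero
    es = running[np.argmax(np.abs(running))]
    return running, es

# Toy ranked list (assumption): two set members at the top of the ranking
genes = ["A", "B", "C", "D", "E", "F"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
running, es = gsea_running_sum(genes, scores, {"A", "B"})
```

The walk always returns to zero at the end of the list, so the reported score reflects the extreme of the walk and not its average.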

Nick

ncborcherding / escape

Significance Testing in Cell-Level #91