Interpretation and improving the sc.tl.score_genes functionality

Elhl93 commented 3 years ago

[ ] New analysis tool: Calculate enrichment of a pathway across Conditions
[ ] New plotting function: Distribution of gene-set score across Conditions.

Hi,

I really like your implementation of sc.tl.score_genes, which makes it possible to extract biological information based on prior knowledge or to follow up on genes from a DE gene analysis.

I ran the function with different gene sets on my dataset. The Conditions (of one cell type) are colored in the density plots below. I am currently not sure how to interpret the results, so I would like to hear your thoughts and maybe some suggestions for improvement. In both cases, >30 genes with medium/high expression are included.

In Plot1, gene-set A is enriched in Condition "green" compared to the other Conditions, judging by the medians. The scores are slightly negative, so I assume that the background set is more highly expressed in those Conditions, which might indicate that gene-set B is depleted in the other Conditions.

Plot1
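For context on the negative scores: according to the scanpy documentation, the score is the average expression of the gene set minus the average expression of a reference set of genes. A minimal sketch of that idea follows, assuming a dense cells x genes matrix and a hand-picked reference set. The real function samples the reference set from expression-matched bins, which is omitted here, and `simple_gene_set_score` is a hypothetical helper, not part of scanpy.

```python
import numpy as np

def simple_gene_set_score(X, gene_idx, ref_idx):
    """Per-cell score: mean expression of the gene set minus mean of a reference set.

    Simplified illustration only; sc.tl.score_genes additionally samples the
    reference genes from expression-matched bins.
    """
    X = np.asarray(X, dtype=float)
    return X[:, gene_idx].mean(axis=1) - X[:, ref_idx].mean(axis=1)

# Hypothetical toy matrix: 3 cells x 4 genes (values are made up).
X = np.array([
    [2.0, 2.0, 1.0, 1.0],  # gene set above reference  -> positive score
    [1.0, 1.0, 1.0, 1.0],  # gene set equals reference -> zero score
    [0.5, 0.5, 1.0, 2.0],  # gene set below reference  -> negative score
])
scores = simple_gene_set_score(X, gene_idx=[0, 1], ref_idx=[2, 3])
print(scores)
```

Under this reading, a negative score only says that the gene set is expressed below its reference set in that cell. By itself it does not say the set is depleted relative to another Condition; that comparison has to be made between the score distributions.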

In Plot 2, we see that gene-set B is depleted in purple. There might, however, be a slight enrichment in the other Conditions.

Plot2

Can we infer from such an analysis how much a pathway is upregulated (e.g. by calculating the FC of the mean)? It would be great to conclude, for example, that Pathway X is 30% more active in condition Y.
In your opinion, how does class imbalance affect the analysis? For example, Condition A has 10 samples, while for Conditions B, C, etc. I only have 3 each.
I am happy to provide the code for the density distributions to visualise the results of the gene-set-score function.

Looking forward to hearing your thoughts!

giovp commented 3 years ago

hey, thanks for the interest and very good questions, my 2 cents:

> Can we infer from such an analysis how much a pathway is upregulated (e.g. by calculating the FC of the mean)? It would be great to conclude, for example, that Pathway X is 30% more active in condition Y.

I think you could, yes, maybe as a complement to standard approaches like a hypergeometric test?
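As a rough illustration of the hypergeometric test mentioned here, the sketch below asks whether a pathway is over-represented among DE genes. All counts are hypothetical placeholders, and the variable names are made up for this example; `scipy.stats.hypergeom` is the only library call used.

```python
from scipy.stats import hypergeom

# All numbers below are hypothetical placeholders, not results from this thread.
n_background = 2000  # genes tested in the DE analysis
n_in_set = 40        # pathway genes present in that background
n_de = 100           # genes called differentially expressed
n_overlap = 8        # DE genes that belong to the pathway

# Overlap expected by chance if DE genes were drawn at random.
expected = n_de * n_in_set / n_background

# One-sided p-value P(overlap >= n_overlap); sf(k - 1) gives P(X >= k).
p_value = hypergeom.sf(n_overlap - 1, n_background, n_in_set, n_de)

print(f"expected overlap: {expected:.1f}, observed: {n_overlap}, p = {p_value:.2e}")
```

This tests over-representation in a thresholded gene list, so it complements the per-cell score rather than replacing it.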

> In your opinion, how does class imbalance affect the analysis? For example, Condition A has 10 samples, while for Conditions B, C, etc. I only have 3 each.

since you are comparing densities, it should be ok (?). You could also try subsampling the condition with more samples n times.
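A minimal sketch of the repeated subsampling idea, assuming the scores of each condition are available as 1-D arrays (for example taken from `adata.obs`). It subsamples cells rather than whole samples, which is an assumption of this example; `subsampled_median_diff` is a hypothetical helper and the data are simulated.

```python
import numpy as np

def subsampled_median_diff(scores_large, scores_small, n_repeats=100, seed=0):
    """Repeatedly subsample the larger group down to the smaller group's size
    and return the difference in medians for each repeat."""
    rng = np.random.default_rng(seed)
    scores_large = np.asarray(scores_large, dtype=float)
    scores_small = np.asarray(scores_small, dtype=float)
    size = min(len(scores_large), len(scores_small))
    diffs = np.empty(n_repeats)
    for i in range(n_repeats):
        # Draw without replacement so each repeat is a true subsample.
        sub = rng.choice(scores_large, size=size, replace=False)
        diffs[i] = np.median(sub) - np.median(scores_small)
    return diffs

# Simulated scores (made up): condition A has many more cells than condition B.
rng = np.random.default_rng(1)
cond_a = rng.normal(0.2, 0.1, size=1000)
cond_b = rng.normal(0.0, 0.1, size=300)

diffs = subsampled_median_diff(cond_a, cond_b, n_repeats=50)
print(f"median difference: {diffs.mean():.3f} +/- {diffs.std():.3f}")
```

If the spread of `diffs` across repeats is small relative to their mean, the conclusion is unlikely to be driven by the imbalance in group size.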

> I am happy to provide the code for the density distributions to visualise the results of the gene-set-score function.

thanks, very much appreciated! Actually, I don't think we really have a class/example of density/line plots in scanpy. Not sure if it would be of broad enough use/scope.

pinging @dawe (original author of the function).

dawe commented 3 years ago

Hi, sorry for the late response...

> Can we infer from such an analysis how much a pathway is upregulated (e.g. by calculating the FC of the mean)? It would be great to conclude, for example, that Pathway X is 30% more active in condition Y.

I think so. Historically, this function has been used to score cell cycle and, in that case, one can say that cells are in a specific state because of a different distribution of signatures. This is generally true. I have myself used the score to highlight cells with activated/depleted pathways. Also, I have used gene lists from KEGG or Reactome to score single cells. IMHO, once you have those values you can perform any statistical test on their distributions to tell if there's a difference in activation of a certain pathway. There may be better ways to do this, but it's a start.
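One possible way to run such a test, sketched with simulated scores: a two-sided Mann-Whitney U test plus the difference in mean score as an effect size. The choice of test, the variable names, and the data are assumptions of this example, not something prescribed in the thread.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Simulated per-cell scores for two conditions (made-up values).
rng = np.random.default_rng(0)
scores_y = rng.normal(0.15, 0.1, size=500)   # condition Y
scores_ref = rng.normal(0.0, 0.1, size=500)  # condition to compare against

# Non-parametric test for a shift between the two score distributions.
stat, p_value = mannwhitneyu(scores_y, scores_ref, alternative="two-sided")

# Effect size as a difference of means. Scores can be negative or close to
# zero, so a ratio such as "30% more active" is not well defined for them.
mean_diff = scores_y.mean() - scores_ref.mean()

print(f"U = {stat:.0f}, p = {p_value:.2e}, mean difference = {mean_diff:.3f}")
```

Note that this treats every cell as an independent observation; with only a few samples per condition, comparing per-sample mean scores may be the more conservative option.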

> In your opinion, how does class imbalance affect the analysis? For example, Condition A has 10 samples, while for Conditions B, C, etc. I only have 3 each.

As @giovp pointed out, it should be ok, as long as you have enough cells to estimate the distributions.

> I am happy to provide the code for the density distributions to visualise the results of the gene-set-score function.

What I usually do is to calculate the embedding_density for signatures, so that it's easy to visualize them on my embeddings (I usually cut values into quartiles).
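One reading of this workflow, sketched below: bin the score into quartiles, then compute and plot the embedding density per quartile. It assumes `adata.obs` already holds a score column from `sc.tl.score_genes` (named `"score"` by default) and that a UMAP embedding has been computed; `score_quartiles` and `density_by_score_quartile` are hypothetical helpers, not scanpy functions.

```python
import pandas as pd

def score_quartiles(scores: pd.Series) -> pd.Series:
    """Cut a score column into four equally populated bins labelled Q1..Q4."""
    return pd.qcut(scores, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

def density_by_score_quartile(adata, score_key="score", basis="umap"):
    """Compute and plot embedding density for each score quartile.

    Assumes adata.obs[score_key] exists and adata.obsm["X_" + basis] is present.
    """
    # Imported here so the quartile helper can be used without scanpy installed.
    import scanpy as sc

    group_key = f"{score_key}_quartile"
    adata.obs[group_key] = score_quartiles(adata.obs[score_key])
    # embedding_density needs a categorical grouping, hence the quartile bins.
    sc.tl.embedding_density(adata, basis=basis, groupby=group_key)
    sc.pl.embedding_density(adata, basis=basis, key=f"{basis}_density_{group_key}")

# Quick check of the binning on made-up values.
print(list(score_quartiles(pd.Series(range(8)))))
```

Plotting the density for the top quartile shows where high-scoring cells concentrate on the embedding, which can then be compared across Conditions.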

giovp commented 3 years ago

that's a great explanation, thanks @dawe! @Elhl93 does this resolve the issue for you? If you'd still like to contribute a PR for plotting, let me know! I'll close this for the moment, but please feel free to reopen for any issue.

scverse / scanpy

Interpretation and improving the sc.tl.score_genes functionality #1629