stephenslab / fastTopics

Fast algorithms for fitting topic models and non-negative matrix factorizations to count data.
https://stephenslab.github.io/fastTopics

What is a sensible threshold to select the most informative features per topic? #49

Closed massonix closed 8 months ago

massonix commented 8 months ago

Dear developers of fastTopics,

Thank you for developing such a wonderful package. I wanted to ask what is, in your opinion, the best way to prioritize important features for a given topic after running the GoM differential expression analysis. The volcano plots look different across the different topics, for example:

[two volcano plot images]

What are the thresholds that you found to be the most meaningful? Should the thresholds be the same for every topic, or should they be topic-specific?

On a related question: which ranking best reflects the relative importance of each feature? Should I rank features just by postmean (LFC), or would you also take the z-score into account?

Thanks a lot in advance!

pcarbo commented 8 months ago

@massonix Thanks for your interest in fastTopics! The question of what is the "best" ranking is very much an open one; I would start by asking, what does it mean for a gene to be "important" or "meaningful"? One possible answer is that genes that are most "important" are those that show the largest expression increases (largest positive LFCs). But then this would not be a complete answer because you would also want to take into consideration how "significant" the changes are (e.g., z-score or p-value). This points to the advantage of visualizing the DE analysis results, e.g., using a volcano plot, because you can consider both measures—size of the change and the significance of the change—simultaneously.

If you are forced to choose a ranking (e.g., to summarize the results for a paper), then typically the ranking is done by z-score or p-value.
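As a rough illustration of ranking by z-score while still keeping an eye on effect size, here is a minimal base-R sketch. It assumes the DE results are available as two genes-by-topics matrices, called `postmean` (LFC) and `z` here to mirror the quantities mentioned in this thread. The numbers are made up, and `rank_genes`, `min_lfc`, and `n` are hypothetical names introduced only for this example, not part of fastTopics.

```r
# Hypothetical stand-ins for DE results: genes x topics matrices of
# posterior mean LFCs and z-scores (made-up numbers, not real output).
genes  <- paste0("gene", 1:6)
topics <- c("k1", "k2")
postmean <- matrix(c( 2.1, 0.3, -1.5, 4.0, 0.9,  3.2,
                     -0.4, 2.8,  1.1, 0.2, 5.0, -2.0),
                   nrow = 6, dimnames = list(genes, topics))
z <- matrix(c( 6.5, 0.8, -4.2, 2.1, 1.5,  9.0,
              -1.0, 7.3,  2.5, 0.4, 3.9, -5.5),
            nrow = 6, dimnames = list(genes, topics))

# Rank the genes of one topic by z-score, keeping only genes whose LFC
# exceeds a minimum. Both min_lfc and n are illustrative choices, not
# recommended thresholds.
rank_genes <- function (postmean, z, topic, min_lfc = 0, n = 3) {
  keep <- which(postmean[, topic] > min_lfc)
  ord  <- keep[order(z[keep, topic], decreasing = TRUE)]
  ord  <- head(ord, n)
  data.frame(gene = rownames(z)[ord],
             lfc  = postmean[ord, topic],
             z    = z[ord, topic],
             row.names = NULL)
}

top_k1 <- rank_genes(postmean, z, "k1", min_lfc = 1)
top_k2 <- rank_genes(postmean, z, "k2")
print(top_k1)
print(top_k2)
```

Sorting on the z-score and filtering on the LFC keeps the two measures separate, so the LFC cutoff can be changed per topic without altering the ranking rule itself.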

Hope this helps.

massonix commented 8 months ago

thanks, that's very useful!