neurorestore / Augur

Cell type prioritization in single-cell data
MIT License

Increase robustness of cell type prioritization #30

Open tkapello opened 10 months ago

tkapello commented 10 months ago

Hi,

Thank you for your interesting package. In my dataset of ~20,000 cells and 26,897 genes, the tool reported 20,000 unique significant genes. Does this make sense? The number seems relatively high (>60% of all genes). Apart from adjusting the trees, is there a way to tailor the analysis to increase the robustness of the genes used to prioritize cell types?

Best, Theo

skinnider commented 10 months ago

Hi Theo - I'm not entirely sure I understand your question. Augur doesn't test for statistical significance but simply returns the feature importance from the random forest algorithm. But there are a number of reasons to take these importances with a grain of salt and if you are interested in identifying statistically significant differences, a conventional differential expression (DE) analysis as implemented in our Libra package might make more sense.

Beyond that, you can set augur_mode='velocity' to disable feature selection and use alternative feature selection methods, or at the expense of a longer runtime you can just run Augur on all genes.

tkapello commented 10 months ago

Sorry, I was not clear enough. I guess my question can be rephrased as: what does "feature importance" actually mean, and how can it be interpreted?

skinnider commented 10 months ago

I am probably not going to give a better explanation than in the randomForest documentation. In Augur, the importance values are then averaged over repeated subsamples for each cell type. In general, I would recommend using the results of a DE analysis with Libra to identify genes that are changing between conditions within individual cell types, rather than relying on feature importance.
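
To make the averaging step concrete, here is a minimal Python sketch (not Augur's actual R code). It assumes a hypothetical list of `(cell_type, gene, importance)` records, one per subsample in which the gene was scored, and averages them per cell type and gene. The gene names, values, and the function name `average_importance` are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_importance(records):
    """Average importance per (cell_type, gene) over all subsamples.

    `records` is an iterable of (cell_type, gene, importance) tuples,
    one tuple per subsample in which the gene received an importance.
    This layout is an assumption made for this sketch.
    """
    grouped = defaultdict(list)
    for cell_type, gene, importance in records:
        grouped[(cell_type, gene)].append(importance)
    return {key: mean(values) for key, values in grouped.items()}


# Toy input: two subsamples for "Microglia", one for "Neuron".
records = [
    ("Microglia", "GeneA", 0.50),
    ("Microglia", "GeneA", 0.25),
    ("Microglia", "GeneB", -0.25),
    ("Microglia", "GeneB", 0.25),
    ("Neuron", "GeneA", 0.0),
]

avg = average_importance(records)
for (cell_type, gene), value in sorted(avg.items()):
    print(cell_type, gene, value)
```

Note that `GeneB` averages to zero in this toy input even though it received a nonzero importance in each subsample, which is one reason an averaged importance is hard to read as evidence for an individual gene.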

tkapello commented 10 months ago

Thank you again, I was wondering more about your interpretation of the usability of "feature importance" in cell type prioritization. For example, would 20,000 important features correlate with higher robustness rather than 15,000 features? Or would you say there is a lower threshold of features that signals more confidence, e.g. "AUC=0.8 based on 18,000 important features" compared to "AUC=0.7 based on 10,000 important features"?

skinnider commented 10 months ago

In general, I don't really factor this in and go solely by the AUC. Many subsamples of equal size are drawn for each cell type (50 by default), so the fact that 18,000 features have an assigned feature importance doesn't mean that all 18,000 were used by every classifier trained for that cell type. Feature importance can also be zero or negative, so an assigned feature importance doesn't mean that the gene actually has a positive impact on classification.
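
The two points above can be checked on a table of per-subsample importances. The following is a rough Python sketch under stated assumptions: the `(subsample_id, gene, importance)` record layout, the gene names, and the helper names `summarize_genes` and `positive_genes` are all hypothetical, and each gene is assumed to appear at most once per subsample. It does not reproduce Augur's output format.

```python
from collections import defaultdict


def summarize_genes(records, n_subsamples):
    """For each gene, report how often it was scored and its mean importance.

    `records` holds (subsample_id, gene, importance) tuples; a gene missing
    from a subsample simply has no record for it (an assumption of this sketch).
    """
    per_gene = defaultdict(list)
    for _subsample_id, gene, importance in records:
        per_gene[gene].append(importance)
    return {
        gene: {
            "fraction_of_subsamples": len(values) / n_subsamples,
            "mean_importance": sum(values) / len(values),
        }
        for gene, values in per_gene.items()
    }


def positive_genes(summary):
    """Genes whose mean importance is strictly positive."""
    return sorted(g for g, s in summary.items() if s["mean_importance"] > 0)


# Toy input with four subsamples: GeneA is scored in all of them,
# GeneB once with zero importance, GeneC twice with a negative mean.
records = [
    (1, "GeneA", 0.5), (2, "GeneA", 0.25), (3, "GeneA", 0.75), (4, "GeneA", 0.5),
    (1, "GeneB", 0.0),
    (2, "GeneC", -0.5), (3, "GeneC", 0.25),
]

summary = summarize_genes(records, n_subsamples=4)
print(len(summary), "genes have an assigned importance")
print(positive_genes(summary), "have a positive mean importance")
```

In this toy input all three genes have an assigned importance, but only one was scored in every subsample and only one has a positive mean, so the raw count of genes with an importance says little about how many genes actually drive the classification.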