neurorestore / Augur

Cell type prioritization in single-cell data
MIT License

Using Augur with possible confounding effects #16

Closed by zroger49 2 years ago

zroger49 commented 2 years ago

Hi! I'm analyzing a single-cell dataset and I'm interested in checking which cell types are most perturbed by my experimental condition. The dataset is annotated not only with the experimental condition, but also with 2 other variables which I believe might be confounding (Sex and Disease status). I ran the Augur analysis using only cells from healthy individuals, and again using the complete dataset, and I noticed changes in the cell type ranking.

What is the recommended approach to control for possible confounding variables? Is there a way to specify the model matrix when using logistic regression? Or should I correct my data using batch correction tools (as exemplified in case study 5)?

Cheers

jordansquair commented 2 years ago

Sorry for the delay. Can you explain your experimental design a bit more? I'm not sure I completely understand what you mean when you say you ran Augur using only cells from healthy individuals.

zroger49 commented 2 years ago

Apologies, I should have been clearer.

My experimental design contains cells from human individuals, both smokers and never smokers (smoking status is my variable of interest). In addition, I also have information about Sex, Age and Lung Disease status, with the latter having 3 levels (Healthy, COPD, IPF). From other analyses, I believe that disease status might be confounding. I ran Augur 2 times, first using the entire dataset (Healthy + COPD + IPF individuals) and then using only the healthy individuals. Comparing both runs, there were changes in both the AUC and the ranking of my cell types, which left me concerned.

What is the recommended approach to control for possible confounding variables?

jordansquair commented 2 years ago

My suggestion would be to run the analysis separately for each disease group, if you are only interested in the effect of smoking. You could also consider differential prioritization if you have smokers and non-smokers for each disease. Otherwise, I probably wouldn't try to regress out such a big confounder.
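A rough sketch of what running the analysis separately could look like, in case it helps. The object and column names here (expr, meta, "smoking", "disease", "cell_type") and the helper run_by_stratum are placeholders I made up for illustration, so adapt them to your data. It assumes expr is a genes x cells matrix and meta has one row per cell in the same order.

```r
# Hypothetical helper: run Augur once per level of a stratifying column.
# Assumes expr is genes x cells and meta has one row per cell (same order).
# auc_fun defaults to Augur::calculate_auc, so Augur must be installed
# unless another function is supplied.
run_by_stratum <- function(expr, meta,
                           strata_col = "disease",
                           label_col = "smoking",
                           cell_type_col = "cell_type",
                           auc_fun = Augur::calculate_auc) {
  strata_values <- as.character(meta[[strata_col]])
  strata <- unique(strata_values[!is.na(strata_values)])

  results <- lapply(strata, function(s) {
    # which() drops cells whose stratum is NA
    keep <- which(strata_values == s)
    auc_fun(
      expr[, keep, drop = FALSE],
      meta[keep, , drop = FALSE],
      label_col = label_col,
      cell_type_col = cell_type_col
    )
  })
  names(results) <- strata
  results
}

# Example call (not run here):
# augur_by_disease <- run_by_stratum(expr, meta)
# augur_by_disease$Healthy$AUC   # per-cell-type AUC within healthy donors
```

Each stratum needs enough smokers and never smokers per cell type for the cross-validation to work, so small groups may drop out of the results.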

zroger49 commented 2 years ago

@jordansquair Sorry for the late response. I followed your suggestion, but I still get a different ranking of prioritized cell types depending on whether I use Healthy, COPD, or IPF individuals to predict the Smoking phenotype, which might have an underlying biological interpretation.

I just wanted to check something quick with you. I'm looking at the feature importance of my genes. I ran my analysis with the option importance = "accuracy". When I inspect the feature importance table (View(augur_object$feature_importance)), what does the "importance" column mean? I'm assuming it's the mean decrease in accuracy when that feature is included, meaning that the most important genes have the lowest importance value.

jordansquair commented 2 years ago

Sounds like there might be a biological reason - maybe the differential prioritization approach could be interesting for you.

By default Augur will provide you with the mean decrease in accuracy (see docs). Also see here for more details: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
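To make the direction concrete: mean decrease in accuracy measures how much accuracy drops when a feature's values are permuted, so larger values mean more important genes. Below is a toy base R illustration of that idea. It is not Augur's or randomForest's actual code: it uses a logistic regression on simulated data with a held-out test set, whereas randomForest permutes out-of-bag samples per tree. gene_a and gene_b are made-up names.

```r
set.seed(1)
n <- 400
label <- rep(c(0, 1), each = n / 2)

# Toy data: gene_a tracks the label, gene_b is pure noise
d <- data.frame(
  label  = label,
  gene_a = rnorm(n, mean = 2 * label),
  gene_b = rnorm(n)
)
train <- seq(1, n, by = 2)
test  <- seq(2, n, by = 2)

fit <- glm(label ~ gene_a + gene_b, data = d[train, ], family = binomial)

# Fraction of correct predictions on a data frame
accuracy <- function(newdata) {
  pred <- as.integer(predict(fit, newdata, type = "response") > 0.5)
  mean(pred == newdata$label)
}

baseline <- accuracy(d[test, ])

# Permute one gene at a time in the test set and record the accuracy drop,
# averaged over 20 permutations
decrease <- sapply(c("gene_a", "gene_b"), function(g) {
  mean(replicate(20, {
    shuffled <- d[test, ]
    shuffled[[g]] <- sample(shuffled[[g]])
    baseline - accuracy(shuffled)
  }))
})

decrease
```

Here gene_a should show a clearly positive decrease and gene_b a value near zero, so sorting the importance column in descending order puts the most informative genes first.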