Compute log fold change with linear model

stuart-lab / signac

R toolkit for the analysis of single-cell chromatin data

https://stuartlab.org/signac/

Compute log fold change with linear model #1035

Open danielcgingerich opened 2 years ago

danielcgingerich commented 2 years ago

One thing that has always bugged me about the way fold change is calculated is that it is not adjusted for latent variables; only the p-value calculation controls for confounding variables. I am referring here to the 'LR' test commonly used for scATAC data.

Why not use a linear model to calculate fold change?

The formula might look something like:

log(peak) ~ group + latent.vars

perhaps with the log transformation applied to the latent variables as well.

group is the categorical variable you are testing, and latent.vars are the confounders. If I understand correctly, the beta coefficient for group would therefore be the logFC between the two groups, adjusted for confounders. I believe the MAST DE test does this for scRNA.
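To make the proposal concrete, here is a minimal sketch in plain Python (standard library only) of fitting log(peak) ~ group + latent.vars by ordinary least squares and reading off the group coefficient. It is not Signac or Seurat code: the function names (`solve_linear_system`, `ols_coefficients`, `adjusted_log_fc`, `unadjusted_log_fc`) are hypothetical, and the toy values are made up so that the true group effect is 0.5 on the natural-log scale while the latent variable is imbalanced between groups.

```python
from statistics import mean


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(m[row][c] * x[c] for c in range(row + 1, n))
        x[row] = (m[row][n] - s) / m[row][row]
    return x


def ols_coefficients(design, response):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, response)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def adjusted_log_fc(log_signal, group, latent):
    """Coefficient on `group` in log_signal ~ intercept + group + latent."""
    design = [[1.0, float(g), float(z)] for g, z in zip(group, latent)]
    return ols_coefficients(design, log_signal)[1]


def unadjusted_log_fc(log_signal, group):
    """Difference in mean log signal between group 1 and group 0."""
    in_group = [y for y, g in zip(log_signal, group) if g == 1]
    out_group = [y for y, g in zip(log_signal, group) if g == 0]
    return mean(in_group) - mean(out_group)


# Made-up toy data: the latent variable (for example, log sequencing depth)
# is higher in group 1, so it confounds the unadjusted comparison.
group = [0, 0, 0, 0, 1, 1, 1, 1]
latent = [1.0, 2.0, 3.0, 4.0, 3.0, 4.0, 5.0, 6.0]
# Noise-free by construction: log signal = 1.0 + 0.5 * group + 0.2 * latent.
log_signal = [1.0 + 0.5 * g + 0.2 * z for g, z in zip(group, latent)]

adjusted = adjusted_log_fc(log_signal, group, latent)
unadjusted = unadjusted_log_fc(log_signal, group)
print(f"adjusted logFC:   {adjusted:.3f}")
print(f"unadjusted logFC: {unadjusted:.3f}")
```

In this toy setup the adjusted coefficient recovers the group effect built into the data, while the unadjusted difference also absorbs the depth imbalance. Both values are on the natural-log scale; dividing by ln 2 would put them on the log2 scale.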

danielcgingerich commented 2 years ago

@timoast any word on this? Another person at my work recently asked me the same question. No rush!

timoast commented 2 years ago

I think what you suggest is very reasonable, although it would not be the fold change, so I wouldn't necessarily replace the fold change column in the FindMarkers output with this. I don't have bandwidth to work on this right now, but you could make a PR for Seurat if you're interested in implementing it, or you could write a separate function in a new package.

danielcgingerich commented 2 years ago

@timoast Hey Tim, thanks for getting back to me. I am a little confused by how it would not be a log fold change. If the dependent variable is log transformed, and the predictor variable is binary (0 or 1), then the slope should be the log fold change. This would be the case for modeling log(TF-IDF count) ~ disease_status. Maybe I am misunderstanding something - would like to know your thoughts!
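The claim about the slope can be written out for the case with no latent variables. The notation below (y_i for the signal in cell i, g_i for the 0/1 group label, n_0 and n_1 for the group sizes) is introduced here for illustration and does not come from the thread.

```latex
\begin{aligned}
\log y_i &= \beta_0 + \beta_1 g_i + \varepsilon_i, \qquad g_i \in \{0, 1\} \\
\hat{\beta}_0 &= \frac{1}{n_0} \sum_{i:\, g_i = 0} \log y_i \\
\hat{\beta}_1 &= \frac{1}{n_1} \sum_{i:\, g_i = 1} \log y_i
               - \frac{1}{n_0} \sum_{i:\, g_i = 0} \log y_i \\
              &= \log \frac{\left( \prod_{i:\, g_i = 1} y_i \right)^{1/n_1}}
                           {\left( \prod_{i:\, g_i = 0} y_i \right)^{1/n_0}}
\end{aligned}
```

So the least-squares slope is the log of a ratio of geometric means. That makes it a log fold change in the sense used here, though it will generally differ numerically from the log of a ratio of arithmetic means.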

timoast commented 2 years ago

Fold change is defined as the ratio between the two counts. A coefficient that takes latent variables into account would be slightly different from that ratio; that's all I mean. Without latent variables in the model, I agree it's the same.

danielcgingerich commented 2 years ago

@timoast Gotcha, thank you!

Would you say it is more common for researchers to report fold change as the unadjusted ratio between two values, or to adjust for latent variables as well? I am asking not only about single-cell work but about other bioinformatics fields too.

timoast commented 2 years ago

I haven't seen many cases where people adjust fold change for latent variables, but I don't see any problem with reporting that as long as it's clear what's being calculated. IMO, if you call it "fold change", most people would probably assume it means a simple ratio between two values.