sc.pp.regress_out and sc.pp.highly_variable interplay

scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.

https://scanpy.readthedocs.io

BSD 3-Clause "New" or "Revised" License

sc.pp.regress_out and sc.pp.highly_variable interplay #722

Open LuckyMD opened 5 years ago

LuckyMD commented 5 years ago

Hi all,

I've been wondering about this for a while. As sc.pp.regress_out only leaves residuals, the resulting expression values have 0 mean for every gene. Thus, you can no longer meaningfully use sc.pp.highly_variable_genes afterwards, since it bins genes by their mean expression. Discarding the mean like this seems like a bad idea. An easy fix would be to also keep the intercept value, and not only the residuals, from sc.pp.regress_out. What do you guys think?
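To make the point concrete, here is a minimal NumPy sketch of the idea. It does not call scanpy; it assumes an ordinary least-squares fit with an intercept column, and the simulated covariate, gene baselines, and variable names are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 500

# Hypothetical continuous covariate, e.g. a cell cycle score.
covariate = rng.normal(size=n_cells)

# Two simulated genes with very different baseline expression.
expr = np.column_stack([
    5.0 + 0.8 * covariate + rng.normal(scale=0.5, size=n_cells),
    0.5 + 0.2 * covariate + rng.normal(scale=0.5, size=n_cells),
])

# Design matrix with an intercept column, fitted per gene by least squares.
design = np.column_stack([np.ones(n_cells), covariate])
coef, *_ = np.linalg.lstsq(design, expr, rcond=None)  # shape (2, n_genes)

# Residuals only: every gene ends up with (numerically) zero mean.
residuals = expr - design @ coef

# Proposed variant: add the fitted intercept back to the residuals.
residuals_plus_intercept = residuals + coef[0]

print(residuals.mean(axis=0))
print(residuals_plus_intercept.mean(axis=0))
```

With residuals alone, both genes have the same mean, so binning by mean cannot tell them apart. Adding the intercept back restores a per-gene magnitude. Note that the intercept equals the original gene mean only if the covariate itself is centered.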

If this sounds like a good idea to you, I will put it on my todo list for a pull request.

ivirshup commented 5 years ago

I think it makes sense to allow keeping a measure of magnitude, potentially implemented as an option, like zero_center in sc.pp.scale.

I'd be interested to see how different highly variable gene selection would be on data transformed this way, compared with the batched approach we have now.

LuckyMD commented 5 years ago

I wasn't aware that you could run sc.pp.scale without obtaining mean 0 at the end. Would that just scale the variance per gene then?
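As a rough illustration of what scaling without centering means, here is a NumPy sketch. It is not scanpy's implementation; it assumes the scaling is a per-gene division by the sample standard deviation (ddof=1), and the simulated data is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated non-negative expression for 3 genes with different magnitudes.
expr = rng.gamma(shape=2.0, scale=[1.0, 3.0, 0.5], size=(400, 3))

# Per-gene standard deviation (ddof=1 is an assumption of this sketch).
std = expr.std(axis=0, ddof=1)

# Scaling without centering: unit variance, mean stays positive.
scaled_no_center = expr / std

# Scaling with centering: unit variance and zero mean.
scaled_centered = (expr - expr.mean(axis=0)) / std

print(scaled_no_center.mean(axis=0), scaled_no_center.var(axis=0, ddof=1))
print(scaled_centered.mean(axis=0), scaled_centered.var(axis=0, ddof=1))
```

In this sketch the uncentered version does only rescale the variance per gene: each gene gets unit variance, while its mean is divided by the same factor rather than removed.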

As for your question on HVG selection after sc.pp.regress_out vs in batches... I think that's an interesting question, but I reckon the two scenarios are actually not that related. I normally wouldn't use sc.pp.regress_out to remove batch effects, but rather to regress out continuous covariates like cell cycle scores. Batch effect removal is probably best done with methods that account for the variance contribution of the batch effect as well, such as ComBat... or more complex data integration methods (Seurat, MNN, Scanorama). Either way, it would be an interesting comparison... just with a caveat ^^.

JuliaChristiaanse commented 1 year ago

Hi @LuckyMD, any updates regarding this issue? I am fairly new to scanpy and I am working on combining regress_out() and HVG selection in the best way possible. I keep wondering whether I should regress out and scale before or after finding HVGs. Any tips or updates? Everything is welcome :)