Closed vitkl closed 1 week ago
I think I need more time to digest this! But would you be willing to move it over to Discourse? There you can use $ $ for inline math and $$ $$ for math blocks, with LaTeX rendering.
I've implemented a working version of this paper in an external scvi module: https://github.com/ricomnl/scvi-ar/pull/1/files. I am currently running some tests, but I will create a PR to merge this in.
Thank you @ricomnl! I will read that paper. One common issue is assuming that the background RNA constitutes the same proportion of counts rather than the same absolute number. This is a problem because cell types with more RNA are then assumed to have more background, which is not true for free-floating RNA. Cell barcode swapping can indeed lead to a higher background in cell types with more RNA, but barcode swapping seems to be lower than snRNA free-floating contamination. How does this paper address this?
The scVI model is quite popular, so there is also an argument for including background correction directly in scVI.
Another issue is that we don't have count matrices for empty droplets for many multi-atlas integration projects. They can be generated by completely remapping the data, but that could be challenging due to the resources required and/or data access. So it would be good to have a middle-ground option between no background correction and background correction that utilises empty droplets.
@adamgayoso As we discussed elsewhere, the reason I created an issue rather than scverse discussion is that I think this is a very concrete implementable suggestion. However, I need some explanation of how to add/modify variables in scVI code.
We have a version of this implemented in `scvi.external.SCAR`.
Introduction
I would argue that after 5+ years of analysing droplet-based sc/snRNA-seq data, we know the top-N technical RNA detection effects:
Note that here, the goal is to enumerate and correct for as many technical effects as possible retaining all other heterogeneity as biology - even if the biological change in expression is induced by experiment factors such as cell dissociation-induced stress or disease effects.
Effects 1, 2 and 4 are multiplicative in nature because they change RNA detection rates, while effect 3 is inherently additive, because ambient RNA is physically added to RNA from cells and is subject to the same RNA detection rates as biological expression from cells. This suggests a more principled approach to correcting technical effects: modelling additive effects for each `gene * batch` and multiplicative effects for `batch`, `cell` and `technology * gene`. In scVI, effects 1 and 2 are corrected by the cell-specific normalisation variable, and effect 4 is corrected by parameters introduced via batch_key and categorical_covariate_keys. Effect 3 is not corrected within scVI but is corrected by an independent tool, CellBender.
It would be great if the scVI model could perform correction of additive background RNA. The approach can be limited compared to https://github.com/broadinstitute/CellBender but it would be great to have the option.
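To make the additive/multiplicative distinction concrete, here is a minimal NumPy sketch with made-up numbers. All names and values are illustrative assumptions, not scVI internals. A cell-level multiplicative detection effect rescales every gene equally, so per-cell proportions are unchanged, whereas an additive ambient term shifts the proportions and gives counts to a gene the cell does not express.

```python
import numpy as np

# Hypothetical expected molecules from one cell over 3 genes (illustrative values).
cell_expression = np.array([90.0, 10.0, 0.0])
# Hypothetical ambient RNA profile of the batch (expected molecules per droplet).
ambient = np.array([5.0, 5.0, 10.0])
# Hypothetical cell-specific RNA detection rate (a multiplicative effect).
detection_rate = 0.1


def proportions(x):
    """Per-gene fraction of the total expected count."""
    return x / x.sum()


# Multiplicative effect only: all genes are scaled by the same factor.
mu_multiplicative = detection_rate * cell_expression

# Ambient RNA is added to the cell's RNA and then detected at the same rate.
mu_with_ambient = detection_rate * (cell_expression + ambient)

print(proportions(cell_expression))
print(proportions(mu_multiplicative))  # same proportions as the line above
print(proportions(mu_with_ambient))    # shifted towards the ambient profile
```

Normalising by total counts removes the multiplicative effect here but cannot remove the ambient contribution, which is why a separate additive term is needed.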
Current scVI LDVAE could be viewed as follows:
$D_{c,g} \sim \mathrm{NB}(\alpha = \alpha_{g}, \mu = \mu_{c,g})$
$\mu_{c,g} = \mathrm{softmax}_{c}(z_{c,f} \,@\, \mathrm{weights}_{f,g} + y_{e,g}) \cdot y_{c}$
which loosely corresponds to the multiplicative effect in combat:
$\mu_{c,g} = \mathrm{softmax}_{c}(z_{c,f} \,@\, \mathrm{weights}_{f,g}) \cdot y_{c} \cdot y_{e,g}$
Non-linear decoder scVI adds the same multiplicative correction in the final layer:
$\mu_{c,g} = \mathrm{softmax}_{c}(z_{c,h} \,@\, \mathrm{weights}_{h,g}) \cdot y_{c} \cdot y_{e,g}$
$z_{c,h} = \mathrm{ReLU}(z_{c,f} \,@\, \mathrm{weights}_{f,h} + y_{e,h})$
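As a rough illustration, the non-linear decoder mean above can be sketched in NumPy. This follows the equations as written in this issue, not the scvi-tools source code. It assumes that `e` indexes the batch/experiment covariate, that the softmax normalises over genes within each cell, and that the multiplicative factors are positive; all array names, sizes and random values are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
n_cells, n_latent, n_hidden, n_genes, n_batches = 4, 2, 8, 5, 3

# Hypothetical inputs: latent variables and the batch label of each cell.
z_cf = rng.normal(size=(n_cells, n_latent))
batch_of_cell = rng.integers(0, n_batches, size=n_cells)

# Hypothetical decoder parameters.
weights_fh = rng.normal(size=(n_latent, n_hidden))
weights_hg = rng.normal(size=(n_hidden, n_genes))
y_eh = rng.normal(size=(n_batches, n_hidden))  # batch effect on the hidden layer
y_eg = np.exp(rng.normal(scale=0.1, size=(n_batches, n_genes)))  # positive batch * gene factor
y_c = rng.uniform(500.0, 5000.0, size=(n_cells, 1))  # cell-specific normalisation

# z_{c,h} = ReLU(z_{c,f} @ weights_{f,h} + y_{e,h})
z_ch = np.maximum(z_cf @ weights_fh + y_eh[batch_of_cell], 0.0)

# mu_{c,g} = softmax_c(z_{c,h} @ weights_{h,g}) * y_c * y_{e,g}
mu_cg = softmax(z_ch @ weights_hg, axis=1) * y_c * y_eg[batch_of_cell]

print(mu_cg.shape)
```

Every term in this mean is multiplicative, so there is no component that could absorb RNA added independently of the cell's own expression.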
Notation:
Proposed modification
I propose introducing the following additional parameter
$\mu_{c,g} = \mathrm{softmax}_{c}(z_{c,h} \,@\, \mathrm{weights}_{h,g} + y_{e,g} + y_{t,g}) \cdot y_{c} + (s_{b,g} \cdot \mathrm{yBackground}_{c})$
$z_{c,h} = \mathrm{ReLU}(z_{c,f} \,@\, \mathrm{weights}_{f,h} + y_{e,h} + y_{t,h})$
where the following new variables are introduced:
where the following variables work exactly as current scVI:
This should be fairly straightforward to implement - just adding 2 new variables and corresponding setup_anndata.
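A minimal NumPy sketch of the proposed mean, under stated assumptions: `s_bg` is a non-negative per-batch ambient profile that sums to 1 over genes, `y_background_c` is a non-negative per-cell background amount, and all covariate terms have already been gathered per cell (one row per cell). The function name `proposed_mean` and every value below are hypothetical; this is not how scvi-tools or `scvi.external.SCAR` implement it.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def proposed_mean(z_ch, weights_hg, y_eg, y_tg, y_c, s_bg, y_background_c):
    """mu_{c,g} = softmax_c(z @ weights + y_eg + y_tg) * y_c + s_bg * yBackground_c."""
    cell_term = softmax(z_ch @ weights_hg + y_eg + y_tg, axis=1) * y_c
    background_term = s_bg * y_background_c  # additive ambient RNA
    return cell_term + background_term


rng = np.random.default_rng(0)
n_cells, n_hidden, n_genes = 6, 8, 5

z_ch = np.maximum(rng.normal(size=(n_cells, n_hidden)), 0.0)  # hidden layer after ReLU
weights_hg = rng.normal(size=(n_hidden, n_genes))
y_eg = rng.normal(scale=0.1, size=(n_cells, n_genes))  # batch * gene term, per cell
y_tg = rng.normal(scale=0.1, size=(n_cells, n_genes))  # technology * gene term, per cell
y_c = rng.uniform(500.0, 5000.0, size=(n_cells, 1))    # cell-specific normalisation
s_bg = softmax(rng.normal(size=(n_cells, n_genes)), axis=1)  # ambient profile, per cell
y_background_c = np.full((n_cells, 1), 50.0)  # same absolute background for every cell

mu_cg = proposed_mean(z_ch, weights_hg, y_eg, y_tg, y_c, s_bg, y_background_c)
mu_no_background = proposed_mean(
    z_ch, weights_hg, y_eg, y_tg, y_c, s_bg, np.zeros((n_cells, 1))
)
print(mu_cg.sum(axis=1) - mu_no_background.sum(axis=1))
```

With these assumptions each cell's total expected count is `y_c + y_background_c`, so the background is an absolute amount that does not scale with how much RNA the cell contains.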
What do you think @adamgayoso?