additive effects for background RNA correction in scVI

vitkl commented 1 year ago

Introduction

I would argue that after 5+ years of analysing droplet-based sc/snRNA-seq data, we know the top-N technical RNA detection effects:

1. Per-batch (10x reaction/inlet) detection rate originating from the PCR + sequencing step, in part due to the sequencing budget normally allocated to 10x reactions.
2. Within-batch, per-cell RNA detection efficiency originating from RT variability between droplets.
3. Background RNA contamination originating from free-floating/ambient/soup RNA, cell barcode swapping, and other mechanisms described in detail by the CellBender authors (https://github.com/broadinstitute/CellBender). This effect is specific to batch (10x reaction/inlet).
4. Technology- and gene-specific RNA detection efficiency, for example 5' vs 3' 10x technology. Other technologies, such as probe-hybridisation-based methods, could have more complex gene- and batch-specific RNA detection efficiency effects.
5. Add yours

Note that the goal here is to enumerate and correct for as many technical effects as possible while retaining all other heterogeneity as biology - even if the biological change in expression is induced by experimental factors such as cell dissociation-induced stress or disease effects.

Effects 1, 2 and 4 are multiplicative in nature because they change RNA detection rates, while effect 3 is inherently additive: ambient RNA is physically added to the RNA from cells and is then subject to the same RNA detection rates as the cells' own expression. This suggests a more principled approach to correcting technical effects: model additive effects for each gene × batch combination, and multiplicative effects for batch, cell and technology × gene. In scVI, effects 1 and 2 are corrected by the cell-specific normalisation variable, and effect 4 is corrected by parameters introduced via batch_key and categorical_covariate_keys. Effect 3 is not corrected within scVI but is corrected by an independent tool, CellBender.
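To make the additive vs multiplicative distinction concrete, here is a toy calculation for one droplet. All numbers and names (`true_expression`, `ambient`, `detection_rate`) are made up for illustration and are not taken from scVI or CellBender. It only shows that undoing the detection rate leaves the ambient contribution in place, so a gene the cell does not express still appears expressed until the additive term is subtracted.

```python
# Hypothetical numbers, chosen only to illustrate the argument.
true_expression = {"GeneA": 100.0, "GeneB": 0.0}  # RNA molecules from the cell
ambient = {"GeneA": 2.0, "GeneB": 5.0}            # ambient molecules in the droplet
detection_rate = 0.1                              # effects 1, 2 and 4 combined


def expected_counts(expression, background, rate):
    """Ambient RNA is added to cell RNA; both pass through the same detection rate."""
    return {g: rate * (expression[g] + background[g]) for g in expression}


observed = expected_counts(true_expression, ambient, detection_rate)

# Dividing by the detection rate undoes the multiplicative part only:
# GeneB still looks expressed even though the cell contains none of it.
rescaled = {g: observed[g] / detection_rate for g in observed}

# Subtracting the additive term is what recovers the cell's own expression.
corrected = {g: rescaled[g] - ambient[g] for g in rescaled}
```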

It would be great if the scVI model could correct for additive background RNA. The approach can be more limited than https://github.com/broadinstitute/CellBender, but it would be great to have the option.

Current scVI LDVAE could be viewed as follows:

$D_{c,g} \sim \mathrm{NB}(\alpha = \alpha_{g}, \mu = \mu_{c,g})$

$\mu_{c,g} = \mathrm{softmax}_{c}(z_{c,f} @ \mathrm{weights}_{f,g} + y_{e,g}) \cdot y_{c}$

which loosely corresponds to the multiplicative effect in combat:

$\mu_{c,g} = \mathrm{softmax}_{c}(z_{c,f} @ \mathrm{weights}_{f,g}) \, y_{c} \, y_{e,g}$
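A small numerical check of why the correspondence is only loose: an offset added inside the softmax equals multiplying the softmax output by `exp(y_eg)` and then renormalising across genes. The sketch below uses NumPy with random toy values. It assumes the softmax is taken over genes within each cell and uses a single batch, and it is not scVI code.

```python
import numpy as np


def softmax(x):
    """Softmax over the last axis (genes), computed stably."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
n_cells, n_latent, n_genes = 4, 3, 6
z = rng.normal(size=(n_cells, n_latent))         # z_{c,f}
weights = rng.normal(size=(n_latent, n_genes))   # weights_{f,g}
y_eg = rng.normal(size=(n_genes,))               # batch offset for one batch e
y_c = rng.uniform(500, 2000, size=(n_cells, 1))  # cell size factors

# Batch offset added inside the softmax (the LDVAE-style form).
mu_offset_inside = softmax(z @ weights + y_eg) * y_c

# Same thing written as a multiplicative effect exp(y_eg), then renormalised.
scaled = softmax(z @ weights) * np.exp(y_eg)
mu_multiplicative = scaled / scaled.sum(axis=-1, keepdims=True) * y_c
```

The renormalisation step is what separates this from a plain per-gene multiplier, because each cell's expected counts still sum to its size factor.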

Non-linear decoder scVI adds the same multiplicative correction in the final layer:

$\mu_{c,g} = \mathrm{softmax}_{c}(z_{c,h} @ \mathrm{weights}_{h,g}) \, y_{c} \, y_{e,g}$

$z_{c,h} = \mathrm{ReLU}(z_{c,f} @ \mathrm{weights}_{f,h} + y_{e,h})$

Notation:

c - cell
g - gene
e - batch_key
b - background_batch_key
t - categorical_covariate_keys
f - latent variables (n_latent)
h - hidden nodes (n_hidden)

Proposed modification

I propose introducing the following additional term:

$\mu_{c,g} = \mathrm{softmax}_{c}(z_{c,h} @ \mathrm{weights}_{h,g} + y_{e,g} + y_{t,g}) \, y_{c} + (s_{b,g} \, \mathrm{yBackground}_{c})$

$z_{c,h} = \mathrm{ReLU}(z_{c,f} @ \mathrm{weights}_{f,h} + y_{e,h} + y_{t,h})$

where the following new variables are introduced:

s_{b,g} - batch-specific additive background
yBackground_{c} - cell-specific RNA detection efficiency for background RNA

and where the following variables work exactly as in current scVI:

y_{e,g} , y_{e,h} - parameters introduced via batch_key
y_{t,g} , y_{t,h} - parameters introduced via categorical_covariate_keys
y_{c} - cell size factors
z_{c, f} - latent variables
z_{c, h} - decoder, last layer hidden nodes
weights_{f, h}, weights_{h, g} - weights and biases in each layer
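A minimal NumPy sketch of the proposed mean, written directly from the two equations above. This is not scVI code: the `proposed_mean` function, the `params` dictionary and the per-cell index arrays `e_idx`, `t_idx`, `b_idx` are hypothetical names. Keeping `s_bg` non-negative and passing `y_background_c` in as a given input are assumptions; the proposal does not say how either would be parameterised or learned.

```python
import numpy as np


def softmax(x):
    """Softmax over the last axis (genes), computed stably."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def proposed_mean(z_cf, params, e_idx, t_idx, b_idx, y_c, y_background_c):
    """Expected counts mu_{c,g}: cell-derived term plus additive background."""
    # z_{c,h} = ReLU(z_{c,f} @ weights_{f,h} + y_{e,h} + y_{t,h})
    z_ch = np.maximum(
        z_cf @ params["w_fh"] + params["y_eh"][e_idx] + params["y_th"][t_idx], 0.0
    )
    logits = z_ch @ params["w_hg"] + params["y_eg"][e_idx] + params["y_tg"][t_idx]
    cell_term = softmax(logits) * y_c[:, None]
    # s_{b,g} * yBackground_{c}: not passed through the softmax, hence additive.
    background_term = params["s_bg"][b_idx] * y_background_c[:, None]
    return cell_term + background_term


rng = np.random.default_rng(0)
n_cells, n_latent, n_hidden, n_genes = 5, 2, 4, 6
n_e, n_t, n_b = 2, 3, 2  # levels of batch_key, covariate, background_batch_key

params = {
    "w_fh": rng.normal(size=(n_latent, n_hidden)),
    "w_hg": rng.normal(size=(n_hidden, n_genes)),
    "y_eh": rng.normal(size=(n_e, n_hidden)),
    "y_th": rng.normal(size=(n_t, n_hidden)),
    "y_eg": rng.normal(size=(n_e, n_genes)),
    "y_tg": rng.normal(size=(n_t, n_genes)),
    "s_bg": rng.uniform(0.0, 0.01, size=(n_b, n_genes)),  # assumed non-negative
}
z_cf = rng.normal(size=(n_cells, n_latent))
e_idx = rng.integers(0, n_e, size=n_cells)
t_idx = rng.integers(0, n_t, size=n_cells)
b_idx = rng.integers(0, n_b, size=n_cells)
y_c = rng.uniform(500, 2000, size=n_cells)          # cell size factors
y_background_c = rng.uniform(10, 50, size=n_cells)  # background detection efficiency

mu = proposed_mean(z_cf, params, e_idx, t_idx, b_idx, y_c, y_background_c)
```

In this form the cell-derived part of each row still sums to `y_c`, and the background adds counts on top of it rather than rescaling it.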

This should be fairly straightforward to implement - just adding 2 new variables and the corresponding setup_anndata changes.

What do you think @adamgayoso?

adamgayoso commented 1 year ago

I think I need more time to digest this! But would you be willing to move it over to discourse? You can use $ $ for inline math and $$ $$ for math blocks there with latex rendering.

ricomnl commented 1 year ago

I've implemented a working version of this paper in an external scvi module: https://github.com/ricomnl/scvi-ar/pull/1/files. I'm currently running some tests, but I will create a PR to merge this in.

vitkl commented 1 year ago

Thank you @ricomnl! I will read that paper. One common issue is assuming that background RNA constitutes the same proportion of each cell's counts rather than the same absolute number of molecules. This is a problem because cell types with more RNA are then assumed to have more background, which is not true for free-floating RNA. Cell barcode swapping can indeed lead to a higher background in cell types with more RNA, but barcode swapping seems to contribute less than free-floating contamination in snRNA-seq. How does this paper address this?
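A toy calculation of the proportion vs absolute number point, with made-up molecule counts (none of these values come from the paper or from real data). If the background fraction is calibrated on a small cell and then applied as a fixed proportion, the implied background for a cell with ten times more RNA is ten times too high under the free-floating assumption.

```python
# Hypothetical molecule counts, for illustration only.
cell_rna = {"small_cell": 1_000, "large_cell": 10_000}
ambient_per_droplet = 50  # free-floating RNA: roughly the same for every droplet

# Constant-proportion assumption, calibrated on the small cell.
fraction = ambient_per_droplet / cell_rna["small_cell"]

absolute_model = {name: float(ambient_per_droplet) for name in cell_rna}
proportional_model = {name: fraction * total for name, total in cell_rna.items()}

# How much the proportional model overstates background relative to the absolute one.
overestimate = {
    name: proportional_model[name] / absolute_model[name] for name in cell_rna
}
```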

scVI model is quite popular so there is also an argument for including background correction directly into scVI.

Another issue is that we don't have count matrices for empty droplets for many multi-atlas integration projects. These can be generated by completely remapping the data, but that could be challenging due to the resources required and/or data access. So it would be good to have a middle-ground option between no background correction and background correction that utilises empty droplets.

vitkl commented 1 year ago

@adamgayoso As we discussed elsewhere, the reason I created an issue rather than an scverse discussion is that I think this is a very concrete, implementable suggestion. However, I need some explanation of how to add/modify variables in the scVI code.

martinkim0 commented 1 week ago

We have a version of this implemented in scvi.external.SCAR.

scverse / scvi-tools

additive effects for background RNA correction in scVI #1640
