satijalab / sctransform

R package for modeling single cell UMI expression data using regularized negative binomial regression
GNU General Public License v3.0
203 stars 33 forks source link

Working back to raw counts from corrected counts? #185

Closed abussing closed 3 months ago

abussing commented 3 months ago

My issue

I am trying to use the parameters from the GLM to work backwards from corrected counts $y{gc}$ to raw counts $x{gc}$.

Main idea

In the sctransform v2 paper, it says the corrected counts $y_{gc}$ are created by:

$$ y{gc} = \text{floor}\left(\frac{x{gc}-\mu{gc}}{\sigma{gc}} \sigma{gc}^{'} + \mu{gc}^{'} \right) $$

which means we can say that

$$ y{gc} \leq \frac{x{gc}-\mu{gc}}{\sigma{gc}} \sigma{gc}^{'} + \mu{gc}^{'} < y_{gc} + 1 $$

which further implies

$$ \left(y{gc} - \mu{gc}^{'}\right)\frac{\sigma{gc}}{\sigma{gc}^{'}} + \mu{gc} \leq x{gc} < \left(y{gc} + 1 - \mu{gc}^{'}\right)\frac{\sigma{gc}}{\sigma{gc}^{'}} + \mu_{gc} $$

and thus we have found an inequality that contains the raw counts $x_{gc}$.

Implementing in R

The $\mu{gc}$, $\mu{gc}^{'}$, $\sigma{gc}$, $\sigma{gc}^{'}$ from the above expressions are defined as:

$$ \begin{aligned} \mu{gc} &= \exp\left(\beta{g0} + \ln(nc)\right) \ \mu{gc}^{'} &= \exp\left(\beta_{g0} + \ln(n0)\right) \ \sigma{gc} &= \sqrt{\mu{gc} + \frac{\mu{gc}^2}{\thetag}}\ \sigma{gc}^{'} &= \sqrt{\mu{gc}^{'} + \frac{{\mu{gc}^{'}}^2}{\theta_g}} \end{aligned} $$

where $nc = \sum{g}x_{gc}$ and $n_0 = \text{median}\left(n_c\right)$.

$\beta_{g0}$, $\theta_g$, $n_c$, and $n_0$ are stored in the following locations of the sctransformed seurat object:

beta_g0 <- pbmc@assays$SCT@SCTModel.list$counts@feature.attributes$`(Intercept)`
theta_g <- pbmc@assays$SCT@SCTModel.list$counts@feature.attributes$theta
n_c <- pbmc@meta.data$nCount_RNA
n_0 <- median(pbmc@meta.data$nCount_RNA)

and so we have everything we need to calculate the inequality defined above.

Results

Now we apply this to the pmbc_small data set:

pbmc_small <- Seurat::SCTransform(pbmc_small, assay = "RNA")

undone_counts <- undo_counts(corrected_counts = pbmc_small@assays$SCT$counts,
                             feature.attributes = pbmc_small@assays$SCT@SCTModel.list$model1@feature.attributes,
                             cell.attributes = pbmc_small@assays$SCT@SCTModel.list$model1@cell.attributes) 

here is a look at 10 cells of gene LGALS1:

github_pic

our bounds are doing an okay job, but I am wondering why they are not always containing the true value between them?

saketkc commented 3 months ago

Thanks for your question. The corrected counts in the SCT assay by default use median of the sequencing depth (median(nCount_RNA)) as the scaling factor. For this inequality to hold overall, you would need to scale each cell to its original sequencing depth which would IMO defeat the purpose of 'correcting' the counts. Hope this helps!

abussing commented 3 months ago

Thanks for your question. The corrected counts in the SCT assay by default use median of the sequencing depth (median(nCount_RNA)) as the scaling factor. For this inequality to hold overall, you would need to scale each cell to its original sequencing depth which would IMO defeat the purpose of 'correcting' the counts. Hope this helps!

Thanks for your response. Can you explain which step of my process is incorrect? The inequality is derived directly from the definition of $y_{gc}$ using simple algebra.

As for the reason that I want to 'uncorrect' the counts, I am just trying to work through the steps that created the corrected counts, and being able to 'uncorrect' them would verify that for me.

saketkc commented 3 months ago

Corrected counts pbmc_small@assays$SCT$counts are caculated on the median(nCount) value. https://github.com/satijalab/sctransform/blob/8efb6929b4dd6da9c0849b5a19380e8fd0c4baeb/R/denoise.R#L83