Closed abussing closed 3 months ago
Thanks for your question. The corrected counts in the SCT assay by default use median of the sequencing depth (median(nCount_RNA)) as the scaling factor. For this inequality to hold overall, you would need to scale each cell to its original sequencing depth which would IMO defeat the purpose of 'correcting' the counts. Hope this helps!
Thanks for your question. The corrected counts in the SCT assay by default use median of the sequencing depth (median(nCount_RNA)) as the scaling factor. For this inequality to hold overall, you would need to scale each cell to its original sequencing depth which would IMO defeat the purpose of 'correcting' the counts. Hope this helps!
Thanks for your response. Can you explain which step of my process is incorrect? The inequality is derived directly from the definition of $y_{gc}$ using simple algebra.
As for the reason that I want to 'uncorrect' the counts, I am just trying to work through the steps that created the corrected counts, and being able to 'uncorrect' them would verify that for me.
Corrected counts pbmc_small@assays$SCT$counts
are caculated on the median(nCount) value.
https://github.com/satijalab/sctransform/blob/8efb6929b4dd6da9c0849b5a19380e8fd0c4baeb/R/denoise.R#L83
My issue
I am trying to use the parameters from the GLM to work backwards from corrected counts $y{gc}$ to raw counts $x{gc}$.
Main idea
In the sctransform v2 paper, it says the corrected counts $y_{gc}$ are created by:
$$ y{gc} = \text{floor}\left(\frac{x{gc}-\mu{gc}}{\sigma{gc}} \sigma{gc}^{'} + \mu{gc}^{'} \right) $$
which means we can say that
$$ y{gc} \leq \frac{x{gc}-\mu{gc}}{\sigma{gc}} \sigma{gc}^{'} + \mu{gc}^{'} < y_{gc} + 1 $$
which further implies
$$ \left(y{gc} - \mu{gc}^{'}\right)\frac{\sigma{gc}}{\sigma{gc}^{'}} + \mu{gc} \leq x{gc} < \left(y{gc} + 1 - \mu{gc}^{'}\right)\frac{\sigma{gc}}{\sigma{gc}^{'}} + \mu_{gc} $$
and thus we have found an inequality that contains the raw counts $x_{gc}$.
Implementing in R
The $\mu{gc}$, $\mu{gc}^{'}$, $\sigma{gc}$, $\sigma{gc}^{'}$ from the above expressions are defined as:
$$ \begin{aligned} \mu{gc} &= \exp\left(\beta{g0} + \ln(nc)\right) \ \mu{gc}^{'} &= \exp\left(\beta_{g0} + \ln(n0)\right) \ \sigma{gc} &= \sqrt{\mu{gc} + \frac{\mu{gc}^2}{\thetag}}\ \sigma{gc}^{'} &= \sqrt{\mu{gc}^{'} + \frac{{\mu{gc}^{'}}^2}{\theta_g}} \end{aligned} $$
where $nc = \sum{g}x_{gc}$ and $n_0 = \text{median}\left(n_c\right)$.
$\beta_{g0}$, $\theta_g$, $n_c$, and $n_0$ are stored in the following locations of the sctransformed seurat object:
and so we have everything we need to calculate the inequality defined above.
Results
Now we apply this to the pmbc_small data set:
here is a look at 10 cells of gene LGALS1:
our bounds are doing an okay job, but I am wondering why they are not always containing the true value between them?