jmbreda / Sanity

Filtering of Poisson noise on a single-cell RNA-seq UMI count matrix
GNU General Public License v3.0

Batch correction with Sanity: assumptions on average expression #18

Closed pschupp closed 1 year ago

pschupp commented 1 year ago

When using Sanity as a batch correction method, do the batches have to be independent of biological condition? Specifically, what happens when one batch has a different average expression of a gene relative to another batch? This could happen if the batches are not from the same sample. Because the means are different, does that mean that the LTQs from the two batches are not comparable? Should one check that the mean expression values are highly correlated across batches before doing this batch correction via Sanity?

jmbreda commented 1 year ago

We feel most comfortable using batch correction with Sanity in conditions where you expect the true distributions of expression levels to be roughly the same in each of the batches, i.e. where the cells in each batch correspond to the same biological condition. For lowly expressed genes even the means might be ‘noisy’, so it would not be too surprising if you find weak correlations for those. But if the mean expression levels of highly expressed genes are also very different between batches, that would indicate that the batches do not correspond to the same biological condition.
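To make this check concrete, here is a minimal sketch (not part of Sanity) that correlates per-gene mean expression between two batches, restricted to the most highly expressed genes. The function name `batch_mean_correlation`, the `top_fraction` cutoff, and the use of log mean UMI fractions are all assumptions chosen for illustration. It also assumes both count matrices share the same gene order and that every cell has at least one UMI.

```python
import numpy as np

def batch_mean_correlation(counts_a, counts_b, top_fraction=0.2):
    """Pearson correlation of per-gene log mean expression between two batches.

    counts_a, counts_b: arrays of shape (genes, cells), same gene order.
    top_fraction: fraction of genes (ranked by pooled mean) treated as
                  highly expressed; an arbitrary illustrative cutoff.
    """
    counts_a = np.asarray(counts_a, dtype=float)
    counts_b = np.asarray(counts_b, dtype=float)

    # Convert each cell to fractions of its total UMI count
    # (assumes no cell has a total of zero).
    frac_a = counts_a / counts_a.sum(axis=0, keepdims=True)
    frac_b = counts_b / counts_b.sum(axis=0, keepdims=True)

    # Per-gene mean fraction within each batch.
    mean_a = frac_a.mean(axis=1)
    mean_b = frac_b.mean(axis=1)

    # Keep only the most highly expressed genes, ranked by the pooled mean.
    pooled = (mean_a + mean_b) / 2
    n_top = max(2, int(round(top_fraction * len(pooled))))
    top = np.argsort(pooled)[::-1][:n_top]

    # Small pseudocount so genes with zero mean do not produce log(0).
    eps = 1e-9
    log_a = np.log(mean_a[top] + eps)
    log_b = np.log(mean_b[top] + eps)
    return float(np.corrcoef(log_a, log_b)[0, 1])
```

A correlation close to 1 among the highly expressed genes would be consistent with the batches sharing a biological condition. What counts as "close" is a judgment call that this sketch does not settle.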

If the different batches correspond to different biological conditions, and you have no reason to expect systematic technical differences across the batches, then we would probably just analyze the cells from all batches together.

However, if you know that different batches are not only biologically different but also technically different, and you still want to analyze them together, then we would advise simply running Sanity separately on each batch, taking the output tables of LTQs and their error bars, and concatenating them into a larger table.

You might need to take some care with the fact that different genes might be missing from the outputs of different batches (because they had strictly zero UMIs in that batch). We would then either restrict the analysis to the subset of genes at the intersection of the different batches, or take the union of genes across batches and, for a gene missing from one batch, fill in the average LTQ and error bar of the 0-UMI cells in the batch where the gene is expressed.
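The concatenation described above could be sketched as follows with pandas. This is an illustrative helper, not part of Sanity: the name `combine_batches` and its arguments are hypothetical, and it assumes each batch's table has already been loaded as a DataFrame indexed by gene ID with one column per cell. The same function would be applied once to the LTQ tables and once to the error-bar tables.

```python
import pandas as pd

def combine_batches(tables, genes="intersection"):
    """Concatenate per-batch (genes x cells) tables column-wise.

    tables: dict mapping a batch label to a DataFrame indexed by gene ID,
            with one column per cell.
    genes:  "intersection" keeps only genes present in every batch;
            "union" keeps all genes and leaves NaN where a batch lacks a gene.
    """
    if genes not in ("intersection", "union"):
        raise ValueError("genes must be 'intersection' or 'union'")
    join = "inner" if genes == "intersection" else "outer"

    # Prefix cell names with the batch label so column names stay unique.
    labelled = [df.add_prefix(f"{batch}_") for batch, df in tables.items()]

    # Align on the gene index and place the batches side by side.
    return pd.concat(labelled, axis=1, join=join)
```

With `genes="union"`, the NaN entries mark exactly the gene and batch combinations that still need to be filled in, for example using the 0-UMI averages mentioned above. That fill step is left out here because it also requires the raw UMI counts.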