Batch correction with Sanity: assumptions on average expression

We are most comfortable using batch correction with Sanity when you expect the true distributions of expression levels to be roughly the same in each batch, i.e. when the cells in each batch correspond to the same biological condition. For lowly expressed genes even the means can be noisy, so it would not be too surprising to find only weak correlations between batches for those. But if the mean expression levels of highly expressed genes also differ strongly between batches, that would indicate that the batches do not correspond to the same biological condition.
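One way to check this assumption is to correlate per-gene mean expression between two batches, restricted to the more highly expressed genes. The sketch below is only an illustration, not part of Sanity: the function names, the `top_fraction` cutoff, the pseudocount, and the simulated counts are all assumptions, and it expects genes x cells UMI count tables in which every cell has at least one UMI.

```python
import numpy as np
import pandas as pd

def mean_log_expression(counts, pseudocount=1e-6):
    """Per-gene log of the mean expression fraction (genes x cells UMI counts)."""
    # Convert each cell's counts to fractions of that cell's total UMIs.
    fractions = counts / counts.sum(axis=0)
    return np.log(fractions.mean(axis=1) + pseudocount)

def batch_mean_correlation(counts_a, counts_b, top_fraction=0.5):
    """Pearson correlation of log mean expression over the top expressed shared genes."""
    shared = counts_a.index.intersection(counts_b.index)
    mean_a = mean_log_expression(counts_a.loc[shared])
    mean_b = mean_log_expression(counts_b.loc[shared])
    # Rank genes by their average log mean expression across the two batches.
    average = (mean_a + mean_b) / 2
    n_top = max(2, int(len(shared) * top_fraction))
    top = average.sort_values(ascending=False).index[:n_top]
    return float(np.corrcoef(mean_a[top], mean_b[top])[0, 1])

# Toy data: two batches simulated from the same underlying expression rates.
rng = np.random.default_rng(0)
rates = np.logspace(-1, 2, 50)
genes = [f"gene{i}" for i in range(50)]
batch_a = pd.DataFrame(rng.poisson(rates[:, None], size=(50, 200)), index=genes)
batch_b = pd.DataFrame(rng.poisson(rates[:, None], size=(50, 300)), index=genes)

r = batch_mean_correlation(batch_a, batch_b)
print(f"correlation of mean expression among highly expressed genes: {r:.3f}")
```

A correlation close to 1 among the highly expressed genes is consistent with the batches sharing a biological condition; a low value suggests they do not.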

If the different batches correspond to different biological conditions, and you have no reason to expect systematic technical differences across the batches, then we would probably just analyze the cells from all batches together.

However, if you know that the batches are not only biologically different but also technically different, and you still want to analyze them together, then we would advise simply running Sanity separately on each batch, taking the output tables of LTQs and their error bars, and concatenating them into one larger table.
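The concatenation step could look roughly like the following, assuming each batch's LTQ table (and, in the same way, its error-bar table) has already been loaded as a genes x cells pandas DataFrame. The function name, the batch labels, and the toy values are hypothetical; cell names are prefixed with the batch label so they stay unique.

```python
import pandas as pd

def concatenate_batches(tables_by_batch):
    """Join genes x cells tables column-wise, one table per batch."""
    # Prefix each cell name with its batch label to avoid name clashes.
    labelled = [
        table.add_prefix(f"{batch}:") for batch, table in tables_by_batch.items()
    ]
    # Default outer join: a gene absent from one batch shows up as NaN there.
    return pd.concat(labelled, axis=1)

# Toy LTQ tables standing in for the per-batch Sanity outputs.
ltq_batch1 = pd.DataFrame(
    {"cellA": [0.3, -1.2], "cellB": [0.1, -0.9]}, index=["gene1", "gene2"]
)
ltq_batch2 = pd.DataFrame({"cellA": [0.4, -1.0]}, index=["gene1", "gene2"])

ltq_all = concatenate_batches({"batch1": ltq_batch1, "batch2": ltq_batch2})
print(ltq_all)
```

The same call applied to the error-bar tables gives a matching table of uncertainties with identical row and column labels.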

You might need to take some care with the fact that different genes can be missing from the outputs of different batches (because they had strictly zero UMIs in that batch). We would then either restrict the table to the subset of genes in the intersection of the batches, or take the union of genes across batches and, for a gene missing from one batch, fill in the average LTQ and error bar of the cells with 0 UMIs for that gene in the batch where it is expressed.
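Both options can be sketched as follows for two batches. This reflects one reading of the suggestion above and is not code from Sanity: the function names and toy tables are hypothetical, each `values_*` table is a genes x cells table of LTQs (or error bars), and each `counts_*` table holds the raw UMI counts for the same cells.

```python
import pandas as pd

def intersect_genes(values_a, values_b):
    """Keep only genes present in both batches."""
    return pd.concat([values_a, values_b], axis=1, join="inner")

def zero_umi_average(values, counts, gene):
    """Average value of `gene` over the cells where it has 0 UMIs."""
    zero_cells = counts.columns[(counts.loc[gene] == 0).to_numpy()]
    # NaN if the gene has no zero-UMI cells in this batch.
    return values.loc[gene, zero_cells].mean()

def union_genes(values_a, counts_a, values_b, counts_b):
    """Keep all genes; fill a batch's missing gene from the other batch."""
    combined = pd.concat([values_a, values_b], axis=1, join="outer")
    for gene in combined.index:
        if gene not in values_b.index:
            fill = zero_umi_average(values_a, counts_a, gene)
            combined.loc[gene, values_b.columns] = fill
        elif gene not in values_a.index:
            fill = zero_umi_average(values_b, counts_b, gene)
            combined.loc[gene, values_a.columns] = fill
    return combined

# Toy example: g2 had strictly zero UMIs in batch B, so it is absent there.
ltq_a = pd.DataFrame(
    [[0.5, 0.4, 0.6], [-2.0, -3.0, 1.0]],
    index=["g1", "g2"], columns=["a1", "a2", "a3"],
)
counts_a = pd.DataFrame(
    [[5, 4, 6], [0, 0, 3]], index=["g1", "g2"], columns=["a1", "a2", "a3"]
)
ltq_b = pd.DataFrame([[0.3, 0.2]], index=["g1"], columns=["b1", "b2"])
counts_b = pd.DataFrame([[3, 2]], index=["g1"], columns=["b1", "b2"])

ltq_intersection = intersect_genes(ltq_a, ltq_b)
ltq_union = union_genes(ltq_a, counts_a, ltq_b, counts_b)
```

Calling `union_genes` a second time with the error-bar tables in place of the LTQ tables fills the uncertainties in the same way. The intersection is the more conservative choice, since it introduces no imputed values.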
