omnideconv / SimBu

Simulate pseudo-bulk RNAseq samples from scRNAseq expression data
http://omnideconv.org/SimBu/
GNU General Public License v3.0

mRNA bias in SimBu #19

Closed FFinotello closed 2 years ago

FFinotello commented 2 years ago

Hello!

The modeling of the mRNA bias is a unique feature of SimBu, so it has to be very robust.

Thus, before writing the paper, I think we should spend some time reviewing the literature on mRNA bias in transcriptomic data.

grst commented 2 years ago

From my notes:

A recent extension of the MuSiC framework (Sosina et al.) addresses the different mRNA contents of different cell types, enabling MuSiC to generate absolute scores that can be compared both between samples and between cell types.

Maybe Lorenzo has more...

FFinotello commented 2 years ago

Also related to this: the mRNA scaling factors we consider and compare at some point in a heatmap are quite different. I would have the following comments for our final comparative analysis.

alex-d13 commented 2 years ago

> From my notes: A recent extension of the MuSiC framework (Sosina et al.) addresses different mRNA contents of different cell-types, enabling MuSiC to generate absolute scores that can be compared both between samples and cell-types.

I read the Sosina et al. paper. I think they only added the option to supply mRNA contents (they call it cell size), which are not calculated by MuSiC but externally. If I understand the formulas and the code of MuSiC correctly, they estimate cell size as the mean library size of all cells of cell type X.
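To make that reading concrete, here is a minimal sketch of "cell size as the mean library size per cell type". It only illustrates the description above and is not taken from the MuSiC code; the function name and the toy data are hypothetical.

```python
from collections import defaultdict


def mean_library_size_per_cell_type(counts, cell_types):
    """Return the mean total count (library size) per cell type.

    counts:     list of per-cell count vectors (one list of gene counts per cell)
    cell_types: list with one cell-type label per cell, in the same order
    """
    totals = defaultdict(float)   # summed library sizes per cell type
    n_cells = defaultdict(int)    # number of cells seen per cell type
    for cell_counts, cell_type in zip(counts, cell_types):
        totals[cell_type] += sum(cell_counts)
        n_cells[cell_type] += 1
    return {ct: totals[ct] / n_cells[ct] for ct in totals}


# Toy data (made up for illustration): three cells, three genes.
counts = [[10, 0, 5], [20, 5, 5], [1, 1, 2]]
cell_types = ["Macrophage", "Macrophage", "NK"]

cell_sizes = mean_library_size_per_cell_type(counts, cell_types)
print(cell_sizes)
```

Under this reading, a cell type with larger average library size gets a larger "cell size", which is then supplied to the deconvolution as an external scaling factor.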

LorenzoMerotto commented 2 years ago

For deconvolution algorithms, the mRNA bias is a bit of a grey area, since several methods do not address this issue clearly.

This is what I got from reading the various papers.

alex-d13 commented 2 years ago

Hi all,

Francesca and I analyzed the mRNA bias that we suspected is present in count data. Our assumption is that if we later add a bias that is itself derived from count data (as with spike-ins, where we calculate the ratio of spike-in counts over all counts), then an mRNA bias must already be contained in this count data. This would also mean that there is no, or at least less, bias in CPM/TPM data, since it gets lost in the normalization.
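A small sketch of why the bias would get lost in CPM data: CPM rescales every cell to the same total, so two cells with the same relative composition but different total mRNA content become indistinguishable. The helper name and the numbers are made up for illustration.

```python
def to_cpm(cell_counts):
    """Scale one cell's counts to counts per million (CPM)."""
    total = sum(cell_counts)
    return [c * 1e6 / total for c in cell_counts]


# Two hypothetical cells with identical composition but a 10x difference
# in total counts (a stand-in for different mRNA content).
small_cell = [10, 30, 60]      # 100 counts in total
large_cell = [100, 300, 600]   # 1000 counts in total

cpm_small = to_cpm(small_cell)
cpm_large = to_cpm(large_cell)

# In raw counts the large cell contributes 10x more signal;
# after CPM both profiles are the same and each sums to one million.
print(sum(small_cell), sum(large_cell))
print(cpm_small == cpm_large)
```

In raw counts, the difference in library size is still there, which is the bias we suspect.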

Setup

I will briefly describe the setup we used to test this assumption:

To check whether count data really does contain a bias, we tried to remove it by dividing the count matrix by a scaling factor. We tested two options: the number of reads per cell (_biasremoved (Reads)) and the number of expressed genes per cell (_biasremoved (Genes)). We also did one run in which the bias was not removed (_biaskept). Because we only suspect this bias in count data, not in CPM/TPM data, we did not remove it there.
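The two removal options can be sketched as follows. This is a simplified illustration under the assumption that each cell is divided by its own factor; the function name `remove_bias` and the `by` argument are hypothetical and not part of SimBu.

```python
def remove_bias(counts, by="reads"):
    """Divide each cell's counts by a per-cell scaling factor.

    by="reads": factor is the total number of counts in the cell
    by="genes": factor is the number of genes with a non-zero count
    """
    scaled = []
    for cell in counts:
        if by == "reads":
            factor = sum(cell)
        elif by == "genes":
            factor = sum(1 for c in cell if c > 0)
        else:
            raise ValueError("by must be 'reads' or 'genes'")
        if factor == 0:
            # Empty cell: nothing to scale, keep it as zeros.
            scaled.append([0.0 for _ in cell])
        else:
            scaled.append([c / factor for c in cell])
    return scaled


# Toy count matrix (made up): two cells, three genes.
counts = [[10, 0, 30], [100, 100, 200]]

by_reads = remove_bias(counts, by="reads")
by_genes = remove_bias(counts, by="genes")
print(by_reads)
print(by_genes)
```

With `by="reads"` every cell ends up with the same total, so library-size differences vanish. With `by="genes"` the totals still differ between cells.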

Results

image

image

We can see that scaling by reads indeed removes an internal bias in the count data. Note how the estimates of Macrophages and NK cells in Travaglini, or of Monocytes in Hao, are less overestimated when comparing the 2nd and 4th panels. This means a cell-type bias was present in the counts and was removed. Scaling by genes (panel 3) does not seem to remove any bias from the counts. We also see that there is no bias present in the CPM/TPM data.

Next steps

Now that we know that counts contain a bias, we want to add a parameter scale_singlecell_counts to SimBu, which (by default) removes this bias first. We will also have to check whether the simulations still follow the NB distribution and how a scaling factor added later on will influence the deconvolution results.
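For the NB check, one simple option (an assumption on my side, not something already in SimBu) is a method-of-moments look at overdispersion: for a negative binomial with mean mu and size parameter r, the variance is mu + mu^2 / r, so r can be estimated from the sample mean and variance of a gene across simulated samples. The function name and the toy values are hypothetical.

```python
from statistics import mean, variance


def nb_size_estimate(values):
    """Method-of-moments estimate of the NB size parameter r.

    Uses var = mu + mu^2 / r, so r = mu^2 / (var - mu).
    Returns None when the sample is not overdispersed (var <= mu),
    because the moment estimate is undefined in that case.
    """
    mu = mean(values)
    var = variance(values)  # sample variance (n - 1 in the denominator)
    if var <= mu:
        return None
    return mu * mu / (var - mu)


# Toy counts for one gene across five simulated samples (made up).
gene_counts = [0, 2, 4, 10, 14]
size = nb_size_estimate(gene_counts)
print(size)
```

Comparing such estimates before and after scaling would only be a first sanity check; a proper goodness-of-fit test would still be needed.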

This is quite a lot to read; if any questions come up about our setup or results, please let me know :) Looking forward to your comments, @mlist!

Cheers, Alex