mRNA bias in SimBu - GitHub Issues

FFinotello commented 2 years ago

Hello!

The modeling of the mRNA bias is a unique feature of SimBu, so it has to be very robust.

Thus, before writing the paper, I think we should spend some time reviewing the literature on mRNA bias in transcriptomic data.

Some deconvolution approaches already take it into consideration (EPIC, quanTIseq, ABIS). Is any second-generation method doing something similar? (@LorenzoMerotto, can you help?)
What is known about mRNA bias in scRNA-seq: is it present? In which features does it manifest (e.g. number of expressed genes, total counts)?
Which approaches have been attempted to correct for it (e.g. spike-ins)?
Which scRNA-seq spike-in datasets are available? They could be used to validate our approaches.

grst commented 2 years ago

From my notes:

A recent extension of the MuSiC framework (Sosina et al.) addresses the different mRNA contents of different cell types, enabling MuSiC to generate absolute scores that can be compared both between samples and between cell types.

Maybe Lorenzo has more...

FFinotello commented 2 years ago

Also related to this: the mRNA scaling factors that we consider, and at some point compare in a heatmap, are quite different from each other. I have the following comments for our final comparative analysis.

For spike-in datasets, the fraction of gene counts over total (spike-in + gene) counts can, if supported by our literature review, be seen as a silver-to-gold-standard measure of mRNA content bias;
The fraction of spike-in counts over total counts could instead be removed from our assessment, since it is derived from the measure above and, of course, anticorrelates with mRNA bias;
EPIC scaling factors were derived experimentally -> gold standard for validation;
ABIS (Monaco) factors were, I think, also derived experimentally. @alex-d13, could you please check?
Vento-Tormo-derived factors: this dataset is very different from the others (for no obvious reason) -> I would remove it and select another one instead, ideally one with spike-ins, so that we have a silver/gold-standard measure for validation.
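To make the two spike-in measures above concrete, here is a minimal sketch of how they relate for a single cell. The function name and the example counts are hypothetical and only for illustration; they do not come from any of the datasets discussed here.

```python
def spike_in_fractions(gene_counts, spike_in_counts):
    """Return (gene_fraction, spike_in_fraction) for one cell.

    gene_fraction     = gene counts / (gene counts + spike-in counts)
    spike_in_fraction = spike-in counts / (gene counts + spike-in counts)
    """
    total = gene_counts + spike_in_counts
    if total == 0:
        raise ValueError("cell has no counts")
    gene_fraction = gene_counts / total
    # The spike-in fraction is simply the complement of the gene fraction.
    return gene_fraction, 1.0 - gene_fraction


# Hypothetical per-cell totals: (endogenous gene counts, spike-in counts).
cells = {
    "cell_high_mRNA": (9000, 1000),
    "cell_low_mRNA": (3000, 1000),
}
fractions = {name: spike_in_fractions(g, s) for name, (g, s) in cells.items()}
for name, (gene_frac, spike_frac) in fractions.items():
    print(f"{name}: gene fraction={gene_frac:.2f}, spike-in fraction={spike_frac:.2f}")
```

Since the two fractions always sum to 1, the spike-in fraction carries no additional information, which is why it could be dropped from the comparison.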

alex-d13 commented 2 years ago

"From my notes: A recent extension of the MuSiC framework (Sosina et al.) addresses the different mRNA contents of different cell types, enabling MuSiC to generate absolute scores that can be compared both between samples and between cell types."

I read the Sosina et al. paper. I think they only added the option to supply mRNA contents (they call it cell size), which are not calculated by MuSiC but provided externally. If I understand the formulas and the code of MuSiC correctly, MuSiC by default estimates cell size as the mean library size of all cells of cell type X.
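As a rough illustration of "mean library size of all cells of cell type X", here is a small sketch. It is not MuSiC's code; the function name and the toy count matrix are made up for this example.

```python
from collections import defaultdict


def mean_library_size_per_type(counts, cell_types):
    """Estimate a per-cell-type 'cell size' as the mean library size.

    counts:     list of per-cell count vectors (one list of gene counts per cell)
    cell_types: list of cell type labels, same order as counts
    """
    totals = defaultdict(float)
    n_cells = defaultdict(int)
    for cell, cell_type in zip(counts, cell_types):
        totals[cell_type] += sum(cell)  # library size of this cell
        n_cells[cell_type] += 1
    return {ct: totals[ct] / n_cells[ct] for ct in totals}


# Toy data: 4 cells x 3 genes (hypothetical values).
counts = [[10, 0, 5], [20, 5, 5], [2, 1, 0], [4, 0, 1]]
cell_types = ["Mono", "Mono", "NK", "NK"]
cell_sizes = mean_library_size_per_type(counts, cell_types)
print(cell_sizes)
```

An externally supplied cell size, as in the Sosina et al. extension, would simply replace the values in this dictionary.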

LorenzoMerotto commented 2 years ago

For deconvolution algorithms, mRNA bias is a bit of a grey area, since several methods do not address this issue clearly.

This is what I got from reading the various papers.

ABIS: the authors derived cell factors that were then used to correct the signature matrix, assuming that users will provide TPM-normalized counts. However, these factors can NOT be accessed: only the corrected version of the signature matrix is provided.

OMNIDECONV methods

MuSiC: as @alex-d13 mentioned, in the usual MuSiC model the cell sizes are estimated from the data considering the total number of counts. Sosina et al. just used their own estimates of cell sizes for the brain cells, and reported that the estimations got better. However, if we don't have the appropriate cell sizes we have to rely on the estimates made by the algorithm.
BseqSC: they consider the total number of counts per cell to scale the single cell counts.
SCDC: its model is similar to the MuSiC one so it should account for the different mRNA contents.
MOMF, Bisque: these methods use single-cell data to "correct" the bulk RNA-seq data. As the single-cell data should be provided as raw counts, they should account for mRNA bias as well.
CDseqR: this method accounts for mRNA bias by computing the average count level for each cell type from the reference gene expression profiles. This input is optional and, for example, is not currently available in omnideconv.
Scaden: since this method is based on deep learning, the hidden features may account for mRNA bias as well.

For the other omnideconv methods there is no mention of mRNA bias in the papers. Maybe they address it somehow, but we would have to go through the code.
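For orientation, one generic way a cell-type-specific mRNA (or cell size) factor can enter a deconvolution result is to divide the estimated mRNA fractions by the factors and renormalize. The sketch below only illustrates that idea and is not the exact procedure of any tool listed above; the function name and all numbers are hypothetical.

```python
def correct_fractions(mrna_fractions, scaling_factors):
    """Convert mRNA fractions to cell fractions using per-cell-type factors.

    Each fraction is divided by its cell type's mRNA scaling factor, and the
    results are renormalized so that they sum to 1 again.
    """
    scaled = {ct: frac / scaling_factors[ct] for ct, frac in mrna_fractions.items()}
    total = sum(scaled.values())
    return {ct: value / total for ct, value in scaled.items()}


# Hypothetical example: monocytes contribute twice as much mRNA per cell as NK cells.
mrna_fractions = {"Mono": 0.6, "NK": 0.4}
scaling_factors = {"Mono": 2.0, "NK": 1.0}
cell_fractions = correct_fractions(mrna_fractions, scaling_factors)
print(cell_fractions)
```

Without the correction, the cell type with the higher mRNA content per cell would be overestimated when the result is read as cell fractions.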

alex-d13 commented 2 years ago

Hi all,

Francesca and I analyzed the mRNA bias that we suspected to be present in count data. Our reasoning: if the bias we add later on is itself derived from count data (as with spike-ins, where we calculate the ratio of spike-in counts over all counts), then an mRNA bias must already be contained in the count data. This would also mean that CPM/TPM data contains no (or less) bias, since it gets lost in the normalization.

Setup

I will briefly describe the setup we used to test this assumption:

Two datasets were used: Hao (10x) and Travaglini (Smart-seq2). Each dataset provided us with two matrices: a CPM/TPM matrix and a count matrix.
Simulations with 300 samples and 1000 cells were performed using these datasets, with random cell type fractions in each sample.
The cell types used for the simulation were:
Hao: DCs, TCD4, TCD8, B, Mono, NK, Tregs
Travaglini: DCs, TCD8, TCD4, B, Mono, NK, Macro
For simplicity (and because of weird behavior of EPIC on TCD4), we used quanTIseq without mRNA scaling as the deconvolution tool.
Because deconvolution tools did not perform well on raw count data, we scaled the count-based simulations to 1e6; samples from these simulations are called cpm(bulk_counts).
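A stripped-down sketch of this kind of setup (random fractions, sampling cells, summing them into a pseudo-bulk sample, scaling to 1e6) could look as follows. This is not SimBu's implementation: all function names and the toy data are made up, the way the random fractions are drawn is just one simple choice, and treating 1000 as the number of cells per sample is an assumption.

```python
import random


def random_fractions(cell_types, rng):
    """Draw random cell type fractions that sum to 1 (one simple choice)."""
    draws = [rng.random() for _ in cell_types]
    total = sum(draws)
    return {ct: d / total for ct, d in zip(cell_types, draws)}


def simulate_pseudobulk(cells_by_type, fractions, n_cells, rng):
    """Sample n_cells cells (with replacement) according to the given
    fractions and sum their counts gene by gene."""
    types = list(fractions)
    weights = [fractions[ct] for ct in types]
    n_genes = len(next(iter(cells_by_type.values()))[0])
    bulk = [0] * n_genes
    for ct in rng.choices(types, weights=weights, k=n_cells):
        cell = rng.choice(cells_by_type[ct])
        bulk = [b + c for b, c in zip(bulk, cell)]
    return bulk


def cpm(counts):
    """Scale a count vector so that it sums to 1e6."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]


rng = random.Random(0)
# Toy reference: 2 cell types, 2 cells each, 3 genes (hypothetical counts).
cells_by_type = {"B": [[5, 1, 0], [6, 0, 1]], "NK": [[1, 4, 2], [0, 5, 3]]}
fractions = random_fractions(list(cells_by_type), rng)
bulk_counts = simulate_pseudobulk(cells_by_type, fractions, 1000, rng)
cpm_bulk_counts = cpm(bulk_counts)  # corresponds to "cpm(bulk_counts)" above
```

Repeating the last three lines 300 times with fresh fractions would give one simulated dataset.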

To check whether count data really does contain a bias, we tried to remove it by dividing the count matrix by a scaling factor. Two options were tested: number of reads per cell (_biasremoved (Reads)) and number of expressed genes per cell (_biasremoved (Genes)). We also did one run in which the bias was not removed (_biaskept). Because we only suspect this bias in count data, not in CPM/TPM data, we did not remove it there.
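The two removal options can be sketched like this, assuming that each cell is divided by its own factor. The function name, the `mode` argument, and the toy matrix are hypothetical and are not part of SimBu.

```python
def remove_bias(count_matrix, mode="reads"):
    """Divide each cell's counts by a per-cell scaling factor.

    count_matrix: list of per-cell count vectors
    mode="reads": factor is the total number of reads (counts) in the cell
    mode="genes": factor is the number of expressed genes (count > 0) in the cell
    """
    scaled = []
    for cell in count_matrix:
        if mode == "reads":
            factor = sum(cell)
        elif mode == "genes":
            factor = sum(1 for c in cell if c > 0)
        else:
            raise ValueError(f"unknown mode: {mode}")
        if factor == 0:
            # Empty cell: nothing to scale, keep it as zeros.
            scaled.append([0.0] * len(cell))
        else:
            scaled.append([c / factor for c in cell])
    return scaled


# Toy matrix: 2 cells x 3 genes (hypothetical counts).
count_matrix = [[8, 2, 0], [1, 1, 2]]
by_reads = remove_bias(count_matrix, mode="reads")
by_genes = remove_bias(count_matrix, mode="genes")
print(by_reads)
print(by_genes)
```

After scaling by reads, every non-empty cell sums to 1, so differences in total counts between cell types no longer carry over into the simulated bulk.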

Results

We can see that scaling by reads indeed removes an internal bias in the count data. See how the estimates of Macrophages and NK cells in Travaglini, or of Monocytes in Hao, are less overestimated when comparing the 2nd and 4th panels. This means that a cell-type bias was present in the counts and was removed. Scaling by genes (panel 3) does not seem to remove any bias from the counts. We also see that there is no bias present in the CPM/TPM data.

Next steps

Now that we know that counts contain a bias, we want to add a parameter scale_singlecell_counts to SimBu, which (by default) removes this bias first. We will also have to check whether the simulations still follow the NB distribution and how a scaling factor added later on will influence deconvolution results.

This is quite a lot to read. If any questions come up about our setup or results, please let me know :) Looking forward to your comments, @mlist!

Cheers, Alex

omnideconv / SimBu

mRNA bias in SimBu #19
