omnideconv / SimBu

Simulate pseudo-bulk RNAseq samples from scRNAseq expression data
http://omnideconv.org/SimBu/
GNU General Public License v3.0

Count diagnostics #11

Closed · alex-d13 closed this issue 2 years ago

alex-d13 commented 2 years ago

Hi,

Francesca pointed me towards a chapter we have not really covered yet with the simulator: comparing the simulated expression counts with true bulk RNAseq counts, i.e. checking whether the simulations make sense at all and follow real-world distributions (for both raw and TPM/CPM counts).

Setup

To check this, I used the Finotello bulk PBMC dataset with FACS cell-type fractions as the "true samples"; it contains both raw counts and TPM counts. I then used the Travaglini (Smart-seq2, lung) and Hao (10x, PBMC) datasets to simulate samples that follow these FACS cell-type fractions and have the same sequencing depth as the true samples, applying the census mRNA bias in these simulations. I used both datasets twice: once with their raw counts and once with their TPM (CPM for 10x) counts. So the TPM/CPM datasets are used to simulate TPM/CPM samples (and likewise for raw counts). That way we can compare true counts with simulated counts and true TPM with simulated TPM/CPM. [Note: in all other analyses I have done so far, I always used this TPM approach, because the deconvolution tools require TPM data.]
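For reference, this is roughly how such a run could look with SimBu's custom scenario. A minimal sketch: the object names (`cell_annotation`, `hao_counts`, `facs_fractions`, `sample_depths`) are placeholders, and the exact argument names may differ between SimBu versions:

```r
library(SimBu)

# Build a SimBu dataset from the scRNAseq count matrix and a cell annotation
# data.frame (columns "ID" and "cell_type")
ds <- SimBu::dataset(annotation   = cell_annotation,
                     count_matrix = hao_counts,   # genes x cells, raw counts
                     name         = "hao")

# Simulate pseudo-bulk samples that follow the FACS fractions of the true
# bulk samples, using census as the mRNA-bias scaling factor
sim <- SimBu::simulate_bulk(data                 = ds,
                            scenario             = "custom",
                            custom_scenario_data = facs_fractions, # samples x cell types
                            scaling_factor       = "census",
                            total_read_counts    = sample_depths,  # per-sample depth
                            nsamples             = nrow(facs_fractions),
                            ncells               = 1000)
```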

A short note on the raw counts & sequencing depth:

When I sample the cells based on a sequencing depth x, I use the "number of reads per cell" values provided by the studies. Each cell type may contribute x * (cell-type fraction) reads to the total depth, so I sample cells from one cell type until this contribution is "filled up". If no scaling factor is used, this works great, and each pseudo-sample reaches - approximately - the wanted sequencing depth by simply summing up the counts of all sampled cells. However, if a scaling factor is used (like census), the counts are multiplied by some factor before I sample them (note that the depth is still technically correct), and the resulting pseudo-sample will have extremely high count values when summed up. Francesca mentioned that I could rarefy these pseudo-samples back down to the desired sequencing depth. With that we get pseudo-samples with integrated scaling factors and raw counts that meet the target sequencing depth per sample. I used phyloseq for this (see the sketch below).
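To illustrate the rarefying step, a minimal sketch using phyloseq's `rarefy_even_depth()`; `pseudo_counts` (genes x samples, census-scaled) and `target_depth` are placeholder names:

```r
library(phyloseq)

# Wrap the scaled pseudo-bulk matrix in a phyloseq object;
# rarefying works on integer counts, so round first
ps <- phyloseq(otu_table(round(pseudo_counts), taxa_are_rows = TRUE))

# Subsample each pseudo-sample down to the desired sequencing depth
ps_rar <- rarefy_even_depth(ps,
                            sample.size = target_depth, # depth of the matched true sample
                            rngseed     = 42,           # reproducible subsampling
                            replace     = FALSE)

rarefied_counts <- as(otu_table(ps_rar), "matrix")
```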

For TPM/CPM counts this problem does not exist, because they will be normalized to a total of 1e6 per sample in the end anyway.
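For completeness, the CPM normalization mentioned here is just a per-column rescaling (a sketch; `raw_counts` is a placeholder genes x samples matrix):

```r
# Scale every sample (column) so that its counts sum to 1e6
cpm <- t(t(raw_counts) / colSums(raw_counts)) * 1e6
```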

Results

Raw counts

Travaglini

*(two figures attached)*

Hao

*(two figures attached)*

TPM/CPM

Travaglini

*(two figures attached)*

Hao

*(two figures attached)*

Raw vs TPM counts

I also compared how the raw and TPM counts of a single sample relate to each other. I used one replicate from the Hao and Travaglini simulations and the one true bulk sample on which these simulated samples are based (cell-type fractions & sequencing depth):

*(figure attached)*
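In case it helps to reproduce this, a sketch of such a comparison for one sample; `raw_vec` and `tpm_vec` are placeholder per-gene vectors of the same sample:

```r
# Log-log scatter of raw counts vs. TPM per gene for a single sample;
# the pseudocount of 1 avoids log(0)
plot(log10(raw_vec + 1), log10(tpm_vec + 1),
     pch = 16, cex = 0.3,
     xlab = "log10(raw count + 1)", ylab = "log10(TPM + 1)")
```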

NB distribution

Count data should follow a negative binomial (NB) distribution. As Francesca suggested, I can test this by comparing the mean with the variance per gene for the true and simulated datasets.
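A minimal sketch of this per-gene mean-variance diagnostic (assuming a genes x samples count matrix `counts`, placeholder name):

```r
# For NB-distributed counts the per-gene variance should follow
# var = mu + mu^2/size, i.e. lie above the Poisson line var = mu
gene_mean <- rowMeans(counts)
gene_var  <- apply(counts, 1, var)

plot(log10(gene_mean + 1), log10(gene_var + 1),
     pch = 16, cex = 0.3,
     xlab = "log10(mean + 1)", ylab = "log10(variance + 1)")
abline(0, 1, col = "red")  # Poisson reference: variance = mean
```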

> Can we conclude from this that TPM values are generally better suited?

It looks like it here again; at least the counts seem to be more in line with the true biological samples.

image

alex-d13 commented 2 years ago

Updated "raw vs TPM" and "NB distribution" sections above.

mlist commented 2 years ago

This analysis was a really good idea and the results are interesting. It works comparably well for the Travaglini TPM values. Can we conclude from this that TPM values are generally better suited? I would not have expected gene-length normalization to play a big role here, but was any other normalization considered?

Regarding the last plot, I'm also puzzled. There is a mean/variance trend we expect: lowly expressed genes have a higher variance than highly expressed ones, which means you should observe the inverse trend, at least at the level of genes. Granted, this is at the level of whole samples, but I'm still a bit confused. That we don't see any change in the simulator is really weird.

mlist commented 2 years ago

Hmm now you see changes in mean and variance. Did you find a bug there?

alex-d13 commented 2 years ago

> Hmm now you see changes in mean and variance. Did you find a bug there?

Yep, I just used the cell-type fractions & sequencing depth of one sample... so I only showed the 10 technical replicates.

mlist commented 2 years ago

Ok, I guess I had an error in my thinking. If you have larger overall expression values, the deviations can also be larger; in percentage terms they would be the same.

alex-d13 commented 2 years ago

Sorry, ignore this plot for now; I should have calculated the mean/variance per gene, not per sample as I did. Will update it!

alex-d13 commented 2 years ago

Updated the last section again.

mlist commented 2 years ago

That looks quite good for the Travaglini dataset. The overdispersion matches the real data perfectly.

FFinotello commented 2 years ago

Agree with @mlist!

Hao data (CPM-based) seems more Poisson than NB... I am wondering why.

But overall the results are very nice, for both TPM and counts. This is very important because most 1st-gen tools are based on TPM, while most 2nd-gen tools are based on counts. We can systematically test all scenarios.

Two questions on my side:

All in all... very good job, @alex-d13 !

alex-d13 commented 2 years ago

> Shall we also consider Hoek's bulk dataset for testing?

I quickly added it to the mean-variance and raw vs TPM plots. I am not sure the density and scatter plots above would change much, since I would only change the sequencing depth and cell-type fractions based on one Hoek sample (which, by the way, has fewer cell types annotated, so Travaglini will probably get noisier).

*(two figures attached)*

I would maybe already start writing up some of these steps (like the rarefying and the sample setup) in a new methods chapter.