Sanity check: mRNA contents

grst commented 3 years ago

Our simulation strategy assumes that the bias in mRNA contents is reflected in the number of reads in the FASTQ files. For instance we assume that the average FASTQ file of a macrophage has more reads than the average FASTQ file of a T cell.

@alex-d13, to verify this, could you please make a plot for one of the datasets where you plot the cell-type against the number of reads per cell?

alex-d13 commented 3 years ago

@grst I did that for one dataset of the placenta (Vento-Tormo et al Nature). Looks like Macrophages really have much more reads than all other cell types :) Vento-Tormo_celltypes

grst commented 3 years ago

Thanks!

In a table listing mRNA contents (which I currently don't find any more :/) I believe the differences were higher. @ffinotello, can you please remind me how you retrieved the normalization factors for quanTIseq?

And in this paper, they also dealt with mRNA content biases: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6367568/

Let's further discuss that tomorrow!

FFinotello commented 3 years ago

Indeed, great point. I totally agree! And we should keep in mind that the number of reads per cell might depend on:

The state of the cell (e.g., whether it is activated, cycling, etc...)
The technology used for the sequencing (e.g. 10x Genomics, Smart-Seq2) - not so relevant for the simulation, but for the simulation-based benchmarking.

@grst maybe we could make the same plot on the 10x and Smart-Seq2 data from Maynard and the other NSCLC datasets? We could also start from the Maynard dataset as toy example, especially now that you have some curated cell-type annotations.

As for the plots, @alex-d13 could you use boxplots, so we can also see the variability?

As for quanTIseq scaling factors, they were computed as the median expression of the Proteasome Subunit Beta 2 (PSMB2) housekeeping gene and are available (also) through immunedeconv (TIL10_mRNA_scaling.txt).

EPIC has some similar scores (mRNA_cell_default.rda).

There is also this document by Miltenyi Biotec with the expected RNA yield per cell type that could be useful.

Cheers, Francesca

grst commented 3 years ago

Thanks, Francesca!

This was exactly the table I meant:

There is also this document by Miltenyi Biotec with the expected RNA yield per cell type that could be useful.

Here's the plot for the maynard dataset. median counts as csv

It seems there are already some inconsistencies:

In the milteny table, NK cells have lower mrna content than T cells
in our read-count analysis and quanTIseq they seem to have more.

@alex-d13, could you maybe collate the scaling factors from quanTIseq, EPIC, Milteny, Vento-Tormo, and Maynard and compare them (scatterplot or whatever makes sense)

FFinotello commented 3 years ago

If I remember well, the NK scaling factors from quanTIseq and EPIC did not agree neither.

@grst what about checking the number of reads per cell type considering your cell type annotations, but verifying them separately in the various datasets, i.e. not only Maynard but also the 10x ones?

alex-d13 commented 3 years ago

I created the plot to compare the scaling factors of the different sources we have right now. For this paper I did not find their raw scaling values, they only provides plots with them (fig 5C):

And in this paper, they also dealt with mRNA content biases: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6367568/

I basically scaled all values between 0.1 and 1.1 and show 9 different cell-types. I am not sure if that re-scaling makes perfect sense though..For Maynard & Vento-Tormo i used the mean read-count per cell-type as "scaling factor".

Seems like T-cells (and somewhat B cells) have similar scaling factors, while NK cells, Macrophages & Monocytes differ quite a lot.

grst commented 3 years ago

How did you do the scaling? I'd have expected every source to have values in [0.1, 1.1] in that case. And which annotation for Maynard are you using? There should be NK cells and Monocytes...

alex-d13 commented 3 years ago

Hi, yes the missing bars are my fault, these are sources with no annotation for that cell-type. Thats why they have values of 0 (or NA). I used this function for scaling: range01 <- function(x){(x-min(x))/(max(x)-min(x))}, and added +0.1 to all non-NA values. For Maynard you are right, was a typo on my end. Should I separate "Monocyte non-conventional" & "Monocyte conventional"?

Edit: updated the plot above.

grst commented 3 years ago

I'd have expected every source to have values in [0.1, 1.1] in that case.

What I meant is that milteny appears to have valuese only in [0.1, 0.4] making them harder to compare. Is it because some of the cell-types available in milteny are not shown?

Could you please also make a pairwise scatterplot (each dot is a cell-type, each panel a comparison of two sources)? That avoids the problem of scaling in the first place.

Should I separate "Monocyte non-conventional" & "Monocyte conventional"?

just merge them, but I see you already did that anyway :)

FFinotello commented 3 years ago

You could also take a cell type that is common to all methods as a reference, e.g. monocytes. That cell type will have a value of 1, and the others will be scaled accordingly. So if, for a method, scores were 10 for monocytes and 150 for DCs, they would become 1 for monocytes (reference) and DC 15 (150/10).

alex-d13 commented 3 years ago

What I meant is that milteny appears to have valuese only in [0.1, 0.4] making them harder to compare. Is it because some of the cell-types available in milteny are not shown?

Yes, thats true. Initially I first did the rescaling and the selected the cell-types. I changed the barplot above where I now first selected the types and then scaled between 0.1 and 1.1.

Could you please also make a pairwise scatterplot (each dot is a cell-type, each panel a comparison of two sources)? That avoids the problem of scaling in the first place.

I made this plot here, could be interesting for example to compare monocytes from Maynard against others: scaling_factors_scatter

@FFinotello I also tried out your suggestion, with having monocytes as a reference: scaling_factors_Monocytes

grst commented 3 years ago

wow, that's pretty bad (not the plots, they are great now :wink: ). Almost no agreement whatsoever!

FFinotello commented 3 years ago

@alex-d13 could you try another thing, so we have more comparisons? For methods that have subtypes (CD4, CD8, Treg for T cells, M1 and M2 for macrophages) you could also take their median to compare with the overall class (i.e., T cell and macrophages in the examples before). For methods that do not have the subtypes, you could assign the major type (e.g. T cell) to all teh subtypes (e.g. CD4, CD8, Treg).

Last questions: Why are many scores not available for Vento-Tormo dataset? And are DC or neutrophil scores available for any of the other methods?

alex-d13 commented 3 years ago

Ok, reworked the plots (upadated them above):

I removed some duplicate comparisons from the scatterplots.
For Vento-Tormo I assigned "T cells CD4", "T cells CD8" & "T regulatory cells" values with the "T cells" value; I also added the mean values of their assignments of "NK cells" & "Macrophages" sub-clusters.
For quantiseq "T cells" value i took the mean of the T cell-subtypes.
For miltenyi I added CD4 and CD8 T cells. They provide values for Myeloid (2.5) and Plasmacytoid DC (0.7), is there one which makes more sense to pick? Taking the mean here, is not so nice I feel like..

I also added plots with NK cells and T cells as reference for re-scaling (https://github.com/alex-d13/simulator/blob/main/plots/scaling_factors_NK%20cells.png & https://github.com/alex-d13/simulator/blob/main/plots/scaling_factors_T%20cells.png).

grst commented 3 years ago

One thing that I find weird about the milteny values:

How can T-cells have a higher mRNA content than any of its subtypes? Is it for undifferentiated T-cells?

FFinotello commented 3 years ago

Thanks, @alex-d13. I think there is some (qualitative) agreement finally on what should be expanded or reduced to account for mRNA bias.

I agree with @grst that the Miltenyi values are weird. I have to investigate that, to be sure we are not misinterpreting something. @grst would it be possible for you to remake the boxplots of the Maynards dataset with our final annotations (also with myeloid cell subsets)?

I was seeking the scaling factors from Monaco et al. paper to include them in the bar plots, but I could not find them. Maybe you guys are luckier than me ;)

I am also wondering whether, for single-cell studies, the average value is representative enough

FFinotello commented 3 years ago

The Census method from Trapnell lab could inspire us to derive total mRNA from transcript TPM, useful for this assessment and possibly also for correcting mRNA bias in deconvolution results.

Given the TPM value for a single transcript in cell i, it is straightforward to convert all relative abundances to their lysate transcript counts. We estimate the total number of mRNAs captured for cell i:

where 𝐹𝑋𝑖 represents the cumulative distribution function for the TPM values for cell i, ɛ is a TPM value below which no mRNA is believed to be present (by default, ɛ =0.1), and ni is the number of genes with TPM values in the interval (ɛ,x*i). That is, we simply calculate the total number of single-mRNA genes and divide this number by the fraction of the library contributed by them to estimate the total number of captured mRNAs in the cell.

grst commented 3 years ago

I was seeking the scaling factors from Monaco et al. paper to include them in the bar plots, but I could not find them. Maybe you guys are luckier than me ;)

It's already part of their signature matrix: "We generated normalized signature matrices for a set of immune cell types that were found to be suitable for RNA-seq and microarray deconvolution of PBMC samples, respectively."

When summing up the sigmat I get the following scaling factors:

import numpy as np, pandas as pd
np.sum(pd.read_csv("https://raw.githubusercontent.com/giannimonaco/ABIS/master/data/sigmatrixRNAseq.txt", sep="\t"), axis=0)


Monocytes C        22824.322779
NK                 21456.905980
T CD8 Memory       17827.199342
T CD4 Naive        14262.064545
T CD8 Naive        10660.952413
B Naive            24162.648934
T CD4 Memory       15682.087676
MAIT               13365.625992
T gd Vd2           28128.898864
Neutrophils LD      9546.742675
T gd non-Vd2       25008.840273
Basophils LD        8610.360645
Monocytes NC+I     22344.663722
B Memory           20837.569855
mDCs               57322.183941
pDCs               28488.194770
Plasmablasts      325800.989149
`

grst commented 3 years ago

Maynard plots with most recent annotation

read_count_per_cell_type.csv read_count_per_cell_type_coarse.csv

omnideconv / SimBu

Sanity check: mRNA contents #3