Closed alex-d13 closed 2 years ago
Hi @alex-d13 and thanks for organizing this material.
Here are my first comments and suggestions for the various figures:
Graphical abstract I will comment on this later, once you have gathered some feedback from the rest of the team. I am wondering if more panels could be beneficial, e.g. to explain mRNA bias (see panel c in this fig for example), represent possible simulation scenarios (e.g. even, random), or illustrate other features of SimBu (any ideas?).
Diagnostics I am curious to see the results on the Tabula Muris data. I'd say we can eliminate panel (b): it was qualitatively insightful for us but might be misleading for readers/reviewers.
A more general comment for the whole paper: "TPM/CPM" could get interpreted as a ratio. More generally, alternating between "TPM" and "CPM" could be confusing. I would suggest defining CPM and TPM, and the reason for this discrepancy (i.e. 10x vs. Smart-seq2 data), as soon as possible in the ms, and re-defining them jointly as "normalized counts" (or something similar). I would then use this term throughout the ms, if a greater level of detail is not requested.
Extreme mRNA bias
Maybe we could split this figure into two panels: a. Extreme bias -> it illustrates well the matter we are introducing (i.e. high mRNA content -> overestimation). I would only report quanTIseq, EPIC, and CIBERSORT-absolute results. b. Real-world mRNA bias, as in the latest figures you made. Maybe we could consider only quanTIseq, with and without mRNA correction. We could show one dataset in the main text and put the other ones in the supplementary.
mRNA bias comparison
This could be the first figure we introduce in the ms after the graphical abstract. I agree with Alex's comments. I would definitely discard Vento-Tormo's data and the Miltenyi factors, which we could not interpret. For the spike-in-related measures, I would only keep the one we use for introducing the bias in our simulation. More generally, in this plot we could consider all the possible scaling factors (Census, totcounts, nfeatures...) and refer to them with exactly the same names we use in the parameter settings. The purpose of this figure is to validate data-driven approaches on cell types for which experimentally-derived factors are available. The beauty is that they can also be applied to cell types for which experimentally-derived factors are not available. For validation purposes, I agree with Alex that we should highlight the experimentally-derived factors somehow: EPIC and Monaco, correct? Maybe we could use a square around their columns and rows? For the datasets, I would write the tissue and technology in brackets. Regarding this, do we have access to a PBMC Smart-seq2 dataset as well? Question: what are the colors and numbers representing? This should be made clear in the plot. We should put the scatterplots underlying these correlations in the supplementary.
Additional plots
We should show somewhere the mRNA bias distribution (violin plots) for the various single-cell datasets in terms of nfeatures and totcounts. But I guess this would be more appropriate for the supplementary.
In the immunedeconv paper (supplementary?), Gregor made a plot comparing pseudobulk and bulk obtained from the same sample. @grst which was the dataset on which we had single-cell and bulk RNA-seq from the same samples? Do we still have it? Such a comparison would be very informative and could be put in the same figure as the NB distribution. If there are some cell types with high mRNA content, maybe we would also see some differences between pseudobulk data generated with and without the bias.
I will think about the order of the figures and whether we need to "visualize" some other important messages. I will get back to you in the next few days.
I agree that we should resolve the TPM/CPM ambiguity early on. But "normalized counts" is misleading, since that term is normally used for e.g. DESeq2-normalized counts. I would just say that we use TPM throughout the manuscript, even though CPM is used when we are dealing with 10x data, where gene-length bias is not an issue.
"To account for differences in sequencing depth and gene length, we use transcripts per million (TPM) throughout this manuscript. Note that for 10X data where a gene length bias is not of concern we use counts per million instead but also refer to this as TPM)."
Alternatively we could say normalized expression values (NEV).
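To make the distinction concrete, here is a small illustrative Python sketch (all counts and gene lengths are made-up toy values, not our data): CPM only rescales to library size, while TPM additionally normalizes by gene length before rescaling.

```python
# Toy sketch of the TPM vs. CPM distinction discussed above.
# Counts and gene lengths are invented for illustration.

def cpm(counts):
    """Counts per million: scale counts so the sample sums to 1e6."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]

def tpm(counts, lengths_kb):
    """Transcripts per million: normalize to reads per kilobase first,
    then scale so the sample sums to 1e6."""
    rpk = [c / l for c, l in zip(counts, lengths_kb)]
    total = sum(rpk)
    return [r / total * 1e6 for r in rpk]

counts = [100, 200, 300]       # toy counts for three genes
lengths_kb = [1.0, 2.0, 3.0]   # toy gene lengths in kilobases

cpm_values = cpm(counts)               # differ across genes
tpm_values = tpm(counts, lengths_kb)   # equal here, since counts scale with length
```

Both vectors sum to one million; for 10x data, where gene length does not enter, only the CPM-style scaling would be applied.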
which was the dataset on which we had single-cell and bulk RNA-seq from the same samples? Do we still have it?
That's Schelker et al. https://www.nature.com/articles/s41467-017-02289-3. It should be part of the data archive of my benchmark pipeline. If not, it definitely was available as described in the manuscript. It's only 3 samples though.
We also have 10x data and bulk RNA-seq of the organoids at ICBI (6 samples), but it's still unpublished and I doubt it will be before the simulator.
mRNA bias comparison
One question on the heatmap setup: should we keep the rows/columns in alphabetical order (right) or ordered by hierarchical clustering based on the correlations (left)? I almost prefer the alphabetical order; it keeps things a little more organized.
Question: What are the colors and numbers representing? This should be made clear in the plot.
Colors are Pearson's correlation coefficients; numbers are the count of matching cell types for each correlation calculation. I will add a small custom legend for this once we have decided on a design.
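For reference, this is roughly what each heatmap cell amounts to (an illustrative Python sketch with invented scaling-factor values, not our real data, and not the actual plotting code): Pearson's r computed over the cell types shared by two sets of scaling factors, plus the count of those shared cell types.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def compare_factors(f1, f2):
    """Correlate two scaling-factor dicts on their shared cell types only.
    Returns (r, number of matching cell types)."""
    shared = sorted(set(f1) & set(f2))
    x = [f1[ct] for ct in shared]
    y = [f2[ct] for ct in shared]
    return pearson(x, y), len(shared)

# toy factors; real values would come from e.g. EPIC or totcounts
experimental = {"B": 0.4, "T CD4": 0.4, "T CD8": 0.4, "NK": 0.45, "Mono": 1.4}
data_driven = {"B": 3000, "T CD4": 2800, "T CD8": 2900, "Mono": 9000, "DC": 7000}

r, n_shared = compare_factors(experimental, data_driven)
```

This also makes explicit why the number printed in each cell matters: with only a handful of shared cell types, r can be high by chance.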
We should put in the supplementary the scatterplots underlying these correlations.
I agree, though this figure might become quite messy with so many plots at once..
We should show somewhere the mRNA bias distribution (violin plots) for the various single-cell datasets in terms of nfeatures and totcounts. But I guess this would be more appropriate for the supplementary.
Could be something like this, but I am not so happy with it yet..
Update on the count diagnostics:
I used the integrated 10x and Smart-seq2 (SS2) Tabula Muris (TM) spleen dataset that Lorenzo provided (thank you @LorenzoMerotto :) ). Using this, I created 10 simulations (= 10 replicates) with the same cell type fractions as in a true bulk dataset with FACS annotation (Petitprez, 4 spleen samples). So we get a matrix with 40 samples. Then I compare the mean and variance on a gene level between these two simulations and the true bulk dataset. There is also a human PBMC true bulk dataset (Finotello) in the plot, just for comparison; it did not influence the simulation setup.
Note that I am comparing only the count data here, and I removed the cell type bias prior to the simulation sampling using the number of mapped reads per cell. Also, a minor note: TM SS2 has only 3 and TM 10x only 4 cell types that match the true bulk dataset ("B cells", "T cells CD8", "T cells CD4") and ("NK cells", "B cells", "T cells CD8", "T cells CD4"), respectively. But I don't think that has a huge influence on this setup.
While this figure looks quite similar to the one where I used Travaglini (SS2) and Hao (10x) as basis for the simulation (see below), the dispersion values in the TM setup are much closer to the true bulk dataset!
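For readers of the thread, the per-gene diagnostic boils down to something like the following rough Python sketch (not SimBu's actual implementation; toy counts): for each gene, compute mean, variance, and the coefficient of dispersion (variance / mean) across samples, so simulated pseudobulk can be compared against true bulk.

```python
# Rough sketch of the count diagnostic described above (illustrative only).

def per_gene_stats(matrix):
    """matrix: list of samples, each a list of per-gene counts.
    Returns (means, variances, dispersions), one value per gene."""
    n_samples = len(matrix)
    n_genes = len(matrix[0])
    means, variances, dispersions = [], [], []
    for g in range(n_genes):
        vals = [sample[g] for sample in matrix]
        m = sum(vals) / n_samples
        v = sum((x - m) ** 2 for x in vals) / (n_samples - 1)  # sample variance
        means.append(m)
        variances.append(v)
        dispersions.append(v / m if m > 0 else float("nan"))
    return means, variances, dispersions

# toy example: 3 samples x 2 genes
simulated = [[10, 100], [12, 80], [8, 90]]
means, variances, dispersions = per_gene_stats(simulated)
```

A dispersion near 1 would indicate Poisson-like counts; values well above 1 indicate the overdispersion that the NB distribution is meant to capture.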
I think using this tabula muris setup to show that SimBu can simulate counts correctly works pretty well for the manuscript :)
Of course it would make more sense to compare the tabula muris datasets with a true bulk dataset from mouse, not human. I did this and updated the post above.
Hi @alex-d13 thanks for the plots. They look great! If you have enough samples, you could maybe just select the spleen ones from the Petitprez dataset. The plot and dispersion estimate should be more meaningful in this case.
That's what I did just now :D see the updated plots above, where I use the 4 spleen samples in Petitprez.
Hi Alex, it looks great.
A quick check: did you use the custom scenario specifying the same cell fractions for both SS2 and 10x? Were you using the same cell types?
No, I didn't use the same fractions, because the SS2 dataset has no NK cells. But for the other cell types the proportional differences are the same; I just rescaled the values to sum up to 1.
In this case I would use the same fractions, discarding non-matching cell types if necessary. That way we can really show that the distributions are similar.
That's Schelker et al. https://www.nature.com/articles/s41467-017-02289-3. It should be part of the data archive of my benchmark pipeline. If not, it definitely was available as described in the manuscript. It's only 3 samples though.
@grst these are 3 bulk RNA-seq samples... and are the single-cell data from the same 3 samples? Which cell types are annotated?
Yes, there are single-cell data (it's one of 3 datasets they integrated for their analysis). They annotated the major immune cell types, fibroblasts, and epithelial cells, if I remember correctly.
I just looked at that dataset, but they only provide TPM values, no count data. I am not sure that our setup works correctly without counts though.. In the meantime, Lorenzo provided me with an additional mouse bulk dataset (Wuaiping) including a few spleen samples. Here is an updated plot for the count diagnostics; all simulations use the Tabula Muris dataset. Row 1 is based on the Petitprez cell type fractions, row 2 on the Wuaiping dataset:
Would you say this figure is sufficient for the manuscript to show that SimBu can simulate samples correctly?
This would be a draft for the graphical abstract. a shows the simulation setup, b shows the scenarios and c the idea of a scaling factor. Let me know what you think about it :)
Very nice! One quick question: does the custom scenario need to have the same composition in all samples? This might be confusing.
Updated above :)
Hi @alex-d13 wonderful work!
Some quick answers and suggestions on the points you raised above (sorry for my late feedback).
One question on the heatmap setup... At first, I was undecided, but now I think the clustering helps to better see similarities like Census vs. n_genes. I find the numbers a bit confusing, as one might be more interested in the correlation values. If you are using corrplot, maybe you could report the correlation values and encode the number of common cell types in the circle size. As for the scatterplots, they are indeed a lot. Maybe you could show the single-cell data-driven ones vs. the experimental bulk ones (EPIC and Monaco) as a sort of validation.
We should show somewhere the mRNA bias distribution (violin plots)... Have you tried violins instead of box plots? You could sort them by median totcounts per cell type. We could prepare these plots for all datasets (including the mouse ones) and then select at the end what to show in the main text or supplementary.
@alex-d13 what are the coeff of dispersion when you consider only bio replicates (i.e. no tech replicates)? How do they relate to the real ones?
Added them to the plot.
This would be a draft for the graphical abstract. a shows the simulation setup, b shows the scenarios and c the idea of a scaling factor. Let me know what you think about it :)
I like the new graphical abstract. Some comments:
Hi,
as discussed, here is a collection of figures to include in the manuscript and supplementary. The final manuscript has a word limit of 5000 and a page limit of 7, including all figures and tables.
Simulation setup
TODO: use 'real' cell type images instead of generic shapes.
Diagnostics
Something like this from the thesis, but with the new Tabula Muris dataset we will have.
Results
Extreme mRNA bias
I think a plot where we add an extreme bias to some cell types would work nicely to help describe the features of SimBu. Something similar to this from the thesis (maybe not with all the deconvolution methods and cell types that are present here). We could focus on those cell types where the impact of an extreme scaling factor is easy to spot, like NK, B or T CD4.
mRNA bias comparison
I did this large correlation heatmap in my thesis, but I think if we use experimental scaling factors like EPIC, we should highlight them in some way. I think we already decided to remove the Vento-Tormo dataset. I would then only keep Travaglini, as it's Smart-seq2 and has spike-ins, and Hao (10x). Of course we would only keep the spike-in row that SimBu is also using. When the Tabula Muris dataset is ready we can also take a look at how it compares, but it is still non-human data, so that has to be mentioned. Also, the number of comparable cell types is an issue in this plot, as sometimes a correlation coefficient is calculated with only 5-7 values. So overall I am not really happy with this figure, but I am also not sure of alternatives.
One more..
I feel like one more 'final' figure would be nice; would you have any ideas? @mlist @FFinotello