Simulation Strategy - Githubissues

omnideconv / SimBu

Simulate pseudo-bulk RNAseq samples from scRNAseq expression data

http://omnideconv.org/SimBu/

GNU General Public License v3.0


Simulation Strategy #2

Closed alex-d13 closed 2 years ago

alex-d13 commented 3 years ago

Overall strategy: sample all reads of cells from a list of cell IDs with known cell type

Set up a simple database (e.g. SQLite; even a flat text file would do) that stores: cell ID, path to the FASTQ files, cell-type annotation
Sample n cells from the list in the desired cell-type composition
Simply concatenate all the FASTQ files of the sampled cells
Run salmon/kallisto on the concatenated reads to obtain TPM values
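
The first three steps above could be sketched roughly as follows. This is only a minimal illustration, not SimBu's implementation: the table name `cells`, its column names, and the helper functions `build_cell_db`, `sample_cells` and `concatenate_fastqs` are all hypothetical, and sampling with replacement is an assumption made so that rare cell types can still fill their quota. Paired-end reads and the salmon/kallisto call itself are left out.

```python
import random
import shutil
import sqlite3


def build_cell_db(rows):
    """Create an in-memory SQLite table from (cell_id, fastq_path, cell_type) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cells ("
        "cell_id TEXT PRIMARY KEY, fastq_path TEXT, cell_type TEXT)"
    )
    conn.executemany("INSERT INTO cells VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def sample_cells(conn, composition, n_cells, seed=0):
    """Sample cells so that cell types follow the requested fractions.

    composition maps cell type -> fraction (expected to sum to 1).
    Per-type counts are rounded, so the total can differ slightly from n_cells.
    Assumption: sampling is done with replacement.
    Returns a list of (cell_id, fastq_path) tuples.
    """
    rng = random.Random(seed)
    sampled = []
    for cell_type, fraction in composition.items():
        n_type = round(n_cells * fraction)
        if n_type == 0:
            continue
        rows = conn.execute(
            "SELECT cell_id, fastq_path FROM cells "
            "WHERE cell_type = ? ORDER BY cell_id",
            (cell_type,),
        ).fetchall()
        if not rows:
            raise ValueError(f"no cells annotated as {cell_type!r}")
        sampled.extend(rng.choices(rows, k=n_type))
    return sampled


def concatenate_fastqs(paths, out_path):
    """Append the raw bytes of each FASTQ file to out_path.

    Works for plain FASTQ and for gzipped FASTQ (concatenated gzip members
    form a valid gzip file), as long as the inputs are not a mix of both.
    """
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

A call such as `sample_cells(conn, {"T cells CD8": 0.7, "B cells": 0.3}, n_cells=100)` would return 100 (cell ID, path) pairs; passing the paths to `concatenate_fastqs` gives one file that can then be quantified with salmon or kallisto.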

How to store reads? Suggestion by Markus: use a k-mer approach (like the kallisto index, https://pachterlab.github.io/kallisto/manual), then use the k-mers to calculate TPM values. This could increase runtime and decrease the storage space required.

Questions on this:

How much would you expect to reduce the file size with respect to the (zipped) FASTQ files?
Can it slow down or complicate the downstream analysis (e.g. TPM and count quantification)?
Would it make sense to start with the read-based strategy and, if we have more time, convert it to a k-mer based approach?

FFinotello commented 2 years ago

I have to think about it...

But I have a few quick comments: we could

use CD8 T cells as a reference (CD4 T cells are sometimes problematic to define)
in the reference-scaled plots, put a horizontal line at 1 to easily spot when a factor is lower/higher than the reference
use the same y-axis limits to ease visual comparison
try option C: z-scored scores
remove the spike-in score which is anticorrelated to the other scores (ERCC counts)
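
The two transformations suggested above (scaling to a reference cell type and z-scoring) could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, the input is taken to be one method's scaling factors keyed by cell type, and the z-score is computed across cell types using the population standard deviation. The thread does not say which axis or which standard deviation was actually used.

```python
from statistics import mean, pstdev


def scale_to_reference(factors, reference="T cells CD8"):
    """Divide every scaling factor by the factor of the reference cell type.

    After scaling, the reference is exactly 1, so values above/below 1 are
    higher/lower than the reference.
    """
    ref_value = factors[reference]
    if ref_value == 0:
        raise ValueError("reference scaling factor is zero")
    return {cell_type: value / ref_value for cell_type, value in factors.items()}


def zscore(factors):
    """Z-score the scaling factors (assumption: population standard deviation)."""
    values = list(factors.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all scaling factors are identical")
    return {cell_type: (value - mu) / sigma for cell_type, value in factors.items()}
```

In a reference-scaled plot, the horizontal line at 1 then corresponds to the reference cell type itself.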

alex-d13 commented 2 years ago

I have now done that with T cells CD8 as the reference, showing only cell types with 10 occurrences. I am not sure about using the same y-axis (I do see the point for better visual comparison): for example, I had to remove the plasma cells from this plot, since Monaco scales them so high that they reach a value of 30 in this case. With a y-axis reaching up to 30, we would not see anything for the other cell types.

I also tried out z-scores: I used the 0-1 normalized scaling values to calculate them (is that OK?).
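
On the question in parentheses, a short derivation may help. It assumes the 0-1 (min-max) normalization and the z-score are computed over the same set of values $x_1, \dots, x_n$ with minimum $m$, maximum $M > m$, mean $\mu$ and standard deviation $\sigma > 0$; whether this matches the actual computation is not stated in the thread.

```latex
\begin{align*}
x_i' &= \frac{x_i - m}{M - m}, \qquad
\mu' = \frac{\mu - m}{M - m}, \qquad
\sigma' = \frac{\sigma}{M - m}, \\[4pt]
z_i' &= \frac{x_i' - \mu'}{\sigma'}
      = \frac{(x_i - \mu)/(M - m)}{\sigma/(M - m)}
      = \frac{x_i - \mu}{\sigma}
      = z_i .
\end{align*}
```

Under that assumption the z-scores are the same whether they are calculated from the raw or from the 0-1 normalized scaling values. If the normalization was done over a different set of values than the z-score (for example per method versus per cell type), this equality does not hold in general.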

I feel like the plot with the reference works quite well for seeing how many methods agree on the same "direction" (more/less than the reference), while the z-scores perhaps show a little better which methods really behave differently from the rest (even though for most cell types it's all over the place).