omnideconv / SimBu

Simulate pseudo-bulk RNAseq samples from scRNAseq expression data
http://omnideconv.org/SimBu/
GNU General Public License v3.0

Simulation Strategy #2

Closed alex-d13 closed 2 years ago

alex-d13 commented 3 years ago

Overall strategy: sample all reads of cells from a list of cell-ids with known cell-type
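To make the sampling step concrete, here is a minimal sketch. It assumes a per-cell count matrix as a stand-in for the raw reads, and the function name, arguments, and toy data are all hypothetical rather than taken from SimBu. Cells are drawn per cell type from the annotated ids and their profiles are summed into one pseudo-bulk sample.

```python
import numpy as np

def simulate_pseudo_bulk(counts, cell_types, fractions, n_cells, seed=0):
    """Sum the counts of cells sampled per cell type into one pseudo-bulk profile.

    counts:     (n_cells_total, n_genes) array of per-cell counts (stand-in for reads)
    cell_types: sequence of cell-type labels, one per row of `counts`
    fractions:  dict mapping cell type -> desired fraction of the sample
    n_cells:    total number of cells to draw
    """
    rng = np.random.default_rng(seed)
    cell_types = np.asarray(cell_types)
    bulk = np.zeros(counts.shape[1], dtype=np.int64)
    for cell_type, fraction in fractions.items():
        # Row indices of all cells annotated with this cell type.
        candidates = np.flatnonzero(cell_types == cell_type)
        n_draw = int(round(fraction * n_cells))
        # Assumption: sample with replacement, so small populations can still be drawn.
        chosen = rng.choice(candidates, size=n_draw, replace=True)
        bulk += counts[chosen].sum(axis=0)
    return bulk

# Toy data: 4 cells x 3 genes, two cell types.
counts = np.array([[5, 0, 1], [3, 1, 0], [0, 4, 2], [1, 6, 3]])
cell_types = ["T", "T", "B", "B"]
bulk = simulate_pseudo_bulk(counts, cell_types, {"T": 0.5, "B": 0.5}, n_cells=10)
```

With real reads, the same selection logic would apply, with the summation replaced by pooling the reads of the chosen cell ids.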

How to store reads? Suggestion by Markus: use a k-mer approach (like the kallisto index, https://pachterlab.github.io/kallisto/manual), then use the k-mers to calculate TPM values. This could increase runtime but decrease the required memory space.
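Whichever storage format is chosen, the final TPM step should look the same once per-transcript counts are available. The sketch below shows only that step, using the standard TPM definition. The function name and the toy numbers are hypothetical, and it does not reproduce kallisto's own estimation, which works with effective lengths and estimated counts.

```python
import numpy as np

def counts_to_tpm(counts, lengths_bp):
    """Convert per-transcript counts to TPM, given transcript lengths in base pairs."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_bp, dtype=float) / 1000.0
    # Length-normalize first (reads per kilobase) ...
    rate = counts / lengths_kb
    # ... then rescale so that all values sum to one million.
    return rate / rate.sum() * 1e6

# Toy example: counts proportional to length give equal TPM values.
tpm = counts_to_tpm([100, 200, 300], [1000, 2000, 3000])
```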

Questions on this:

  1. How much would you expect to reduce the file size with respect to the (zipped) FASTQ files?
  2. Can it slow down or complicate the downstream analysis (e.g. TPM and count quantification)?
  3. Would it make sense to start with the read-based strategy and, if we have more time, convert it to a k-mer based approach?

FFinotello commented 2 years ago

I have to think about it...

But I have a few quick comments: we could

alex-d13 commented 2 years ago

I have now done that with T cells CD8 as the reference, showing only cell-types with 10 occurrences. I am not sure about using the same y-axis (I see the point for better visual comparison): for example, I had to remove the plasma cells from this plot, because Monaco scales them so high that they reach a value of 30 in this case. With the y-axis reaching up to 30, we would not see anything for the other cell-types.

image
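For readers following along, the reference-based view can be sketched as below. The scaling values are made up for illustration and the function names are hypothetical. The log2 variant is an optional extra, not something used in the plot above: it centres "more/less than reference" around zero and compresses very large ratios.

```python
import math

# Hypothetical scaling factors for one method (not real results).
scaling = {"T cells CD8": 2.0, "B cells": 4.0, "Monocytes": 1.0}

def relative_to_reference(values, reference):
    """Divide every value by the value of the reference cell type."""
    ref = values[reference]
    return {name: value / ref for name, value in values.items()}

def log2_relative(values, reference):
    """Same ratio on a log2 scale: > 0 means more than reference, < 0 means less."""
    return {name: math.log2(ratio)
            for name, ratio in relative_to_reference(values, reference).items()}

rel = relative_to_reference(scaling, "T cells CD8")
log_rel = log2_relative(scaling, "T cells CD8")
```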

I also tried out z-scores: I used the 0-1 normalized scaling values to calculate them (is that ok?).

image
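On the "is that ok?" question, one point can be checked directly: a z-score is unchanged by shifting and positive rescaling, so computing it from 0-1 normalized values gives the same result as computing it from the raw values. This holds only if both steps run over the same set of values. The sketch below assumes that, uses made-up numbers, and does not know how the grouping was actually done for the plot.

```python
import numpy as np

def min_max(x):
    """Scale values linearly to the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def z_score(x):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Hypothetical scaling values of four methods for a single cell type.
raw = np.array([2.0, 4.0, 1.0, 8.0])
z_raw = z_score(raw)
z_scaled = z_score(min_max(raw))

# Both routes agree up to floating-point error.
same = np.allclose(z_raw, z_scaled)
```

If the 0-1 normalization was instead done per method and the z-scores per cell type, the two groupings differ and the results are no longer interchangeable.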

I feel like the plot with the reference works quite well for seeing how many methods agree on the same "direction" (more/less than the reference), while the z-scores maybe show a little better which methods really behave differently from the rest (even though for most cell-types it's all over the place...).