Simulation Strategy - Githubissues

alex-d13 commented 3 years ago

Overall strategy: sample all reads of cells from a list of cell-ids with known cell-type

a simple database (e.g. sqlite, even a flat textfile could do it) that stores: cell-id, path to fastqs, cell-type annotation
sample n cells from the list in the wanted cell-type composition
simply concatenate all the fastq files
run salmon/kallisto to obtain TPM.

How to store reads? Suggestion by Markus: use k-mer approach (like kallisto index https://pachterlab.github.io/kallisto/manual); then use k-mers to calculate TPM values. This could increase runtime and decrease memory space.

Questions on this:

How much would you expect to reduce the file size with respect to the (zipped) FASTQ files?
Can it slow down or complicate the downstream analysis (e.g. TPM and count quantification)?
Would it make sense to start with the read-based strategy and, if we have more time, convert it to a k-mer based approach?

mlist commented 3 years ago

Adding to the questions here, I think storing the kallisto index (generated with the pseudo command) might speed everything up as the pseudoalignment step has to be performed only once. I'd definitely give this a try. I'm not sure about space requirements but as FASTQ files have quite a bit of overhead it might be better, too. Please try it out so we know if it is a feasible approach.

alex-d13 commented 3 years ago

So regarding filesize, we will not gain anything. Kallisto indices are generally larger than the raw fastq-file (about 10-30%). I was wondering if I should - for paired end experiments - create the index for the foreward and reverse file together? Maybe that could save storage then.

Also just to get this straight for me: i would create the index for every fastq file/cell, then run kallisto pseudo on this index together with the corresponding fastq-file to get the pseudo alignments. But am I not aligning the reads onto the k-mers of those reads then? I think I am not getting something here..

grst commented 3 years ago

10-30% larger is still fine as we should get a significant speed boost! Is it possible to further compress the index (.tar.gz for instance)?

mlist commented 3 years ago

I'm not 100% sure about this but I think you possibly don't need to keep the index but we are fine with the equivalence class counts produced by the pseudo command. That should need less space since k-mers with the same equivalence class are aggregated as far as I understand -> no information should be lost as long as this is still done separately for each cell. Could you give that a try? If that doesn't work then Gregor is right, 10-30% more is still ok.

Best, Markus

On Wed, May 19, 2021 at 8:17 AM Gregor Sturm @.***> wrote:

10-30% larger is still fine as we should get a significant speed boost! Is it possible to further compress the index (.tar.gz for instance)?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/alex-d13/simulator/issues/2#issuecomment-843779350, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKWBXJ37D3SJCGWWKPQ7E3TONJYJANCNFSM444SMCHQ .

-- Dr. Markus List Head of the Research Group on Big Data in BioMedicine Chair of Experimental Bioinformatics TUM School of Life Sciences Weihenstephan Technical University of Munich (TUM) Freising-Weihenstephan, Germany

e-mail: @.*** tel: +49 8161 71 2761 www: http://biomedical-big-data.de orcid: https://orcid.org/0000-0002-0941-4168 twitter: @itisalist https://twitter.com/itisalist

alex-d13 commented 3 years ago

Hi, there are two new updates I have:

i was previously using kallisto index to build an index of a pair of fastq files and then use kallisto pseudo on these files and this created index to get the pseudoalignment. But I guess this makes not much sense, right? Rather I would have to use kallisto index once on the human reference transcriptome and then use kallisto pseudo with the fastq-pair and the reference index. I guess this is then what you meant with the "index" of the fastq-files, which we store.
Assuming this is correct, I tried kallisto like this: I used kallisto pseudo for three pairs of fastq files, each time creating the pseudo alignment of these files onto the reference transcriptome. And then I used kallisto merge on the three result folders. I am not 100% sure if I understand the files correctly (since I also cannot find any annotation of the result files), but what I get is this: matrix.cells matrix.ec matrix.tcc.mtx cells is just a column of cell names i used in the kallisto pseudo calls ec is something about the equivalence classes I believe, but the column annotation is missing. Basically two columns with numbers.. tcc.mtx i believe are transcript counts, but I am not sure. one column tells from which cell the transcript is from (1,2,3), the next the index of the transcript and then a count value. It looks something like this:
```
matrix.tcc.mtx
matrix_coordinate   real   general
3                   840477  82736
1                   161     1
1                   325     30
1                   341     16
1                   357     1
1                   451     42
1                   485     592
```
Now I am missing the names of the transcripts, but when I use kallisto pseudo --quant, I get a transcripts.txt file for each fastq-pair with a column of transcript names (where the row number is the index of each transcript i believe?). It does run the EM step then though for this single cell, which takes more time.

I found this https://github.com/pachterlab/kallisto/issues/275 post in the kallisto git, where I think they use these files to generate a transcript count table with a python script, tried it out and that works fine.

So in the end, I guess the approach works to get the transcript counts, i would have to calculate the TPM values by myself and i am not 100% sure if i understand every file correctly. Already posted an issue to the git repo, maybe they respond.

mlist commented 3 years ago

Hi Alex,

You need to generate the index only once for the reference transcriptome, alternatively use a ready-to-go index from their website https://github.com/pachterlab/kallisto-transcriptome-indices/releases

You can run kallisto pseudo to do the pseudoalignment but before you run EM (the quant) step you need to merge cells into a pseudobulk profile with kallisto merge before finally calling kallisto quant to run the EM algorithm. Step by step:

Generate reference index with kallisto index or use pre-built index
Pseudoalign cells individually (forward and reverse reads together) with kallisto pseudo
Select cells you want to include in your pseudo-bulk, e.g. select 500 cells with desired composition e.g. 50% cell type 1, 50% cell type 2. For these cells run kallisto merge which will aggregate the cell's EC counts and compute TPM values.

By the way, kallisto pseudo has a batch option which may simplify this even further. But I'm not sure about how this works.

Does that make sense?

Best, Markus

On Wed, May 19, 2021 at 4:25 PM Alexander Dietrich @.***> wrote:

Hi, there are two new updates I have:

i was previously using kallisto index to build an index of a pair of fastq files and then use kallisto pseudo on these files and this created index to get the pseudoalignment. But I guess this makes not much sense, right? Rather I would have to use kallisto index once on the human reference transcriptome and then use kallisto pseudo with the fastq-pair and the reference index. I guess this is then what you meant with the "index" of the fastq-files, which we store.

Assuming this is correct, I tried kallisto like this: I used kallisto pseudo for three pairs of fastq files, each time creating the pseudo alignment of these files onto the reference transcriptome. And then I used kallisto merge on the three result folders. I am not 100% sure if I understand the files correctly (since I also cannot find any annotation of the result files), but what I get is this: matrix.cells matrix.ec matrix.tcc.mtx cells is just a column of cell names i used in the kallisto pseudo calls ec is something about the equivalence classes I believe, but the column annotation is missing. Basically two columns with numbers.. tcc.mtx i believe are transcript counts, but I am not sure. one column tells from which cell the transcript is from (1,2,3), the next the index of the transcript and then a count value. It looks something like this:

matrix.tcc.mtx matrix_coordinate real general 3 840477 82736 1 161 1 1 325 30 1 341 16 1 357 1 1 451 42 1 485 592

Now I am missing the names of the transcripts, but when I use kallisto pseudo --quant, I get a transcripts.txt file for each fastq-pair with a column of transcript names (where the row number is the index of each transcript i believe?). It does run the EM step then though for this single cell, which takes more time.

I found this pachterlab/kallisto#275 https://github.com/pachterlab/kallisto/issues/275 post in the kallisto git, where I think they use these files to generate a transcript count table with a python script, tried it out and that works fine.

So in the end, I guess the approach works to get the transcript counts, i would have to calculate the TPM values by myself and i am not 100% sure if i understand every file correctly. Already posted an issue to the git repo, maybe they respond.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/alex-d13/simulator/issues/2#issuecomment-844156741, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKWBXJAK4KFWNKLLSGANG3TOPC6TANCNFSM444SMCHQ .

-- Dr. Markus List Head of the Research Group on Big Data in BioMedicine Chair of Experimental Bioinformatics TUM School of Life Sciences Weihenstephan Technical University of Munich (TUM) Freising-Weihenstephan, Germany

e-mail: @.*** tel: +49 8161 71 2761 www: http://biomedical-big-data.de orcid: https://orcid.org/0000-0002-0941-4168 twitter: @itisalist https://twitter.com/itisalist

alex-d13 commented 3 years ago

Hi Markus,

Yes, this is what I did (maybe I explained it a little bit confusingly), but I do not get TPM values out of kallisto merge. I only get these three files, which I explained earlier. I feel like this merging does not calculate anything new, it really just combines existing files. I also used the batch option, but how would you think that it could simplify things? I mean I could create this batch.txt file, where I tell kallisto pseudo, from which files it should calculate the pseudoalignment (50% cell-type1, 50% cell-type2). But dont we then loose the advantage of pre-calculation? Or did you mean to use this option in a different way?

Alex

FFinotello commented 3 years ago

I guess you can use the --quant option in kallisto pseudo, but then I think the EM algorithm is run on single cells, which is what we wanted to avoid.

But I have never tested this so I am not sure...

alex-d13 commented 3 years ago

@FFinotello yes, I tried that as well. Then I get abundances for each single cell, but the merging in the end still only provides me with the same 3 files.

mlist commented 3 years ago

I was not sure if the batch option really helps us, if it does not keep cells separated it does not. As for merge it said in the manual that TPM would be calculated but if that's not the case I'll have to investigate. I still think this is the right way to go.

Best, Markus

On Thu, May 20, 2021 at 9:43 AM Alexander Dietrich @.***> wrote:

@FFinotello https://github.com/FFinotello yes, I tried that as well. Then I get abundances for each single cell, but the merging in the end still only provides me with the same 3 files.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/alex-d13/simulator/issues/2#issuecomment-844817309, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKWBXOBY2SPXRZGBFOTLQTTOS4QJANCNFSM444SMCHQ .

-- Dr. Markus List Head of the Research Group on Big Data in BioMedicine Chair of Experimental Bioinformatics TUM School of Life Sciences Weihenstephan Technical University of Munich (TUM) Freising-Weihenstephan, Germany

e-mail: @.*** tel: +49 8161 71 2761 www: http://biomedical-big-data.de orcid: https://orcid.org/0000-0002-0941-4168 twitter: @itisalist https://twitter.com/itisalist

alex-d13 commented 3 years ago

Hi,

I looked at this topic again and tested the different kallisto options with 2 cells:

I don't think it is possible to generate TPMs with kallisto pseudo
i tried to run kallisto quant for both cells separately (which does create TPMs) and then merge them, but that does not work
when i run kallisto pseudo --quant for each cell separately, I only get read counts for the transcripts
i can then use kallisto merge to combine the results of both cells, but again they only output the count values (rounded now), now for both cells in a single file (i actually think all this option is doing is merging files, no calculations or anything even though the manual says so..)

Would it be a solution to calculate the TPM values by ourselfs from the counts which are collected by kallisto merge?

I also looked at salmon, they have an option to stop the programm before the quantification, but I would have to check out how these results look like and if they are useful.

grst commented 3 years ago

Would it be a solution to calculate the TPM values by ourselfs from the counts which are collected by kallisto merge

Getting TPM calculation right is not straightforward because you'll need an "effective gene length" based on the different isoform abudances. Formulas like this one are just hacks one can use if proper TPM are not available

There's also an option to run Salmon on BAM/CRAM files (only performing the EM step and calculating TPM). This is also super-fast, but yields (slightly) different result than when performing pseudoalignment.

Finally, @mlist, @alex-d13, could you please take a look at the census method Francesca mentioned in the other thread? They claim they can compute transcript counts from TPM values and I'm wondering if using the inferred transcript counts would be a more appropriate simulation strategy than relying on the number of detected reads per cell.

FFinotello commented 3 years ago

Hello crew!

I would like to follow up on this matter to be sure we are all on the same page.

Our initial idea for the pseudo-bulk simulation was to use the raw FASTQ files from a Smart-seq2 dataset so to have:

Full-transcript read coverage as in bulk RNA-seq;
Total-read count differences representing cell-type-specific mRNA bias. I would skip for now the Markus' cool idea of optimizing this approach using kmer indexing just for a sake of simplicity.

One major drawback of the approach above that we found was the limited correlation between total counts and (expected) mRNA in Smart-Seq2 datasets. Note: I am not completely sure about this, and I would rather investigate this further in additional Smart-Seq2 datasets.

We then considered another approach, based on the summation of precomputed gene counts (and TPM, as we can try both) scaled by some factors representing cell-type-specific mRNA content, so that cell type with more mRNA will get more counts. To compute this scaling factor, we can leverage the Census approach. Please note that we only need to estimate the scaling factor using their approach (i.e., the "total lysate mRNAs" M_i) for weighting the single-cell type contributions to the overall counts, but we don't need to apply the whole Census approach to transform gene data. As this approach is based on pre-computed data, we could also employ 10x data. And we can probably have expression profiles and scaling factors that are cell-specific, rather than cell-type specific.

One limitation of both approaches, especially for 10x data, is that low-to-medium expression genes are not represented at all in single-cell data, making pseudo-bulk data much less noisy than true bulk RNA-seq.

Does this reasoning make sense to you?

If yes, I think we could start implementing both approaches, possibly on a dataset:

That has matched Smart-seq2 and 10x data;
Was generated from blood or PBMC;
Has raw data and good cell type annotations.

Does this even exist? Or is there at least a good approximation of that? :)

Looking forward to discussing with you all, Francesca

FFinotello commented 3 years ago

Interesting tool and collection of well-annotated 10x PBMC datasets (including CITE-seq): https://www.biorxiv.org/content/10.1101/2021.05.20.445014v1.full.pdf

alex-d13 commented 3 years ago

Hi everyone,

I am currently working on implementing the code for the scaling factors and have a question: In the paper they talk about using 'the (log-transformed) TPM value in each cell i, written as x*i' and then 'simply estimate x*i as the mode of the log-transformed TPM distribution for cell i'. Now my question is: For the bulk samples, where we want to calculate the scaling factors with the TPM values, I do not have a TPM for each gene per cell, but only per sample right? Can I still apply their formula in the desired way or did I maybe not understand the formula correctly?

Cheers, Alex

grst commented 3 years ago

I'm not sure if I got it... By "scaling factors" you mean the census method? And why would you want to apply it to bulk samples?

alex-d13 commented 3 years ago

Oh yep..just thought about it as well and it does make no sense. The idea was to infer the scaling factors of scRNAseq datasets with census and compare them to bulk samples with known fractions. I think I got that mixed around, sorry.

alex-d13 commented 3 years ago

Hi,

I finished my implementation of census, in the end I mostly used the one from monocle but added a multi-core support for larger expression matrices. I also tried to implement the formula from the paper, but only got NaN values as result. I tried it on this Hao et al dataset of ~160k PBMC cells, that Francesca mentioned to me.

We then considered another approach, based on the summation of precomputed gene counts (and TPM, as we can try both) scaled by some factors representing cell-type-specific mRNA content, so that cell type with more mRNA will get more counts. To compute this scaling factor, we can leverage the Census approach.

But I think a problem is, that I am not using TPM values but rather gene expression counts. They also state in the monocle documentation that you should not use census on count data, only on TPM/FPKM.

I still tried to gather some of the results i got here, but I dont know how usefule they are: census_notebook.pdf

@grst you mentioned at some point that calculating TPM values from gene counts is not as precise, correct? Otherwise I would have calculated them for the dataset in a way.

grst commented 3 years ago

there's not such a thing as TPM for 10x data (there's no gene length information as only the 3' or 5' end of a gene is sequenced). But you'll get something equivalent by simply normalizing counts by library size (i.e. dividing the counts by the number of total counts per cell)

alex-d13 commented 3 years ago

Ah I did not know that. Ok, I will try this out next :)

FFinotello commented 3 years ago

Looking forward to seeing the results and how they compare with the previous mRNA scores we considered :)

alex-d13 commented 3 years ago

Hi, I have some results :)

So the paper I mentioned on which I am calculating the census counts, also provides mRNA counts which I compared first with the census counts. I am pretty surprised how good census works here, even though I used the normalized 10X counts, which they specifically recommend against in their documentation. (I applied census on the library size normalized expression counts and used a expression cutoff of 0, basically using all non-zero expression values; the original census as implemented in monocle uses 0.1 here, but this resulted in no remaining counts for the normalized data)

For the comparisons with other scores/fractions I made this scatter-plot again:

Finally the pearson correlation between the scaling factores/scores:

So it looks like only Maynard & Vento-Tormo do not agree at all with census, but we were skeptical about them anyways, right? Let me know what might be unclear or missing :) Cheers, Alex

grst commented 3 years ago

Thanks Alex!

So the paper I mentioned on which I am calculating the census counts, also provides mRNA counts which I compared first with the census counts.

Could you briefly summarize how they obtained "gold standard" mRNA counts?

So it looks like only Maynard & Vento-Tormo do not agree at all with census, but we were skeptical about them anyways, right? Let me know what might be unclear or missing :)

This is very interesting! That would basically suggest that our initial hypothesis that reads per cell are indicative of mRNA content does likely not hold.

alex-d13 commented 3 years ago

I did not really understand that technical part very well, I believe they applied DropSeq?

The pooled library was sequenced on an Illumina Nextseq (50 R1, 8 index, 34 R2). Post base calling, samples were aligned using a wrapper for DropSeqTools against the human reference hg19 to generate RNA counts matrices.

FFinotello commented 3 years ago

Hi @alex-d13, great job!

To really test our initial hypothesis as suggested by @grst, one could derive the scaling factors from the single-cell datasets using both approaches:

Tot reads per cell;
Census.

I would be curious to see how the correlation plot looks like if you include both.

One thing to keep in mind with the Smart-seq2 data is that you need to compute TPM in that case and not just scaled counts. I think you can use Kallisto for this.

One comment on the method scatterplot: you could even consider all the cell (sub)types that are in common between pairs of methods, so we have the full overview.

grst commented 3 years ago

I did not really understand that technical part very well, I believe they applied DropSeq?

The pooled library was sequenced on an Illumina Nextseq (50 R1, 8 index, 34 R2). Post base calling, samples were aligned using a wrapper for DropSeqTools against the human reference hg19 to generate RNA counts matrices.

That doesn't sound like anything special - so the mRNA counts are just regular gene expression values, no real quantification of mRNA molecules per cell.

FFinotello commented 3 years ago

@alex-d13 could you make also the scatterplot of the mRNA counts vs. census scores considering all cell types together (with proper colors)? I would be curious to see whether, besides the overall good correlation, Census is able to correctly capture the higher/lower mRNA content of certain cell types that should be visible as points not aligned to the diagonal.

alex-d13 commented 3 years ago

That doesn't sound like anything special - so the mRNA counts are just regular gene expression values, no real quantification of mRNA molecules per cell.

Yeah, thats what I suspected. So I am not sure how meaningful those comparisons are then.

could you make also the scatterplot of the mRNA counts vs. census scores considering all cell types together

@FFinotello do you mean the high-level cell-types (8 different ones) or deeper level (the ~30 facets in the plot)?

FFinotello commented 3 years ago

You could try the 8 major cell types... I guess our eyes are not so skill to distinguish 30 colors :-D

alex-d13 commented 3 years ago

@alex-d13 could you make also the scatterplot of the mRNA counts vs. census scores considering all cell types together (with proper colors)?

Seems like, census has some issues with B cells (probably because of the plasmablasts). census_vs_scaling_factors_ALL

I made some changes to the plots: Notably, I added more cell-types to the scatter-plots and used all cell-types which are present in both datasets to calculate the pairwise correlation. This does change the results quite a bit :) scaling_factors_scatter

census_correlation

Next steps will be to calculate census for the 2 smart-seq datasets (Maynard & Vento-Tormo) as well as do the basic reads per cell-type fractions for the Hao dataset, on which I currently only calculated the census counts.

FFinotello commented 3 years ago

Hi @alex-d13, thanks a lot for the results.

In the scatterplots, would it be possible to add correlation and use the full range for the x- and y-axis (e.g., it is difficult to see the relationship in the "quanTIseq vs. Monaco" plot)?

Thanks!

FFinotello commented 3 years ago

As for the non-perfect correlation, we cannot say it's an issue of the Census approach. It could be that Census is correctly normalizing for mRNA bias. We would need some validation data to prove that.

alex-d13 commented 3 years ago

In the scatterplots, would it be possible to add correlation and use the full range for the x- and y-axis (e.g., it is difficult to see the relationship in the "quanTIseq vs. Monaco" plot)?

Updated the plot above.

alex-d13 commented 3 years ago

I added mean census fractions for the Maynard and Vento-Tormo datasets. Also I made some cell-type differentiation more precise (for example before I combined 3 classes "Monocytes", "decidual Macrophages 1" & "decidual Macrophages 2" into the class "Monocytes" in Vento-Tormo; now i have them as 3 separate types). This made some correlations worse (mostly effected is Vento-Tormo). Edit: also added mean number of reads for Hao (but only used mapped reads here, for the other datasets i used all reads in a cell)

scaling_factors_scatter

And again the correlation plot: census_correlation

alex-d13 commented 3 years ago

Here is an overview of the spike-in experiments, that the census paper made (they basically followed the steps of this paper):

The spike-in transcripts are added in the same quantity to each cells lysate as "naked" RNA and are then reverse transcribed and amplified to build the cDNA library. They can then use the different expression profiles of these transcripts to convert the number of sequences transcripts to real abundances in the cells.

In the paper they used 8 datasets with ERCC spike-ins of various different tissues, one of which is also in my datasets list for Smart-seq2 sets (Segerstople et al.). They have 2.2k cells of the prancreas with FACS annotated cell-types (but not on a very detailed level; see Fig 1G ). I will try to integrate the census counts and spike-in controls into the correlation heatmap to get some kind of gold standard to compare to.

alex-d13 commented 3 years ago

A quick update on the Spike-in dataset:

I used the spike-in counts in two ways:

for each spike-in transcript (ERCC-00XXX), I calculated the scaling factor as the mean of sum(ERCC-counts)/sum(gene counts) for each cell-type. These are all the ERCC-..... columns in the heatmap.
i used the % of ERCC counts of the number of total reads per cell. This is the "Trivaglini (spike-in %)" column.

The results are quite interesting, the spike-in counts seem to go up for celltypes where other methods say that this celltype has less mRNA. I searched around a little bit for a better understanding of these Spike-ins and found this tutorial, where they use spike-ins to normalize cell-type specific biases (so exactly what we want to simulate). They used an easy way to calculate the normalization factors ("size-factors"), where they simply take the sum of spike-in counts for each cell, which I added as a third column:

mean (sum of Spike-in transcripts) per cell-type. This is the "Trivaglini (spike-in total)" column.

But somehow, this approach does not correlate with almost any of our other methods..

I think one issue could be, that these spike-ins are only to the cell lysate, so they are not effected by the different mRNA content of each cell. It seems to me like they are used mostly to detect technical issues, such as capture efficiency and sequencing depth between cells.

Still, the census paper has used spike-ins to validate the performance of the ability of their census algorithm to estimate mRNA contents of cells. But until now I did not find the code and explanation on how they did that/how they used the spike-in counts.

census_correlation

PS: @FFinotello I also added the median TPM values for the housekeeping gene you mentioned to me (PSMB2) for two datasets (Hao==CITESeq and Vento-Tormo==SmartSeq2; the Maynard dataset has no expression profile for this gene, they only have ~6000 genes)

grst commented 3 years ago

Thanks for the update!

Maynard should have all genes, I can check next week if I accidentally sent you a filtered file (leaving on holiday today)...

On Mon., Jul. 26, 2021, 13:56 Alexander Dietrich, @.***> wrote:

A quick update on the Spike-in dataset:

I used the spike-in counts in two ways:

for each spike-in transcript (ERCC-00XXX), I calculated the scaling factor as the mean of sum(ERCC-counts)/sum(gene counts) for each cell-type. These are all the ERCC-..... columns in the heatmap.

i used the % of ERCC counts of the number of total reads per cell. This is the "Trivaglini (spike-in %)" column.

The results are quite interesting, the spike-in counts seem to go up for celltypes where other methods say that this celltype has less mRNA. I searched around a little bit for a better understanding of these Spike-ins and found this tutorial http://bioinformatics.age.mpg.de/presentations-tutorials/presentations/modules/single-cell//bioconductor_tutorial.html#2.6.2-Computing-separate-size-factors-for-spike-in-transcripts, where they use spike-ins to normalize cell-type specific biases (so exactly what we want to simulate). They used an easy way to calculate the normalization factors ("size-factors"), where they simply take the sum of spike-in counts for each cell, which I added as a third column:

mean (sum of Spike-in transcripts) per cell-type. This is the "Trivaglini (spike-in total)" column.

But somehow, this approach does not correlate with almost any of our other methods..

I think one issue could be, that these spike-ins are only to the cell lysate, so they are not effected by the different mRNA content of each cell. It seems to me like they are used mostly to detect technical issues, such as capture efficiency and sequencing depth between cells.

Still, the census paper has used spike-ins to validate the performance of the ability of their census algorithm to estimate mRNA contents of cells. But until now I did not find the code and explanation on how they did that/how they used the spike-in counts.

[image: census_correlation] https://user-images.githubusercontent.com/40594032/126984766-d8c83dec-00f9-4fa9-9baf-454de39d4fc7.png

PS: @FFinotello https://github.com/FFinotello I also added the median TPM values for the housekeeping gene you mentioned to me (PSMB2) for two datasets (Hao==CITESeq and Vento-Tormo==SmartSeq2; the Maynard dataset has no expression profile for this gene, they only have ~6000 genes)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/alex-d13/simulator/issues/2#issuecomment-886636127, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABVZRV7YID4FLGP3OP5C6KTTZVEPDANCNFSM444SMCHQ .

FFinotello commented 3 years ago

Thanks @alex-d13! I will have a closer look later today and give you some feedback.

Could you add also a simplified plot without the correlations for single ERCC? Thanks!

As for the cell vs. tot-ERCC score, I would expect them to anticorrelate, as you have a fixed number of reads and the spike-ins "steal" a part of them? But I have to reason on this a bit more.

alex-d13 commented 3 years ago

Here the reduced heatmap:

census_correlation_reduced

grst commented 3 years ago

for each spike-in transcript (ERCC-00XXX), I calculated the scaling factor as the mean of sum(ERCC-counts)/sum(gene counts)

Maybe we need to do exactly the inverse? The reasoning is that the amount of spike-in is constant for each cell.

mlist commented 3 years ago

Intuitively, I agree with Gregor, the ERCC counts are your reference, e.g. the other way around.

On Mon, Jul 26, 2021 at 4:02 PM Gregor Sturm @.***> wrote:

for each spike-in transcript (ERCC-00XXX), I calculated the scaling factor as the mean of sum(ERCC-counts)/sum(gene counts)

Maybe we need to do exactly the inverse? The reasoning is that the amount of spike-in is constant for each cell.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/alex-d13/simulator/issues/2#issuecomment-886730197, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKWBXKHPWAQUSZFXYETRRTTZVTGNANCNFSM444SMCHQ .

-- Dr. Markus List Head of the Research Group on Big Data in BioMedicine Chair of Experimental Bioinformatics TUM School of Life Sciences Weihenstephan Technical University of Munich (TUM) Freising-Weihenstephan, Germany

e-mail: @.*** tel: +49 8161 71 2761 www: http://biomedical-big-data.de orcid: https://orcid.org/0000-0002-0941-4168 twitter: @itisalist https://twitter.com/itisalist

FFinotello commented 3 years ago

@grst this would be my thought exactly, but I could not find a description of how the spike-in were added to the libraries. Is any info of this kind available.

If the spike-in mix is added in an equal amount (i.e. total mRNA) to all libraries, I would expect the percentage of reads coming from the ERCC spike-ins to decrease as the endogenous mRNA (cell-type-specific) increases. This, assuming that the overall sequencing depth is comparable between cells.

alex-d13 commented 3 years ago

Maybe we need to do exactly the inverse? The reasoning is that the amount of spike-in is constant for each cell.

Mh ok I see. But the amount is in fact not constant for each cell.

but I could not find a description of how the spike-in were added to the libraries.

True, I also could not find anything in the paper regarding the spike-ins. I am not sure if they even used them in their analysis.

FFinotello commented 3 years ago

What is on the x-axis here?

On Mon, 26 Jul 2021, 16:51 Alexander Dietrich, @.***> wrote:

Maybe we need to do exactly the inverse? The reasoning is that the amount of spike-in is constant for each cell.

Mh ok I see. But the amount is in fact not constant for each cell. [image: image] https://user-images.githubusercontent.com/40594032/127009221-501f21d1-c6d4-4a6b-bab1-005f3d8a0c0c.png

but I could not find a description of how the spike-in were added to the libraries.

True, I also could not find anything in the paper regarding the spike-ins. I am not sure if they even used them in their analysis.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/alex-d13/simulator/issues/2#issuecomment-886771408, or unsubscribe https://github.com/notifications/unsubscribe-auth/AF6C62QMNNLNK4FOSDAWOL3TZVY7NANCNFSM444SMCHQ .

grst commented 3 years ago

I wouldn't expect it to be, that's why you add the spike-ins in the first place: To correct for technical biases introduced by the experiments ;)

If they were constant, you could use the counts directly instead.

On Mon., Jul. 26, 2021, 16:51 Alexander Dietrich, @.***> wrote:

Maybe we need to do exactly the inverse? The reasoning is that the amount of spike-in is constant for each cell.

Mh ok I see. But the amount is in fact not constant for each cell. [image: image] https://user-images.githubusercontent.com/40594032/127009221-501f21d1-c6d4-4a6b-bab1-005f3d8a0c0c.png

but I could not find a description of how the spike-in were added to the libraries.

True, I also could not find anything in the paper regarding the spike-ins. I am not sure if they even used them in their analysis.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/alex-d13/simulator/issues/2#issuecomment-886771408, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABVZRV5J5XNBICR62YEB2CLTZVY7NANCNFSM444SMCHQ .

alex-d13 commented 3 years ago

What is on the x-axis here?

for each of the ~9k cells i summed up the counts of all ~90 spike-in transcripts.

If they were constant, you could use the counts directly instead.

Ah yes, makes sense. I am only not sure if and how I should include spike-in transcripts, which have 0 counts, since there quite a lot of them.

alex-d13 commented 3 years ago

Hi, just a summary of the scaling factors, including complete plots:

Analyzed Datasets & Scaling factors:

epic: pre-calculated scaling factors
quantiseq: pre-calculated scaling factors (median expression of the Proteasome Subunit Beta 2 (PSMB2))
miltenyi: measured mRNA yields
monaco: pre-calculated scaling factors (RNA yield / n_cells)
Maynard: Smart-Seq2, Lung cancer, 20k cells
- Maynard (reads): mean(n_reads) per cell-type
- Maynard (census): mean(census_counts) per cell-type (applied on salmon generated TPMs; expression-cutoff=0.1; expected capture rate=0.25)
Vento-Tormo: Smart-seq2, Placenta, 70k cells
- Vento-Tormo (reads): mean(n_reads) per cell-type
- Vento-Tormo (census): mean(census_counts) per cell-type (applied on pre-calculated TPMs; expression-cutoff=0.1; expected capture rate=0.25)
- Vento-Tormo (PSMB2): mean(n_reads) per cell-type but only for gene PSMB2
Hao: CITE-seq, PBMC, 160k cells
- Hao (reads): mean(n_reads) per cell-type
- Hao (census): mean(census_counts) per cell-type (applied on expression counts, scaled by library size; expression-cutoff=0; expected capture rate=0.25)
- Hao (PSMB2): mean(n_reads) per cell-type but only for gene PSMB2
Trivaglini: 10x & Smart-seq2, lung + circulating blood, 75k cells, includes ERCC spike-in counts
- Trivaglini (reads): mean(n_reads) per cell-type
- Trivaglini (spike-in %): mean(sum of all ERCC counts / total reads) per cell-type
- Trivaglini (spike-in total): mean(sum(ERCC counts)) per cell-type
- Trivaglini (genes per ERCC): mean(gene_counts/sum(ERCC counts)) per cell-type

P1: pairwise pearson correlation between scaling factors (for some pairs only a few equal cell-types exist, this can reduce the significance of some correlation values)

census_correlation_reduced

P1.1: scatterplot of all available cell-types between each pair of scaling-factors https://github.com/alex-d13/simulator/blob/main/plots/scaling_factors_scatter.png

P2: re-scaled the scaling factors by the reference cell-type Monocytes (https://github.com/alex-d13/simulator/issues/3#issuecomment-844860659) scaling_factors_Monocytes

P3: re-scaled the scaling factors by the reference cell-type NK cells scaling_factors_NK cells

FFinotello commented 3 years ago

Among the ERCC scores, I'd say that it would make sense to consider the spike in %... But reversed as: (tot_counts - spike_counts)/tot_counts. This way, you quantify the endogenous mRNA content per cell.

alex-d13 commented 3 years ago

Here is an updated heatmap:

Travaglini (Spike-in: % not ERCC)

this is the method of using spike-ins that Francesca proposed.

The number of expressed genes and census have perfect correlation in all 4 datasets. So it does seem like census does not add much information here.. Also I decided to split up the Vento-Tormo dataset into the two tissues it contains. Read-counts are pretty bad to use for this dataset..

alex-d13 commented 3 years ago

Hi all,

regarding a comparison of scaling factors on the cell-type level: I think we decided to use one cell-type as "reference" and scale all other values according to it (https://github.com/omnideconv/simulator/issues/3#issuecomment-844860659), because if i just show the 0-1 scaled values, it might be harder to compare them if not all cell-types are shown for a dataset/method. I did this now for all 22 scaling-factor methods, once with T cells CD4 as reference (A) and once without reference, only showing the 0-1 scaled values (B) [I only display cell-types which occur >6 times].

However I feel like the second approach (B) gives a better representation of the real differences of scaling factors in cell-types, because for the other approach we basically only see how a scaling factor represents a cell, compared to the reference (T cells CD4). Thats why in many cells we see this huge bar for "Travaglini genes per spike-in", because this method has a big difference in the T cells CD4 but not necessarily in all the other cell-types. I wanted to include figure B now, to get a feeling how good/bad the scaling factors behave for the different cell-types.

What do you think? Or do you have maybe different ideas on this?

A

B

omnideconv / SimBu

Simulation Strategy #2