Simulation Implementation

alex-d13 commented 2 years ago

Hi all,

I have a view ideas & questions which i would like to discuss with you:

as a database from which to pull the cells, i would probably create a simple tsv file with cell_id, cell_type, file_name, n_reads, ...
i would write some R script, which calculates the cell-type fractions and writes a file with all the corresponding file names
those files i can pipe into kallisto or salmon to get the TPMs
a final R script would apply the scaling factors on those TPM values
to make it easier to run, I would probably build a snakemake pipeline for everything
regarding the number of reads per cell: I just compared 2 datasets, and they already do not correspond in read length. How would you handle it if cells with different read lengths will be sampled? The transcript quantification should still work right?
single/paired-end: if i sample a specific cell, I guess I will use both fastq-files of this cell (foreward & reverse) right?

During this I will again look into the whole kallisto merge topic, maybe I can find a solution for this to increase runtime/decrease memory space.

I would appreciate your inputs on these points :)

grst commented 2 years ago

Given everything we have learned, the question is if we still want to simulated based on FASTQ files, or if we are OK with just using the counts and aggregate them in one way or the other (i.e. census vs. sum).

For the former, I don't think an R package is suitable. It needs to be highly parallel and portable to HPC. For this a workflow manager (nextflow, snakemake) is most suitable.

If we just accept a count matrix (from smartseq2 or 10x or whatever), we can work with an R package.

alex-d13 commented 2 years ago

Wouldn't we have the problem then, that different datasets may have calculated their counts differently? For example the Maynard and Vento-Tormo datasets use different mappers for alignment (STAR and HISAT2 respectively). I am not sure how big this impact would be in the end though.

grst commented 2 years ago

Well, it would be left as an exercise to the user to provide counts preprocessed in a format they like. We could always refer to best-practice pipelines from nf-core as the recommended way to preprocess data.

alex-d13 commented 2 years ago

Ok I see. So I would have to either look for datasets processed in a similar way or do that part by myself for the first version, correct? Or just ignore the different pipelines for now, since in the end the user would fill the database by himself anyways?

grst commented 2 years ago

Maybe best to also discuss that with @mlist...

mlist commented 2 years ago

Yes, we should discuss this.

I'm still thinking about kallisto as a means to store data efficiently. kallisto actually merges reads into equivalence classes > reads that are compatible with the same transcripts are collapsed (and counted). This would be even more efficient than keeping track of the k-mer counts and scale to hundreds of millions of reads efficiently. A cell could than be stored as a vector of counts for the different equivalence classes. Alex, could you check out how the kallisto bus method works? Apparently it reports such equivalence classes but again the question is how we can get further from there.

If the kallisto approach fails we should probably stick to pre-processed count matrices and take into account that they might differ slightly for different pipelines.

Best, Markus

On Wed, Aug 4, 2021 at 11:52 AM Gregor Sturm @.***> wrote:

Maybe best to also discuss that with @mlist https://github.com/mlist...

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/alex-d13/simulator/issues/5#issuecomment-892523908, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKWBXPEYGB7FXDZIYJYZITT3EEWZANCNFSM5BPH2YYQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

-- Dr. Markus List Head of the Research Group on Big Data in BioMedicine Chair of Experimental Bioinformatics TUM School of Life Sciences Weihenstephan Technical University of Munich (TUM) Freising-Weihenstephan, Germany

e-mail: @.*** tel: +49 8161 71 2761 www: http://biomedical-big-data.de orcid: https://orcid.org/0000-0002-0941-4168 twitter: @itisalist https://twitter.com/itisalist

FFinotello commented 2 years ago

Hey guys, thanks for the discussion! I have a few comments/questions.

As I mentioned here, I would leave open the possibility to input the cell-type-annotated single-cell data (to be decided in which format, see below) together with some user-specified cell fractions per sample or even to select some simulations scenarios we have to implement as @grst did for the immunedeconv paper.
Regarding the format, maybe we could start from a simple approach where we aggregate just the library-normalized counts or TPM, depending on the technology, but scaled by the estimated mRNA content scores.
As mRNA content scores, we could offer four options: 1) no scaling, 2) census, 3) spike-in (if present in the data), 4) user-provided (e.g., one might provide quanTIseq, EPIC, Monaco's scores if they cover the cells of interests).

@alex-d13 I completely forgot whether the PBMC data you are using is also available as raw FASTQ files. It would be helpful to analyze them with Kallisto as @mlist proposed, also in order to check their information content when expression levels are considered for genes, transcripts, k-mers. Related question: is there a publicly available Smart-seq2 dataset from human PBMCs?

alex-d13 commented 2 years ago

Hi,

as for kallisto bus, I am not sure if it works with smart-seq2 data correctly. Initially kallisto bus does not support smart-seq2: They basically need to know the technology in order to know the base positions of barcode, UMI and sequence but there would be the option to add these positions manually if a not yet supported technology is used.

Now i read a little bit about smart-seq2 and apparently it is not easy to add UMIs to the reads, since you would have to add them after the reverse transcription. Also in this thread they state the issues with smart-seq2 and BUS.

I again tried around with kallisto pseudo, but I am only able to get equivalence counts for the cells, no transcript abundances and I did not find a way yet to run the EM algorithm separately on these equivalence classes with kallisto.

@FFinotello unfortunately, I did not find the access to the fastq files of the PBMC dataset I am currently using, apparently the database ID has no entry?

All raw sequencing data are deposited in the dbGaP under the accession number dbGaP: phs002315.v1.p1.

Should I write the authors on this or rather look for a different dataset? I might have found a PBMC dataset, which was analyzed with multiple RNA-seq technologies, inlcuding Smart-seq2: Systematic comparative analysis of single cell RNA-sequencing methods . Would that work?

FFinotello commented 2 years ago

Hey Alex, thanks!

Why would you need UMI information for Kallisto quantification?

As for the PBMC dataset, is it accessible (raw and processed counts) and annotated in terms of cell types?

mlist commented 2 years ago

Yeah, I think they focus on UMI-based techniques here. Smart-seq2 may not be suited for bustools at all. Would the approach generally work though?

On Thu, Aug 5, 2021 at 1:07 PM FFinotello @.***> wrote:

Hey Alex, thanks!

Why would you need UMI information for Kallisto quantification?

As for the PBMC dataset, is it accessible (raw and processed counts) and annotated in terms of cell types?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/alex-d13/simulator/issues/5#issuecomment-893370146, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKWBXPKHZXUZOBJX6KQXLDT3JWHTANCNFSM5BPH2YYQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

-- Dr. Markus List Head of the Research Group on Big Data in BioMedicine Chair of Experimental Bioinformatics TUM School of Life Sciences Weihenstephan Technical University of Munich (TUM) Freising-Weihenstephan, Germany

e-mail: @.*** tel: +49 8161 71 2761 www: http://biomedical-big-data.de orcid: https://orcid.org/0000-0002-0941-4168 twitter: @itisalist https://twitter.com/itisalist

FFinotello commented 2 years ago

The PBMB dataset does not look available (yet).

Maybe, for the kallisto kmer evaluation (NOT simulation) we could use the Maynard dataset.

@mlist how does that sound to you?

alex-d13 commented 2 years ago

Would the approach generally work though?

Mh I am not sure. I think the pre-calculation is always pretty straight forward (either with kallisto bus or kallisto pseudo), but then assembling multiple outputs into a single count matrix seems to be not so easy right now. The only solution I found right now would be to use kallisto merge on the kallisto pseudo results, but then we would only have equivalence class counts, no quantified transcripts.

But let me try around a little bit more, maybe i can find something :)

FFinotello commented 2 years ago

We could also have a look at other pseudoaligner like Salmon

mlist commented 2 years ago

A major question is actually if we need transcript counts at all, the purpose of the equivalence classes is to capture transcript compatibility of the different reads efficiently but if the signatures we generate are on gene-level we don't need to make it so complicated and can stick to gene counts per cell... do the second gen tools use transcript-level information?

@Francesca: for our k-mer evaluation this dataset should be fine.

On Thu, Aug 5, 2021 at 3:20 PM FFinotello @.***> wrote:

We could also have a look at other pseudoaligner like Salmon

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/alex-d13/simulator/issues/5#issuecomment-893454223, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKWBXI7XBZ3KPCO5ZKPVKTT3KFYNANCNFSM5BPH2YYQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

-- Dr. Markus List Head of the Research Group on Big Data in BioMedicine Chair of Experimental Bioinformatics TUM School of Life Sciences Weihenstephan Technical University of Munich (TUM) Freising-Weihenstephan, Germany

e-mail: @.*** tel: +49 8161 71 2761 www: http://biomedical-big-data.de orcid: https://orcid.org/0000-0002-0941-4168 twitter: @itisalist https://twitter.com/itisalist

alex-d13 commented 2 years ago

As for the PBMC dataset, is it accessible (raw and processed counts) and annotated in terms of cell types?

Unfortunately I can't find any annotation files. They did kNN cluster annotation though I can't find these in any meta or supplementary table.

Maybe, for the kallisto kmer evaluation (NOT simulation) we could use the Maynard dataset.

When talking about the kmer evaluation, I am not quite sure if I understand the concept here. Is it to compare the kallisto TPMs with some other form of transcript/gene counts for the dataset?

A major question is actually if we need transcript counts at all, the purpose of the equivalence classes is to capture transcript compatibility of the different reads efficiently but if the signatures we generate are on gene-level we don't need to make it so complicated and can stick to gene counts per cell... do the second gen tools use transcript-level information?

I think they use gene-level information (is that correct @FFinotello ?). But wouldn't we need to get these gene-counts also in some similar way as the transcript counts? So would this even help us for the whole kallisto story?

We could also have a look at other pseudoaligner like Salmon

I got a response from one of the authors, who I asked if it is possible to extract the k-mer counts of salmon:

Hi @alex-d13,

Do you mean the counts of k-mers in the underlying reference, or the counts of k-mers within reads, or something else? During indexing, the compacted de Bruijn graph index is created, and at that point extracting counts for reference k-mers is possible, though there is currently not a command for it. This is something that might be easy to add to the underlying pufferfish index, so please feel free to raise the issue there if this is your interest.

During mapping / alignment, the algorithm doesn't actually produce k-mer counts at any point since the selective-alignment algorithm is based on finding uniMEMs (maximal exact matches between reads and unitigs in the compacted dBG), chaining them together, and scoring potential alignments. So at this point, we're not really attempting to count matched k-mers but looking at a somewhat more sophisticated notion of mapping.

Best, Rob

--> So it seems to me like the k-mer counts of the input reads are not possible to extract with salmon.

I guess for now I will simply implement an approach with pre-calculated TPM matrices, where I combine the count-columns of the sampled cells. I guess it is important then to have TPM matrices with counts for all genes; @grst do you have the counts for the Maynard dataset for all genes? I currently only have access to a version with 6k genes.

grst commented 2 years ago

the PBMC dataset,

that's the one from the Satijalab? I can take another look if I find the annotations.

do you have the counts for the Maynard dataset for all genes

yes. I'll send you a link by email.

alex-d13 commented 2 years ago

that's the one from the Satijalab? I can take another look if I find the annotations.

For that I already have the annotations (its the "Hao" dataset which I used in the scaling factor analysis). But they are lacking the sequencing files. I was looking for the annotations of this dataset.

grst commented 2 years ago

They have some R code for automatically assigning the cluster labels here: https://bitbucket.org/jerry00/scumi-dev/src/master/R/assign_cluster.R

Given my bad experience with automatic assignment methods, we are maybe better off by re-annotating the datasets ourselves manually.

On Wed, 11 Aug 2021 at 20:57, Alexander Dietrich @.***> wrote:

that's the one from the Satijalab? I can take another look if I find the annotations.

For that I already have the annotations (its the "Hao" dataset which I used in the scaling factor analysis). But they are lacking the sequencing files. I was looking for the annotations of this https://doi.org/10.1101/632216 dataset.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/alex-d13/simulator/issues/5#issuecomment-897072097, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABVZRVZRQKRG55CYN2DWC7DT4LBZPANCNFSM5BPH2YYQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

alex-d13 commented 2 years ago

Yes, maybe the annotation would be a useful step, so that we finally have a full PBMC dataset with sequence files, annotation and TPMs. I don't have huge experience in this topic, what would be the best practices to perform such a cell-type annotation? Should I just follow a seurat vignette like this?

FFinotello commented 2 years ago

Hello @alex-d13,

Just to recapitulate, there are two lines of research at the present: simulator and k-mer signatures. Please, @mlist and @grst correct or integrate my suggestions below wherever needed.

Simulator

We could start from a couple of preprocessed 10x datasets where we can have gene TPM, counts, and some reasonably good cell-type annotations. I would use the Tabula Muris (mouse) and the human PBMC dataset (even the CITE-seq one). We should have all the data above, correct!? And, of course, we can remove any cells that appear to have low-quality data or ambiguous phenotypes. Maybe @grst can provide some guidance in this? Then, we could build a framework to generate pseudo-bulk profiles according to: 1) a specified cellular composition and 2) specific modeling of mRNA bias (none, Census, ERCC spike-ins, user-specified). This can be then extended to have, in addition to the user-specified cellular compositions, a more complex set of possible scenarios automatically modeled as in Gregor's paper (spillover, background, etc.) The original scripts should be available, correct @grst ?

k-mer sigantures

The idea is to see whether we can find cell-type-specific k-mers, computed with some pseudoaligner from Smart-seq2 data, which can be used for the signatures in spite of the simple gene counts.

Starting from Markus' original proposal, we are now facing two issues with pseudoaligners like Kallisto and Salmon, as they:

Work on 10x but not Smart-seq2 data.
Do not save k-mer counts.

Regarding 1, are we sure about this for both methods (or even alternative ones)?

As for 2, could you check better the definition of uni-MEM in the paper? I would say that they are also RNA sub-sequences (like k-mers, but with variable lengths) that could show some cell-type-specific patterns. Let's discuss this... especially in case I am overlooking or misinterpreting something ;-)

Cheers, Francesca

grst commented 2 years ago

I don't have huge experience in this topic, what would be the best practices to perform such a cell-type annotation? Should I just follow a seurat vignette like this?

Yes, unfortunately, this is still the way to go. Here's a set of immune-cell markers that I've been using so far: https://docs.google.com/spreadsheets/d/1beW-9oeM31P50NFvNLVvsdXlfh_tOjsmIrnwV2ZlxDU/edit#gid=825350711

If it's just PBMC data, you could also try the Azimuth app from the satijalab: https://azimuth.hubmapconsortium.org/ It projects new data onto an annotated reference dataset and looks promising.

And, of course, we can remove any cells that appear to have low-quality data or ambiguous phenotypes. Maybe @grst can provide some guidance in this?

I would expect the two datasets you mentioned above to be already filtered for low quality cells. Regarding the ambigous phenotypes I think we previously discussed an idea to look at the neighborhood graph and remove cells that have nearest neighbors in multiple cell-type clusters.

alex-d13 commented 2 years ago

Hi,

I have another question on sampling the cells which just came up: Should the total number of cells be the same for all pseudo-bulk samples? Or should that be an option, together with the cell-type fractions, to be modular for each simulated sample?

FFinotello commented 2 years ago

I'd say that the number of cells and total read counts per sample should also be input parameters (possibly to be specified also on a sample-by-sample manner)

omnideconv / SimBu

Simulation Implementation #5