pachterlab / kallistobustools

kallisto | bustools workflow for pre-processing single-cell RNA-seq data
https://kallistobus.tools/
MIT License
115 stars 29 forks source link

How does Kallisto bus tools handle multi-mapping reads? #15

Closed kikegoni closed 4 years ago

kikegoni commented 4 years ago

Hey,

First of all thanks for implementing the Kallisto | bustools for single cell RNA-seq analysis. I find it super helpful.

I was wondering how Kallisto | bustools deals with multi-mapping reads. From the GitHub manual it is explained that in the command

bustools count

there is an option to include reads mapping to multiple genes:

-m, --multimapping    Include bus records that pseudoalign to multiple genes

So from this we can assume that the default "bustools count" doesn't include the counts from multimapping reads, right? And that if I included the "-multimapping" option the multi-mapping reads would be split the counts across the genes it maps depending on the expression of each gene, right?

Thanks in advance,

Best,

Kike

gaofan83 commented 4 years ago

Hi Kike, Yes, under default mode, reads pseudoaligned to multiple genes are not counted. With -m option, a count will be split equally across pseudoaligned genes.

kikegoni commented 4 years ago

Hey @gaofan83 ,

Thanks for your answer. My concert is that, if we have a read that pseudoaligns to two different genes but with very different probabilities, would still be discarded? Let's say, after EM optimization, read1 pseudo aligns to geneA with 99% probability and to geneB with 1%. In the default mode, would that read be counted?

Thanks in advance,

Best,

Kike

gaofan83 commented 4 years ago

Hi Kike,

There is no “align with probability” for kallisto bus/ bustools. With regular rna seq the probabilities are found using the em algorithm and individual reads are not discarded. For single cell there is no em step.

kikegoni commented 4 years ago

Hey @gaofan83 ,

Thanks a lot for your explanations. Then Kallisto bus is only pseudoaligning and generating the output bus file and if a read maps to an equivalence class containing transcripts from two different genes in the "bustools count" step it would discard this read, right?

Would it correct it somehow if this reads was assigned to an equivalence class mapping 9 transcripts from geneA and 1 transcript from geneB? In that case, with the -m option, it would assigned 0.9 counts to geneA and 0.1 counts to geneB, right? Also, according to https://github.com/BUStools/bustools/releases it would be possible to perform EM with reads pseudoaligning to multiple genes with the option -em

The bustools count command adds the --em option that estimates gene abundances using an EM algorithm for reads that pseudoalign to multiple genes.

Note that the --multimapping option splits the read counts evenly across all genes, whereas the EM algorithm gives a more statistically valid answer. The two options are mutually exclusive.

Finally, just to clarify, if a UMI molecule pseudoaligns to an equivalence class containing multiple transcripts from the same gene, Kallisto | bustools would still count it just once, right?

Thanks a lot for all your help. I did not found any detailed explanation of what Kallisto and bustools are exactly doing.

Best,

Kike

gaofan83 commented 4 years ago

Had a discussion with bustools developer. With -m option the read is split between the two genes. Number of transcripts is irrelevant. Hope that helps.

joycekang commented 3 years ago

Hi! Thanks for developing this great tool. I would like to follow up on @kikegoni's last post above, particularly the part about the --em option in bustools:

The bustools count command adds the --em option that estimates gene abundances using an EM algorithm for reads that pseudoalign to multiple genes.

Note that the --multimapping option splits the read counts evenly across all genes, whereas the EM algorithm gives a more statistically valid answer. The two options are mutually exclusive.

Just wanted to confirm that this single-cell option does use the E-M step to estimate abundances (rather than splitting multi-mapping reads equally)? And, by that logic, it would typically always be preferred over the --multimapping option.

Thanks in advance!