pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License
631 stars 168 forks source link

Multimapping reads in Kallisto #339

Open kikegoni opened 2 years ago

kikegoni commented 2 years ago

Hey,

First of all thanks for developing and implementing Kallisto. It is an amazing tool!

I just want to confirm a couple of things regarding the assignment of reads mapping to two transcripts from two different genes.

1) In bulk RNA-seq mode, kallisto quant by default applies the EM algorithm to split the read distribution across the two different genes. Is it right? Also, is here a way to avoid counting these multimapping reads in bulk RNA-seq?

2) In scRNA-seq, by default, both kallisto bus and bustools count discard these multimapping reads, right? But in the bustools count step one can apply the -em or -m algorithm to make this assignment. Is that right?

Sorry for the sort of basic questions, I am comparing the performance of Kallisto to other methodologies and want to understand if some little differences are due to divergences in multimapping approaches.

Thanks in advances for all your help and again for implementing Kallisto and the bustools suite!!

Best,

Kike

Yenaled commented 2 years ago

Hello!

1) Correct, the EM algorithm is applied in kallisto quant. There's no way to avoid counting these multimapping reads in kallisto quant; doing would be a bad idea (it would underestimate transcript expression levels and give you less accurate gene expression estimates).

2) In scRNA-seq, by default, reads that map to multiple genes are discarded which, albeit not ideal, is generally fine given the sparsity of the data and the fact that everything is 3' end-tagged unlike bulk. We don't use the "bulk" EM algorithm in scRNA-seq because the EM algorithm tries to resolve transcript-level estimates (which you can't really resolve when you're only sequencing the 3' end) and because the EM algorithm tries to normalize for transcript length (which, again, is not applicable because you're only sequencing the 3' end). Instead, what bustools does is UMI-counting. You can use -m or --em to distribute UMI counts across multimapping genes (note: the --em here is different because than the bulk algorithm because it's gene-level, not transcript-level).

If you want to experiment around, however, you do have the option of quantifying bulk data with kallisto bus. Just run "kallisto bus" and put your fastq files after the command -- now you have your bulk data in a bus file and you can try out the various different ways of quantifying quantifying the data from a bus file (note: since bulk data doesn't have UMIs, be sure to use --cm when running "bustools count" which basically ignores UMI collapsing). Along the same lines, if you really want to try running the bulk EM algorithm using kallisto bus, you can run "bustools count" with -m and without the --genecounts option. This will output a TCC matrix which you can feed into the command: "kallisto quant-tcc" (which runs the bulk EM algorithm). We are still in the process if documenting these new features (putting bulk data into a bus file and running the bulk EM algorithm from a bus file) but I'm happy to answer any questions in the meantime.

kikegoni commented 2 years ago

Hey @Yenaled ,

Thanks a lot for your fast answer!!! Perfect, understood!! I will experiment around with kallisto bus applied to bulk RNA-seq!

Best,

Kike