pachterlab / kallistobustools

kallisto | bustools workflow for pre-processing single-cell RNA-seq data
https://kallistobus.tools/
MIT License

kallisto bustools with reference transcriptome #45

Open MartaBenegas opened 2 years ago

MartaBenegas commented 2 years ago

Dear team,

I'm a little bit confused about the build index step. The manual says that it builds a transcriptome index but needs as input a genomic fasta and a gff. I would like to create the count table using a reference transcriptome. Is this possible with kallisto + bustools?

Thank you, Marta.

Yenaled commented 2 years ago

kb ref makes a reference transcriptome from a genome fasta and gtf.

If you already have a transcriptome, there's no need to use kb ref. Simply use kallisto index -i index.idx reference_transcriptome.fasta to create your index (index.idx).

MartaBenegas commented 2 years ago

Thanks for the explanation!

MartaBenegas commented 2 years ago

Dear Delaney, sorry for reopening the issue.

In order to use kb count, I also need the transcript-to-gene mapping file. What kind of file is it? Is it a tab-separated file with the transcript in one column and the gene name in another?

Moreover, is there another option to perform the counting without using this file? I would like to use a de novo assembled transcriptome so I don't have this piece of information.

Thanks!

Yenaled commented 2 years ago

It's just a tab-separated file with the transcript in the first column and the gene name in the second column.

You need this file to perform the counting, but if you want, you can pretend that each transcript is its own gene (i.e. put the transcript name in both columns).
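A minimal sketch of how such a "each transcript is its own gene" file could be generated from the transcriptome FASTA. The function name is hypothetical, and it assumes the transcript name is the header text up to the first whitespace, which should match the names in the index:

```python
def write_identity_t2g(fasta_path, t2g_path):
    """Write a two-column, tab-separated file mapping each transcript to itself.

    Returns the number of transcripts written.
    """
    n_written = 0
    with open(fasta_path) as fasta, open(t2g_path, "w") as t2g:
        for line in fasta:
            if not line.startswith(">"):
                continue  # sequence line, not a header
            # Assumption: the transcript name is the header up to the first whitespace.
            fields = line[1:].split()
            if not fields:
                continue  # empty header, nothing to map
            name = fields[0]
            # Same name in both columns: the transcript acts as its own gene.
            t2g.write(f"{name}\t{name}\n")
            n_written += 1
    return n_written
```

For example, write_identity_t2g("reference_transcriptome.fasta", "t2g.txt") would produce a file (t2g.txt is just an illustrative name) that can be passed to kb count as the transcript-to-gene mapping.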

The main issue is that kb count will discard all multimappers (i.e. if a UMI maps to more than 1 gene, that UMI will not be counted). Thus, multimapping might be a big issue if you pretend each transcript belongs to a different gene.

There are ways around this (e.g. if you use the --tcc option in kb count, an EM algorithm will try to probabilistically figure out what to do with the multimappers). It basically boils down to: if you have a UMI associated with transcripts A, B, and C but have no gene-level information, how do you want to count that UMI?
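As a toy illustration of the discard rule described above (this is not the actual bustools code; the function, UMI names, and gene names are made up for the example), a UMI is counted only if all of its transcripts resolve to a single gene:

```python
from collections import Counter


def count_umis(umi_to_transcripts, t2g):
    """Count one UMI per gene, discarding UMIs whose transcripts span several genes."""
    counts = Counter()
    discarded = 0
    for transcripts in umi_to_transcripts.values():
        genes = {t2g[t] for t in transcripts}
        if len(genes) == 1:
            counts[next(iter(genes))] += 1
        else:
            discarded += 1  # multimapper: ambiguous at the gene level
    return counts, discarded


# Hypothetical UMIs and the transcripts each one is compatible with.
umis = {"UMI1": {"A"}, "UMI2": {"A", "B"}, "UMI3": {"A", "B", "C"}}

# With gene-level information, A and B share a gene, so UMI2 is kept.
with_genes = {"A": "g1", "B": "g1", "C": "g2"}
print(count_umis(umis, with_genes))  # Counter({'g1': 2}) and 1 discarded

# With each transcript as its own gene, UMI2 and UMI3 are both discarded.
identity = {t: t for t in "ABC"}
print(count_umis(umis, identity))  # Counter({'A': 1}) and 2 discarded
```

The same three UMIs lose one count with real gene groupings but two counts with the identity mapping, which is why closely related transcripts in a de novo assembly can cost a lot of data.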

MartaBenegas commented 2 years ago

Hi Delaney, thank you very much for your explanation! Now I see that multimappers really are an issue. I hadn't taken this into account, so thank you for pointing it out!

Is there a way to keep multimappers instead of discarding them, for example by assigning the count to the transcript with the most reliable alignment, or something similar?

To explain my context a little: I'm working with a non-model organism and I've obtained my own curated reference transcriptome. Now I would like to use it for single-cell analysis, so I was searching for a counting algorithm that works with a reference transcriptome. For the time being, I think I'll use your workaround to see how it behaves, and maybe perform sequence clustering on my transcriptome prior to the counting. I know it's not the perfect procedure, but I'll let you know how it goes :)