pachterlab / kallisto-D

BSD 2-Clause "Simplified" License

Build D-referenced index #7

Open kikegoni opened 1 year ago

kikegoni commented 1 year ago

Hey,

First of all, thanks for this new implementation of the kallisto software with the D-reference described in your recent paper, "Accurate quantification of single-nucleus and single-cell RNA-seq transcripts": https://www.biorxiv.org/content/10.1101/2022.12.02.518832v1

I am trying to build the kallisto index with the D-sequences you describe in the paper, but I do not know exactly how to do it. The paper says that you use all GRCh38 scaffolds. My concern is whether intronic sequences should also be included.

Thanks a lot for any help you might be able to provide.

Best,

Kike

Yenaled commented 1 year ago

Hello! Thanks for reading our paper! Eventually, kallisto-D will be merged into the original kallisto branch (we're currently revising the manuscript and adding/fixing some analyses, so be on the lookout for version 2 of the paper in the coming months!). kallisto-D is still under very active development (mostly adding new features and trying to maintain backwards compatibility) but the numbers you get from quantification should remain the same as (or at least nearly the same as) the eventual final release.

When running kallisto index for the purpose of quantifying single-cell data (which are predominantly mature/spliced transcripts), you can simply supply the entire genome FASTA to the D-list option (which is what we did).
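A minimal sketch of what such an invocation might look like, written as a small Python helper that only assembles the command line. The `-d` option is the one named later in this thread; `-i` is the usual kallisto flag for the output index and is assumed to be unchanged in kallisto-D. The file names are placeholders, not files from the paper.

```python
import shlex


def kallisto_index_command(index_out, transcriptome_fasta, dlist_fasta):
    """Build the argument list for a kallisto index run that uses a D-list FASTA.

    Assumption: `-i` names the output index (standard kallisto) and `-d`
    takes the D-list FASTA (as described in this thread).
    """
    return [
        "kallisto", "index",
        "-i", index_out,        # where the index is written
        "-d", dlist_fasta,      # D-list: here, the entire genome FASTA
        transcriptome_fasta,    # targets: the cDNA / spliced transcripts
    ]


# Placeholder file names, for illustration only.
cmd = kallisto_index_command("index.idx", "cdna.fa", "genome.fa")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` once kallisto-D is installed and the placeholder paths point at real files.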

kikegoni commented 1 year ago

Hey,

Thanks a lot for your fast answer. From the paper I understood that you add flanking k-mers to both ends of the exons in order to remove reads falling there (Figure 1C in your paper). So is it OK to assume that, when I pass the entire genome FASTA to the -d option (I am not sure whether with or without scaffolds), it automatically generates the flanking k-mers surrounding the exons?

Best, Kike

Yenaled commented 1 year ago

Yes, if you add the entire genome FASTA to the -d option, the flanking k-mers around the 'spliced transcript' will be generated, and those k-mers can distinguish reads originating from actual spliced transcripts (aka your index) vs. those that appear elsewhere in the genome. This makes read mapping more accurate.

Basically, when adding the entire genome FASTA to the -d option, a scan is done through that entire FASTA file to determine what k-mers should be added to the d-list (i.e. we look at sequences unique to the genome FASTA [aka those not in the cDNA index]).
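To make the idea concrete, here is a toy sketch of that kind of scan. It is a simplified illustration, not the kallisto-D implementation: it only looks at the forward strand, holds everything in memory, and uses made-up sequences and a tiny k. The function names are hypothetical.

```python
def kmers(seq, k):
    """Return the k-mers of seq in order (forward strand only)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def flanking_kmers(genome, transcripts, k):
    """Collect genome-only k-mers that sit directly next to a k-mer in the index."""
    index = {km for t in transcripts for km in kmers(t, k)}
    genome_kmers = kmers(genome, k)
    shared = [km in index for km in genome_kmers]

    flanks = set()
    for i, is_shared in enumerate(shared):
        if is_shared:
            continue  # k-mer is already in the cDNA index
        # Keep a genome-only k-mer when its left or right neighbour is in the index.
        left = i > 0 and shared[i - 1]
        right = i + 1 < len(shared) and shared[i + 1]
        if left or right:
            flanks.add(genome_kmers[i])
    return flanks


# Toy example: the "transcript" ACGTA is embedded in a longer "genome".
result = flanking_kmers("TTTACGTAGGG", ["ACGTA"], k=4)
print(sorted(result))  # ['GTAG', 'TACG']
```

A read that contains one of these flanking k-mers extends past the sequence shared with the index, which is the signal that it did not come from the spliced transcript.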

kikegoni commented 1 year ago

Perfect, understood!! I will give it a try in the upcoming days! Seems very useful!

Thanks a lot for your detailed explanation.

Best,

Kike