pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License
141 stars 24 forks source link

Counting overlapping genes #173

Closed jkniehaus closed 1 year ago

jkniehaus commented 1 year ago

Hello,

First off, thank you for your tool! I have a quick question about pseudo alignment.

I'm using kb_python to pseudoalign single-cell RNA-seq data generated from mouse tissue using SMARTSEQ2 technology. I'm curious to know how Kallisto quantifies overlapping genes/exons. A gene of interest contains exons that overlap with exons from an antisense gene. Given that SMARTSEQ2 isn't stranded, I wonder whether these reads are counted and, if so, how they are assigned to either the sense vs antisense gene.

I think gene/isoform assignment could be estimated by comparing the abundance of reads in non-overlapping regions. However, I haven't been able to determine whether this is the case with kallisto.

Additionally, if I use an RNA velocity workflow, would intronic reads in overlapping regions be assigned?

I came here with this question since I'm using kb_python, but could move over to the kallisto GitHub if that's more appropriate. Thanks! Jesse

ps. as a ref, here's the command I'm using to pseudoalign kb count --verbose \ -i /Jesse/references/transcriptome_UTR_GTE5_2.idx \ -g /Jesse/references/Kallisto/transcriptome/t2g_UTR_GTE5.txt \ -x SMARTSEQ2 \ --parity paired \ --tcc \ -o kallisto/ \ -t 60 \ -m 90G \ seq_batch.tsv

Yenaled commented 1 year ago

In unstranded mode, if a read maps to the sense strand of one gene and the antisense strand of another gene and the region of mapping is where the two genes completely overlap, that read will be counted as multimapping and the EM algorithm will divide up the counts among the two genes.

If there is a non-overlapping region that the read maps to, then that will allow the correct gene to be assigned to that read.

See the first figure of the kallisto paper to understand how multimapping works in kallisto (e.g. when transcripts overlap, how the set-intersection works for non-overlapping regions, etc.). The k-mer lookup is strand-agnostic (e.g. both the k-mer and its reverse complement are considered identical) -- so in stranded mode (like when we map 10X data), some extra work is done to account for strandedness.

jkniehaus commented 1 year ago

Thank you for the quick and clear response. Makes sense. I will look into the pubs for further details.