Getting different counts when running the standard workflow vs the nucleus workflow

krtitus commented 1 year ago

Hello,

I am currently analyzing bulk, nuclear RNA-seq data using this tutorial

I have run the following command to get the reference index for Kallisto, which ran with no problem:

kb ref -i index.idx -g t2g.txt -f1 cdna.fa -f2 intron.fa -c1 cdna_t2c.txt -c2 intron_t2c.txt --workflow nucleus GRCh38.primary_assembly.genome.fa gencode.v42.chr_patch_hapl_scaff.annotation.gtf

Then, the tutorial mentions using --workflow nucleus to count the pseudoaligned reads

kb count -i index.idx -g t2g.txt -c1 cdna_t2c.txt -c2 intron_t2c.txt -x Bulk -o output -t 16 --h5ad --workflow nucleus --parity paired batch.txt

I looked into the code to see what this workflow is doing, and from my understanding, the nucleus workflow runs the function count_velocity. The documentation for count_velocity in count.py says "Generates RNA velocity matrices for single-cell RNA seq". In the tutorial, it also says that the nucleus workflow merges the spliced and unspliced matrices into a single counts matrix, but I'm not sure if this is correct for my dataset.

With that said, I also ran kb count using --workflow standard , which runs the count function. In the documentation for count it says "Generates count matrices for single-cell RNA seq." When I re-run with the new flag it doesn't produce an error or anything, but what I noticed that the counts are generally higher under this workflow than the nucleus workflow.

Is there an explanation for this discrepancy? If so, which workflow should I use if I'm only interested in counts?

Yenaled commented 1 year ago

Do you want counts of nascent transcripts (which exist in the nucleus)? Then use workflow nucleus. Do you want counts of mature processed transcripts (which exist in cytoplasm)? Then use workflow standard.

krtitus commented 1 year ago

Hi @Yenaled, I am interested in counting both exon and intron reads of nuclear transcripts, so I suppose I'll stick with workflow nucleus then. However, it still remains unclear to me conceptually why the workflow standard results in higher counts at certain genes than workflow nucleus.

Yenaled commented 1 year ago

It's counterintuitive since in many cases, you're adding reads mapped to the exon of the gene PLUS the reads mapped to the intron of the gene which would result on higher counts for that gene in nucleus workflow. But it's not as simple as 1+1=2.

Exons represent a tiny portion of the genome, nuclear transcripts represent a more substantial portion.

Therefore, with nucleus workflow, there is a higher likelihood of read ambiguity and also certain reads being better assigned elsewhere (e.g. to another gene).

As a thought experiment: Let's say the only thing in your index is a 31-bp exonic sequence [call this workflow standard]. If you run kallisto on that, you'll get high counts (basically every read containing the 31-bp sequence). Now let's say you index that 31-bp with the entire genome (including all nuclear transcripts) [workflow nucleus]. You'll get lower counts for that exon because some reads having a 31-bp overlap with that exon will not be counted as an exonic read for that particular gene (either because those reads map better to another gene, possibly a nuclear transcript, in the genome or because those reads map well to both the 31-bp exon and the other gene and hence be classified as ambiguous). Thus, workflow nucleus can result in lower counts than workflow standard.

We have a preprint out on this (how to accurately quantify nucleus vs cellular transcripts) which we're currently revising for version 2 of the preprint.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days

pachterlab / kb_python

Getting different counts when running the standard workflow vs the nucleus workflow #198