Are some cDNAs omitted from the kallisto index?

jamesboot commented 3 years ago

Hi,

During some comparisons of STAR and kallisto we saw that the number of genes/transcripts that STAR aligns to and outputs is much larger than kallisto. This obviously makes sense and is expected, as STAR is aligning to the whole genome (human) whereas kallisto is only pseudo-aligning to the transcriptome/cDNA reference. However, on further investigation we found that there were approximately 20,000 human genes that STAR aligns to that are in the human cDNA reference - yet kallisto does not pseudo-align to these genes.

We were just wondering why this might be the case - are certain cDNA species omitted when the kallisto index is built? And if so why are some cDNA species omitted?

Thanks for your time in advance!

Yenaled commented 3 years ago

No cDNA species are omitted and you should not be observing such results (as they contradict the results of numerous benchmarks performed in the literature). Further, it is not true that it would make sense: an RNA-seq library consists of cDNA fragments (which is what kallisto maps against) and does not contain the whole genome. Can you show how you're building the kallisto index and which files you're using? And can you also show the kallisto pseudoalignment command(s) you're running and how you're summarizing transcript results to the gene-level?

jamesboot commented 3 years ago

For building kallisto index I'm using the Ensembl Homo_sapiens.GRCh38.cdna.all.fa , below is the command: kallisto index -i /path/to/index /path/to/Homo_sapiens.GRCh38.cdna.all.fa.gz

Running psuedo-alignment: kallisto quant -i /path/to/index/ \ -o $SUBDIR \ -b 100 \ -t 4 \ $READS1 $READS2

Using tximport to import transcript abundances and summarise to gene level.

Yenaled commented 3 years ago

Thanks the for the additional information. That Ensembl FASTA file does not contain any non-coding RNAs. Those will be missing (there are a little over 20,000 of those genes).

jamesboot commented 3 years ago

That makes sense, thanks for your help. Is there are cDNA FASTA file you would recommend using?

Yenaled commented 3 years ago

I recommend using our kb tool to create indices:

https://www.kallistobus.tools/kb_usage/kb_usage.html

Install kb-python and then download the genome primary assembly and GTF file to create a transcriptome index.

Alternately, you can simply append the non-coding RNA FASTA file to your cDNA FASTA file.

yfarjoun commented 1 year ago

I am in a similar situation where I see read counts over some genes using cellranger but not with Kallisto. This seems to be more true with genes that overlap other genes in the GTF, though it's not restricted to that subset of genes. Here's an informative example: I have 18837 reads with Cellranger and 0 with Kallisto over the gene TOMM6. In order to try understand why I'm getting no counts from Kallisto, I ran kallisto-quant with -genomebam and opened both that and the cellranger bam in igv:

As you can see, cell ranger has lots of reads, while kallisto doesn't.

the one hint I might have is that repeatmasker has a large repeat in one of TOMM6's introns (https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr6%3A41785425%2D41792135&hgsid=1589958155_arWO45JdxukXVHPADdI7ipyfW2y3) though I don't quite understand why a repeat in an intron should cause missed alignments.

I tried using the premade Kallisto reference and also one that I made from ensembl v109:

wget https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
gunzip Homo_sapiens.GRCh38.cdna.all.fa.gz 
kallisto index -i Homo_sapiens_GRCh38_ensembl_v109 Homo_sapiens.GRCh38.cdna.all.fa
wget https://github.com/pachterlab/kallisto-transcriptome-indices/releases/download/ensembl-96/t2g.py
wget https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.chr.gtf.gz
gunzip Homo_sapiens.GRCh38.109.chr.gtf.gz

kallisto index -i Homo_sapiens_GRCh38_ensembl_v109 Homo_sapiens.GRCh38.cdna.all.fa
./t2g.py  <Homo_sapiens.GRCh38.109.chr.gtf.gz > transcripts_to_genes.HomoSapiens.GRCh38.109.chr.txt

but the results were the same.

I'd love to hear your advice regarding how I might be able to get these read counts back to where they seem to be.

Many thanks,

Yossi.

Yenaled commented 1 year ago

Can you pull out a few of those TOMM6 reads and show me the sequences that aren't getting mapped?

I just ran kallisto on a few TOMM6 sequences (e.g. ATGGCTTCCAGCACTGTCCCGGTGAGCGCTGCTGGCTCGGCTAATGAAAC) and the alignment seemed fine.

Can you also show me the exact command you're using to map reads?

Also what's the t2g.py file?

yfarjoun commented 1 year ago

Thanks for the quick response!

I'm finding out if I can share some offending reads.

In the meantime I'm re-runing using the indexes from the Kallisto release. (ensembl 96?)

What option did you use to map the "few TOMM6" sequences? kallisto quant? I'm asking because when I tried with just one paired read I got a seg-fault.

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 2,343,263
[index] number of k-mers: 1,528,899,941
[index] number of equivalence classes: 10,634,088
Segmentation fault (core dumped)

Yenaled commented 1 year ago

For mapping, I just used kallisto quant (--single -l 1 -s 1). You can also use genomebam with this option. Try it with the ensembl v109 index. Make sure, when testing, your reads are in FASTQ format (including the quality score string being the same length as your read length).

yfarjoun commented 1 year ago

Thanks. I'll try it. You didn't need to use the multimapping or em options?

How many reads did you align? (I tried with just one read)

yfarjoun commented 1 year ago

Actually, upgrading to the latest version of kallisto seems to have alleviated this problem completely....sorry for the trouble.

Yenaled commented 1 year ago

Yay! No problem, glad to hear! :)

pachterlab / kallisto

Are some cDNAs omitted from the kallisto index? #324