pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License
647 stars 170 forks source link

sequences dropped from the index #205

Closed fklirono closed 5 years ago

fklirono commented 5 years ago

Hello,

kallisto (0.44.0) seems to be silently dropping sequences from the index.

Working example:

Is there a reason why some sequences are not indexed?

Code to reproduce example:

wget 'http://www.circbase.org/download/human_hg19_circRNAs_putative_spliced_sequence.fa.gz' | gzip -d -c > human_hg19_circRNAs_putative_spliced_sequence.fa

sed -n '/^>/p' human_hg19_circRNAs_putative_spliced_sequence.fa | wc -l 

kallisto index -i human_hg19_circRNAs_putative_spliced_sequence.fa.fai human_hg19_circRNAs_putative_spliced_sequence.fa

kallisto inspect human_hg19_circRNAs_putative_spliced_sequence.fa.fai
pmelsted commented 5 years ago

There are 58 sequences that are empty seqtk comp human_hg19_circRNAs_putative_spliced_sequence.fa.gz | awk '$2==0 {print NR}' | wc -l The first one is on sequence number 92510. kallisto index stops reading through the fasta file once it encounters an empty sequence, but it should just skip it. I'll fix this in the next release

If you remove the empty sequences kallisto will index the rest. Note that if you are working with circular rna you might want to circularize the fasta file based on the kmer length by adding the first k-1 bases of the fasta sequence to the end. That way kallisto will find k-mers that span the junction as well.

fklirono commented 5 years ago

Thanks for the very fast response! I also noticed the empty sequences just now when I started looking into this.

Thanks also for the suggestion to circularize the sequences. I am experimenting with kallisto and Oxford nanopore long reads for circRNAs. Any more suggestions would be very welcome!