sequences dropped from the index

fklirono commented 5 years ago

Hello,

kallisto (0.44.0) seems to be silently dropping sequences from the index.

Working example:

download circBase hg19 circRNA sequences.
count them (140790)
index them and count again how many targets are included in the index (92509)

Is there a reason why some sequences are not indexed?

Code to reproduce example:

wget 'http://www.circbase.org/download/human_hg19_circRNAs_putative_spliced_sequence.fa.gz' | gzip -d -c > human_hg19_circRNAs_putative_spliced_sequence.fa

sed -n '/^>/p' human_hg19_circRNAs_putative_spliced_sequence.fa | wc -l 

kallisto index -i human_hg19_circRNAs_putative_spliced_sequence.fa.fai human_hg19_circRNAs_putative_spliced_sequence.fa

kallisto inspect human_hg19_circRNAs_putative_spliced_sequence.fa.fai

pmelsted commented 5 years ago

There are 58 sequences that are empty seqtk comp human_hg19_circRNAs_putative_spliced_sequence.fa.gz | awk '$2==0 {print NR}' | wc -l The first one is on sequence number 92510. kallisto index stops reading through the fasta file once it encounters an empty sequence, but it should just skip it. I'll fix this in the next release

If you remove the empty sequences kallisto will index the rest. Note that if you are working with circular rna you might want to circularize the fasta file based on the kmer length by adding the first k-1 bases of the fasta sequence to the end. That way kallisto will find k-mers that span the junction as well.

fklirono commented 5 years ago

Thanks for the very fast response! I also noticed the empty sequences just now when I started looking into this.

Thanks also for the suggestion to circularize the sequences. I am experimenting with kallisto and Oxford nanopore long reads for circRNAs. Any more suggestions would be very welcome!

pachterlab / kallisto

sequences dropped from the index #205