pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License

Non-ACGUT Characters #176

Open amanpatel101 opened 6 years ago

amanpatel101 commented 6 years ago

Kallisto seems to detect a large number of non-ACGUT characters when none exist. When I try to create an index using kallisto index, this is part of the output:

[build] warning: replaced 5334758 non-ACGUT characters in the input sequence with pseudorandom nucleotides

The counting kmers step also takes an exorbitantly long time.

I'm perplexed because comparably large indices have been created in a fraction of the time and with very few ambiguous characters. I have tracked down a region of my database file where there is supposed to be a non-ACGUT character, and there certainly isn't one.

Any advice would be greatly appreciated. Thanks in advance!

mschilli87 commented 6 years ago

@amanpatel101: The shortest possible FASTA record without ACGUT that results in kallisto reporting > 0 non-ACGUT characters upon indexing this single transcript would likely help to understand and fix the issue. :wink:
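One way to look for such a record is to count the non-ACGUT characters per FASTA record yourself and pick the shortest record that has any. The script below is only a rough sketch, not kallisto's own logic: the function names are made up for illustration, and it assumes that lowercase bases should be treated like their uppercase counterparts, which may not match what kallisto does internally.

```python
import io

ALLOWED = set("ACGUT")

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def count_non_acgut(seq):
    """Count characters other than A, C, G, U, T (case-insensitive, an assumption)."""
    return sum(1 for base in seq.upper() if base not in ALLOWED)

def shortest_offending_record(handle):
    """Return (header, sequence, count) for the shortest record with count > 0, or None."""
    best = None
    for header, seq in read_fasta(handle):
        n = count_non_acgut(seq)
        if n > 0 and (best is None or len(seq) < len(best[1])):
            best = (header, seq, n)
    return best

# Toy input; for a real file, pass open("your_file.fa") instead.
example = io.StringIO(">tx1\nACGT\nACGT\n>tx2\nACGNNT\n>tx3\nACGTRYACGTAC\n")
print(shortest_offending_record(example))  # ('tx2', 'ACGNNT', 2)
```

If this reports no offending record while kallisto still reports replacements for the same file, that single-record file would be a good minimal example to attach here.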

thu1911 commented 5 years ago

I encountered a similar problem. In my case, the cause was that I had used a genome reference FASTA when it should have been a transcript reference FASTA.
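A quick sanity check for this mix-up is to summarize the FASTA before indexing. The sketch below is just a heuristic and not part of kallisto: the function names and the 1,000,000-base cutoff are assumptions chosen for illustration, based on the idea that genome files tend to contain a few very long records while transcript files contain many much shorter ones.

```python
import io

def fasta_summary(handle):
    """Return basic statistics for a FASTA text stream."""
    lengths, non_acgut = [], 0
    current = None  # length of the record being read; None before the first header
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
            # Case-insensitive count of characters other than A, C, G, U, T.
            non_acgut += sum(1 for b in line.upper() if b not in "ACGUT")
    if current is not None:
        lengths.append(current)
    total = sum(lengths)
    return {
        "records": len(lengths),
        "total_bases": total,
        "longest_record": max(lengths, default=0),
        "non_acgut": non_acgut,
        "non_acgut_fraction": non_acgut / total if total else 0.0,
    }

def looks_like_genome(summary, max_transcript_length=1_000_000):
    """Heuristic only: flag files whose longest record exceeds a hypothetical cutoff."""
    return summary["longest_record"] > max_transcript_length

# Toy input; for a real file, pass open("your_file.fa") instead.
example = io.StringIO(">chr1\n" + "N" * 10 + "ACGT" * 5 + "\n>chr2\nACGT\n")
summary = fasta_summary(example)
print(summary["records"], summary["longest_record"], summary["non_acgut"])  # 2 30 10
```

A file with only a handful of records, a very long longest record, and a high non-ACGUT fraction is probably a genome assembly rather than a set of transcripts.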