onecodex / needletail

Fast FASTX parsing and k-mer methods in Rust
MIT License
174 stars 20 forks source link

bit_kmers function will panic at 'attempt to multiply with overflow', if k-mer length is longer than 32 bp. #58

Open yiolino opened 2 years ago

yiolino commented 2 years ago

Thank you for great software.

I want to use bit_kmer function to count k-mer in fastq file. But, if k-mer length is longer than 32 bases, it will panic.

I think this is because the bit_kmer sequence is represented as a u64 type (type BitKmerSeq = u64), but is there a method to perform k-mer counts over 32 bp?

I am using HashMap as a database for k-mer counts, but if I use fastq files as input, HashMap becomes too large. So, I would like to use bit_kmer to reduce it as much as possible. Is there an alternative method that could be considered, such as "use u128 type"?

Regards, tetsuro90

Keats commented 2 years ago

Hi,

We probably are going to revamp the bit kmers we have as we just built a pretty huge project using needletail and ended up re-implementing the bit encoding to be more flexible. You're right that it would need u128 instead of u64 for storing kmers > 31bp. For now I would recommend writing your own 2bit encoding function returning a u128 and do not use the built-in bit kmers iterator.

yiolino commented 2 years ago

@Keats Thank you for your reply.

I understand and looking forward to re-implementing the bit encoding!

Regards,

natir commented 2 years ago

You can check kmers bit encoding https://github.com/COMBINE-lab/kmers

Keats commented 2 years ago

I will have to take a look at that @natir ! In our program we have some stuff that wouldn't make sense in a public library (eg we do some 3 bit encoding as well to encode ATCG$). I'm leaning toward adding some basic encoding utilities to needletail and let people do whatever they need on the raw sequence rather than having a built-in opinionated iterator.

yiolino commented 2 years ago

@natir I'll use that. Thank you!

natir commented 2 years ago

Sure @Keats my message is more for @tetsuro90 than you, maybe kmers match to @tetsuro90 requirement.

Keats commented 2 years ago

Yeah I understood, I just meant that I ned to look at kmers before doing changes to needletail to see if we can consolidate somehow