Open yiolino opened 2 years ago
Hi,
We probably are going to revamp the bit kmers we have as we just built a pretty huge project using needletail and ended up re-implementing the bit encoding to be more flexible. You're right that it would need u128 instead of u64 for storing kmers > 31bp. For now I would recommend writing your own 2bit encoding function returning a u128 and do not use the built-in bit kmers iterator.
@Keats Thank you for your reply.
I understand and looking forward to re-implementing the bit encoding!
Regards,
You can check kmers bit encoding https://github.com/COMBINE-lab/kmers
I will have to take a look at that @natir ! In our program we have some stuff that wouldn't make sense in a public library (eg we do some 3 bit encoding as well to encode ATCG$). I'm leaning toward adding some basic encoding utilities to needletail and let people do whatever they need on the raw sequence rather than having a built-in opinionated iterator.
@natir I'll use that. Thank you!
Sure @Keats my message is more for @tetsuro90 than you, maybe kmers match to @tetsuro90 requirement.
Yeah I understood, I just meant that I ned to look at kmers before doing changes to needletail to see if we can consolidate somehow
Thank you for great software.
I want to use
bit_kmer
function to count k-mer in fastq file. But, if k-mer length is longer than 32 bases, it will panic.I think this is because the bit_kmer sequence is represented as a u64 type (
type BitKmerSeq = u64
), but is there a method to perform k-mer counts over 32 bp?I am using HashMap as a database for k-mer counts, but if I use fastq files as input, HashMap becomes too large. So, I would like to use bit_kmer to reduce it as much as possible. Is there an alternative method that could be considered, such as "use u128 type"?
Regards, tetsuro90