seqan / seqan3

The modern C++ library for sequence analysis. Contains version 3 of the library and API docs.
https://www.seqan.de
Other
396 stars 81 forks source link

The shape size for the kmer_hash. #3242

Closed Jappy0 closed 4 months ago

Jappy0 commented 4 months ago

Hi,

Many thanks for this excellent library in computational biology. I have a question when I was using the API. Please see my question below.

Platform

Question

I wanted to hash each kmer of a sequence using the hash method in Seqan3. The used kmer size is 28, so the shape is also 28. but

"terminate called after throwing an instance of 'std::invalid_argument' what(): The chosen shape/alphabet combination is not valid. The alphabet or shape size must be reduced."

However, I read the documentation; it says the maximum size of the shape is 58.

So, is all my understanding correct? or is this a bug?

#include <seqan3/search/views/kmer_hash.hpp>

std::vector<seqan3::dna5> read{"CGCTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTGCGTCGTCATCGCTGTCATCCCACATGCCCAGCAGTGCTGATCTGGGACAGCTTGGCGTAACTAGA"_dna5};
unit8_t k=28;
for (size_t i = 0; i < read.size() - k + 1; ++i){
        auto kmer = std::vector<seqan3::dna5>(read.begin() + i, read.begin() + i + k);
        auto hash_val_range = kmer | seqan3::views::kmer_hash(seqan3::ungapped{static_cast<uint8_t>(k)});
}

Thanks for your time.

Best regards,

Jappy

SGSSGene commented 4 months ago

Hi, thank you for reaching out.

As far as I see there are two sources:

  1. Probably the one you are referring to Shape
  2. An one inside the kmer_hash Which warns against the exact exception you get.

@eseiler and @smehringer This seems to be much more your field of expertise than mine. What do you think about this?

Complete working example, including the exception being thrown: ```c++ #include #include using namespace seqan3::literals; int main() { std::vector read{"CGCTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTGCGTCGTCATCGCTGTCATCCCACATGCCCAGCAGTGCTGATCTGGGACAGCTTGGCGTAACTAGA"_dna5}; uint8_t k=28; for (size_t i = 0; i < read.size() - k + 1; ++i){ auto kmer = std::vector(read.begin() + i, read.begin() + i + k); auto hash_val_range = kmer | seqan3::views::kmer_hash(seqan3::ungapped{static_cast(k)}); } } ```
SGSSGene commented 4 months ago

Independent of the documentation.

A k-mer over dna5 will require at least log_2(5) ≈ 2.3219 bits, which means shapes with max 27 ones are possible. If you changed it to dna4, you could go up to 32 ones inside your shape.

Jappy0 commented 4 months ago

Independent of the documentation.

A k-mer over dna5 will require at least log_2(5) ≈ 2.3219 bits, which means shapes with max 27 ones are possible. If you changed it to dna4, you could go up to 32 ones inside your shape.

Thanks so much for the explanation. I now have a better understanding.