pmelsted / bifrost

Bifrost: Highly parallel construction and indexing of colored and compacted de Bruijn graphs
BSD 2-Clause "Simplified" License

KmerIterator undefined behaviour if string starts with N #60

Closed by kreldjarn 2 years ago

kreldjarn commented 2 years ago

When iterating over a string that contains an N, the KmerIterator should skip past the N and resume iterating from the next valid position. This works in the general case, but if the string starts with an N, the KmerIterator returns a Kmer consisting only of A's, even though the next Kmer is a valid one containing no N's.

A reproducible example:

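#include <iostream>
#include <string>

#include <bifrost/CompactedDBG.hpp>

int main() {
    CompactedDBG<void> dbg(31);  // k = 31 (the iterators below yield 31-mers)

    // The two sequences differ only in where the N sits: position 0 vs. position 1.
    std::string seq1 = "NCATCACACACAGGGCTATTCCTTTCCTCCAATGAACCAA";
    std::string seq2 = "CNATCACACACAGGGCTATTCCTTTCCTCCAATGAACCAA";

    // Leading N: the first k-mer comes back as all A's (the bug).
    KmerIterator kit1(seq1.c_str());
    std::cout << kit1->first.toString() << std::endl;
    // AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

    // N at position 1: the iterator skips it and yields the first valid k-mer.
    KmerIterator kit2(seq2.c_str());
    std::cout << kit2->first.toString() << std::endl;
    // ATCACACACAGGGCTATTCCTTTCCTCCAAT

    return 0;
}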
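
For reference, the behaviour described above (skip any window containing a non-ACGT character, including one at position 0) can be sketched in plain C++ without Bifrost. This is only an illustration under assumptions: `valid_kmers` is a hypothetical helper, not part of the Bifrost API, and it says nothing about how KmerIterator is actually implemented.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not Bifrost code): returns every k-mer of `seq` that
// consists only of A/C/G/T, paired with its start position in `seq`.
std::vector<std::pair<std::string, std::size_t>>
valid_kmers(const std::string& seq, std::size_t k) {
    std::vector<std::pair<std::string, std::size_t>> out;
    if (k == 0) return out;

    std::size_t run = 0;  // length of the current run of valid bases
    for (std::size_t i = 0; i < seq.size(); ++i) {
        const char c = seq[i];
        const bool ok = (c == 'A' || c == 'C' || c == 'G' || c == 'T');
        run = ok ? run + 1 : 0;  // any other character (e.g. N) resets the run
        if (run >= k) {
            // The last k bases are all valid, so the window ending at i is a k-mer.
            const std::size_t start = i + 1 - k;
            out.emplace_back(seq.substr(start, k), start);
        }
    }
    return out;
}
```

With this sketch, a leading N is handled the same way as an N anywhere else: for seq1 above and k = 31, the first element is the 31-mer starting at position 1, and for seq2 it is the one starting at position 2.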

kreldjarn commented 2 years ago

Should be fixed in 4b2917b

GuillaumeHolley commented 2 years ago

Hi @kreldjarn,

This is a very good catch, thanks! The issue stems from a modification I made some time ago to the way the Kmer object is initialized and checked for "emptiness". I'll approve your PR ASAP.

Guillaume
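
A note on why an "empty" k-mer can show up as all A's: under a 2-bit packing in which A is encoded as 0, a zero-initialized k-mer decodes to a string of A's, so emptiness cannot be told apart from a real poly-A k-mer by looking at the bits alone. The sketch below assumes the common encoding A=0, C=1, G=2, T=3 and k ≤ 32; `encode_kmer` and `decode_kmer` are hypothetical names, not Bifrost's Kmer class, and the thread does not confirm that this is the exact mechanism behind the bug.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical 2-bit packing (A=0, C=1, G=2, T=3); not Bifrost's Kmer class.
// Assumes s.size() <= 32 so the k-mer fits in 64 bits.
std::uint64_t encode_kmer(const std::string& s) {
    std::uint64_t bits = 0;
    for (char c : s) {
        std::uint64_t code = 0;  // 'A' (and, in this sketch, anything unrecognised) maps to 0
        if (c == 'C') code = 1;
        else if (c == 'G') code = 2;
        else if (c == 'T') code = 3;
        bits = (bits << 2) | code;
    }
    return bits;
}

// Decodes the k lowest 2-bit groups of `bits` back into a string of length k.
std::string decode_kmer(std::uint64_t bits, std::size_t k) {
    static const char alphabet[] = {'A', 'C', 'G', 'T'};
    std::string s(k, 'A');
    for (std::size_t i = 0; i < k; ++i) {
        s[k - 1 - i] = alphabet[bits & 3u];  // the last base sits in the lowest two bits
        bits >>= 2;
    }
    return s;
}
```

Here `decode_kmer(0, 31)` yields 31 A's, the same string a genuine poly-A 31-mer would produce, which is why an implementation under this kind of encoding needs a separate flag or reserved value to mark a k-mer as empty.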