refresh-bio / KMC

Fast and frugal disk based k-mer counter

CheckKmer incorrectly returns false if query is noncanonical. #150

tbenavi1 closed 4 years ago

tbenavi1 commented 4 years ago

Hello, I have been using the C++ API to determine whether a kmer is present in my KMC database. However, the CheckKmer function will return false if the queried kmer is noncanonical.

For example, if I create a kmer database from the read "ACATTTCATTA" with -k5 and -ci1, and query for the kmer "ACATT", CheckKmer returns false, even though the kmer should be present in the database.

It would be helpful if CheckKmer could correctly account for noncanonical kmers, so that I can query for kmers from a read without having to canonicalize them.

Thanks for your assistance.

marekkokot commented 4 years ago

Hi, you are right. CheckKmer always looks up the k-mer in exactly the form passed as a parameter. I am reluctant to change this behavior because of possible backward-compatibility issues.

I could create a function like CheckKmerCanonical that would first canonicalize the given k-mer (you can also do this yourself before calling CheckKmer).

There is also a better way. The KMC API has a GetCountersForRead method that checks whether each k-mer of a read is in canonical form and transforms it if needed (unless you used the -b switch during k-mer counting). A single call returns a vector of counters for the given read. Code example:

    CKMCFile file;
    file.OpenForRA("o"); //assume success
    std::vector<uint32_t> v;
    file.GetCountersForRead("ACATTTCATTA", v);

    for (auto c : v)
        std::cout << c << " ";

Let me know if it fits your use case.

tbenavi1 commented 4 years ago

Thank you! This works for my use case. The only thing I would suggest is adding a note to API.pdf for CheckKmer that describes its behavior for noncanonical kmers.