refresh-bio / KMC

Fast and frugal disk based k-mer counter
253 stars 73 forks source link

is_allowed question and possible bug #152

Open jhh67 opened 3 years ago

jhh67 commented 3 years ago

I was looking at the is_allowed() method in mmer.h trying determine its purpose. The following line has a bug either in the code or the comment, not sure which. As it is the conditional matches against TT*.

if ((mmer & 0x3c) == 0x3c) // TG* suffix

should be

if ((mmer & 0x3c) == 0x38) // TG* suffix

marekkokot commented 3 years ago

You are right, it is a bug, but its consequences are not so obvious. In fact, filtering out some suffixes of m-mers was left in the code accidentally. We tried to remove the suffix filtering completely but it turned out, super k-mers are better distributed when we filter out suffixes. I added a comment to this line, but I would not like to change either code or comment. As far as I remember (this is an old code now) we meant TG* as a comment suggest, but if I would change the code if would introduce the following issues:

jhh67 commented 3 years ago

Thanks for the quick response. I was looking through the code to figure out how kmers are assigned to bins and came across the bug when I added some print statements. I figured it didn't affect correctness, only efficiency. Do you have any documentation on the code itself other than the comments?

John

On Wed, Sep 30, 2020 at 11:55 PM marekkokot notifications@github.com wrote:

You are right, it is a bug, but its consequences are not so obvious. In fact, filtering out some suffixes of m-mers was left in the code accidentally. We tried to remove the suffix filtering completely but it turned out, super k-mers are better distributed when we filter out suffixes. I added a comment to this line, but I would not like to change either code or comment. As far as I remember (this is an old code now) we meant TG* as a comment suggest, but if I would change the code if would introduce the following issues:

  • it is not clear if the code would work better or worse (I suspect the change would be meaningless, but it should be tested)
  • KMC databases (counted k-mers) created before the hypothetical bug fix couldn't be queried with a new (fixed) version of KMC API. For those reasons (an lack of time to inspect if deeper) I decided to only add a comment. The important thing is that this bug does not influence the result, i.e. the result is correct (in the sense that all k-mers that should be counted are counted and the counts are also valid). Anyway, thanks for noticing this, it seems you are quite deep in our code if you spotted such an issue. In case of any questions, don't hesitate to ask.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/refresh-bio/KMC/issues/152#issuecomment-701929478, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAYK4ZGJ77VFLJS73FATHYDSIQRXTANCNFSM4R7W6P6A .

marekkokot commented 3 years ago

I am afraid the code is not well documented. We have only user documentation for API and about database format, and user documentation for kmc_tools. But if you have any questions about code try to ask here maybe I will be able to help.