refresh-bio / KMC

Fast and frugal disk based k-mer counter

consistency check of the input format #183

Open tlemane opened 2 years ago

tlemane commented 2 years ago

Hello,

Thank you for developing kmc.

I ran into an issue today before I realized I was using the wrong flag. When using -fa instead of -fm, kmc (v3.2.1) runs smoothly but obviously produces incorrect results.

Here is an example on a multiline fasta:

-fa:

Total    : 14.8039s
Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :           45
   No. of unique counted k-mers       :           45
   Total no. of k-mers                :           45
   Total no. of reads                 :            1
   Total no. of super-k-mers          :            7

-fm:

Total    : 31.5231s
Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :    691228814
   No. of unique counted k-mers       :    691228814
   Total no. of k-mers                :   2136937309
   Total no. of sequences             :     28120374
   Total no. of super-k-mers          :    255510539

I think it could be useful to add a quick consistency check.

Best, Téo

marekkokot commented 2 years ago

Hi,

sorry for the very late response. Indeed, there are some issues related to the parsing :( We plan to rebuild this module and get rid of such bugs. Thank you anyway for reporting. KMC should report the wrong input format in this case. The problem is that the parsing module is so highly optimized that even adding a simple check is kinda risky. Rebuilding the parsing is high on our priority list, but there are things that are higher. Anyway, thanks again.

tlemane commented 2 years ago

Hi,

Yes, unfortunately safe parsing probably involves an undesirable overhead. I was thinking more of a quick consistency check on a few lines before parsing. Of course, this doesn't guard against ill-formed files, but it does prevent some incorrect usage.
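To make the idea concrete, here is a minimal sketch of such a pre-parse check. It is not KMC code: the function names `looks_multiline_fasta` and `check_format_flag` and the `max_lines` parameter are hypothetical, and it assumes plain-text (uncompressed) input. It only looks at the first lines and flags the one case from this report, `-fa` used on a FASTA file whose records span several lines.

```python
from itertools import islice
from typing import Iterable


def looks_multiline_fasta(lines: Iterable[str], max_lines: int = 1000) -> bool:
    """Return True if, within the first `max_lines` lines, some FASTA
    record has more than one sequence line after its '>' header."""
    run = 0            # consecutive sequence lines since the last header
    seen_header = False
    for raw in islice(lines, max_lines):
        line = raw.rstrip("\r\n")
        if not line:
            continue   # ignore blank lines
        if line.startswith(">"):
            seen_header = True
            run = 0
        elif seen_header:
            run += 1
            if run > 1:
                return True
    return False


def check_format_flag(flag: str, lines: Iterable[str], max_lines: int = 1000) -> None:
    """Raise ValueError when `-fa` is given but the head of the input
    looks like multi-line FASTA. Other combinations are not checked."""
    if flag == "-fa" and looks_multiline_fasta(lines, max_lines):
        raise ValueError(
            "input looks like multi-line FASTA; did you mean -fm instead of -fa?"
        )
```

A caller could run `check_format_flag("-fa", open(path))` once before handing the file to the optimized parser. Because only a bounded number of lines is read, the cost is negligible, at the price of missing files that only become inconsistent later on.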

marekkokot commented 2 years ago

Hi,

this is a nice tradeoff. I will keep this issue open to be sure that the problem is fixed after we rebuild the input reading module. If it is not, we will probably implement such a simple verification. Thanks!