refresh-bio / KMC

Fast and frugal disk based k-mer counter
277 stars 72 forks source link

Bug with extra large `k`-mers #204

Closed jamshed closed 2 years ago

jamshed commented 2 years ago

Hello Marek,

I was wondering if KMC could be used to enumerate very large k-mers, e.g. with k >= 1000. The need arose from this issue: https://github.com/COMBINE-lab/cuttlefish/issues/22.

To support KMC upto k = 1024, I've compiled it using make CFLAGS=-DMAX_K=1024. Now, some test results for the E. Coli genome reference:

Note that, we can compute the total length of the input sequence from the KMC stats results as the following quantity: Total no. of k-mers + Total no. of sequences * (k - 1). With k = 31, 501, and 1001, we get the input lengths as 4641652, 4641652, and 3952267. If the input sequences do not have any N / indeterminate characters, then this quantity should be the same for different k's. And we observe a mismatching value with k = 1001.

On top of that, I'm also observing different (and indeterminate) results for k = 1001 if the number of threads is increased beyond 1. For example, the following are two different runs' results with -t16:

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :       109417
   No. of unique k-mers               :      4028568
   No. of unique counted k-mers       :      3919151
   Total no. of k-mers                : 304178492578579
   Total no. of sequences             :            1
   Total no. of super-k-mers          :        22181
Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :        48691
   No. of unique k-mers               :      4028568
   No. of unique counted k-mers       :      3979877
   Total no. of k-mers                : 135264335877777
   Total no. of sequences             :            1
   Total no. of super-k-mers          :        22181

I wonder if you might have any insights on this behavior?

Regards!

marekkokot commented 2 years ago

Hello Jamshed,

I have tried to reproduce this using the current master branch, but it seems the results are correct:

 KMC/bin/kmc -t16 -k1000 -ci1 -fm  GCF_000005845.2_ASM584v2_genomic.fna o .
**
Stage 1: 100%
Stage 2: 100%
1st stage: 0.225505s
2nd stage: 6.39249s
Total    : 6.618s
Tmp size : 6MB

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :      4626637
   No. of unique counted k-mers       :      4626637
   Total no. of k-mers                :      4640653
   Total no. of sequences             :            1
   Total no. of super-k-mers          :        22200

Could you please show me the full command line and also confirm if you are using the same code as I do (current master branch).

Best, Marek

jamshed commented 2 years ago

Hi Marek,

I'm using the latest code from the master branch. I'm using the following command:

k=1001; /usr/bin/time ./KMC/bin/kmc -v -k$k -fm -ci1 -t1 ./ecoli.fna /mnt/scratch4/jamshed/ecoli.k$k.kmc ./temp/

This is the output log:

**

********** Used parameters for Stage 1 : **********
No. of input files           : 1
Output file name             : 
No. of working directories   : 1
Input format                 : MULTI LINE FASTA
Output format                : KMC

k-mer length                 : 1001
Max. k-mer length            : 1024
Signature length             : 9
Both strands                 : true
RAM only mode                : false

******* Stage 1 configuration: *******

No. of bins                  : 512
Bin part size                : 65536
Input buffer size            : 16777216

No. of readers               : 1
No. of splitters             : 1

Max. mem. size               : 12000MB
Max. mem. per storer         :  6192MB
Max. mem. for single package :    24MB

Max. mem. for PMM (bin parts):  9526MB
Max. mem. for PMM (FASTQ)    :  1650MB
Max. mem. for PMM (reads)    :     1MB
Max. mem. for PMM (b. reader):   805MB

Stage 1: 100%

********** Used parameters for Stage 2 : **********
Min. count threshold         : 1
Max. count threshold         : 1000000000
Max. counter value           : 255

******* Stage 2 configuration: *******
No. of threads               : 1

Max. mem. for 2nd stage      :  1098MB

Stage 2: 100%
1st stage: 0.88179s
2nd stage: 7.46185s
Total    : 8.34364s
Tmp size : 6MB

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :      3951266
   No. of unique counted k-mers       :      3951266
   Total no. of k-mers                :      3951267
   Total no. of sequences             :            1
   Total no. of super-k-mers          :        22181
8.17user 1.45system 0:08.73elapsed 110%CPU (0avgtext+0avgdata 917660maxresident)k
0inputs+1952800outputs (0major+379890minor)pagefaults 0swaps
marekkokot commented 2 years ago

Hi,

thanks. I think I know the cause that I have different results. I modified kmc_core/defs.h to support larger k, while you did it by compiler flags. I'm not sure why your approach does not work, I mean there must be a bug somewhere in the code. Anyway, could you please alter the appropriate #define in kmc_core/defs.h ?

jamshed commented 2 years ago

Hello Marek,

Thanks. I made a clean installation by directly setting MAX_K inside the source code, and it seems to work correctly! But when I set it using make CFLAGS=-DMAX_K=1024, the behavior becomes buggy.

marekkokot commented 2 years ago

Hello Jamshed,

it seems the bug was quite complex to find, but I think I have it now and it's fixed with 726ecbf75a71c9826a41f138da3635e81ba93895. Let me know if it works, and if it does please close the issue.

jamshed commented 2 years ago

Thanks Marek! It works correctly on some example datasets where the earlier commits produced incorrect results.