Closed jamshed closed 2 years ago
Hello Jamshed,
I have tried to reproduce this using the current master branch, but it seems the results are correct:
KMC/bin/kmc -t16 -k1000 -ci1 -fm GCF_000005845.2_ASM584v2_genomic.fna o .
**
Stage 1: 100%
Stage 2: 100%
1st stage: 0.225505s
2nd stage: 6.39249s
Total : 6.618s
Tmp size : 6MB
Stats:
No. of k-mers below min. threshold : 0
No. of k-mers above max. threshold : 0
No. of unique k-mers : 4626637
No. of unique counted k-mers : 4626637
Total no. of k-mers : 4640653
Total no. of sequences : 1
Total no. of super-k-mers : 22200
Could you please show me the full command line and also confirm if you are using the same code as I do (current master branch).
Best, Marek
Hi Marek,
I'm using the latest code from the master branch. I'm using the following command:
k=1001; /usr/bin/time ./KMC/bin/kmc -v -k$k -fm -ci1 -t1 ./ecoli.fna /mnt/scratch4/jamshed/ecoli.k$k.kmc ./temp/
This is the output log:
**
********** Used parameters for Stage 1 : **********
No. of input files : 1
Output file name :
No. of working directories : 1
Input format : MULTI LINE FASTA
Output format : KMC
k-mer length : 1001
Max. k-mer length : 1024
Signature length : 9
Both strands : true
RAM only mode : false
******* Stage 1 configuration: *******
No. of bins : 512
Bin part size : 65536
Input buffer size : 16777216
No. of readers : 1
No. of splitters : 1
Max. mem. size : 12000MB
Max. mem. per storer : 6192MB
Max. mem. for single package : 24MB
Max. mem. for PMM (bin parts): 9526MB
Max. mem. for PMM (FASTQ) : 1650MB
Max. mem. for PMM (reads) : 1MB
Max. mem. for PMM (b. reader): 805MB
Stage 1: 100%
********** Used parameters for Stage 2 : **********
Min. count threshold : 1
Max. count threshold : 1000000000
Max. counter value : 255
******* Stage 2 configuration: *******
No. of threads : 1
Max. mem. for 2nd stage : 1098MB
Stage 2: 100%
1st stage: 0.88179s
2nd stage: 7.46185s
Total : 8.34364s
Tmp size : 6MB
Stats:
No. of k-mers below min. threshold : 0
No. of k-mers above max. threshold : 0
No. of unique k-mers : 3951266
No. of unique counted k-mers : 3951266
Total no. of k-mers : 3951267
Total no. of sequences : 1
Total no. of super-k-mers : 22181
8.17user 1.45system 0:08.73elapsed 110%CPU (0avgtext+0avgdata 917660maxresident)k
0inputs+1952800outputs (0major+379890minor)pagefaults 0swaps
Hi,
thanks. I think I know the cause that I have different results.
I modified kmc_core/defs.h
to support larger k, while you did it by compiler flags.
I'm not sure why your approach does not work, I mean there must be a bug somewhere in the code.
Anyway, could you please alter the appropriate #define
in kmc_core/defs.h
?
Hello Marek,
Thanks. I made a clean installation by directly setting MAX_K
inside the source code, and it seems to work correctly! But when I set it using make CFLAGS=-DMAX_K=1024
, the behavior becomes buggy.
Hello Jamshed,
it seems the bug was quite complex to find, but I think I have it now and it's fixed with 726ecbf75a71c9826a41f138da3635e81ba93895. Let me know if it works, and if it does please close the issue.
Thanks Marek! It works correctly on some example datasets where the earlier commits produced incorrect results.
Hello Marek,
I was wondering if KMC could be used to enumerate very large
k
-mers, e.g. withk
>= 1000. The need arose from this issue: https://github.com/COMBINE-lab/cuttlefish/issues/22.To support KMC upto
k
= 1024, I've compiled it usingmake CFLAGS=-DMAX_K=1024
. Now, some test results for the E. Coli genome reference:With
k
= 31:With
k
= 501:With
k
= 1001:Note that, we can compute the total length of the input sequence from the KMC stats results as the following quantity:
Total no. of k-mers + Total no. of sequences * (k - 1)
. Withk
= 31, 501, and 1001, we get the input lengths as 4641652, 4641652, and 3952267. If the input sequences do not have anyN
/ indeterminate characters, then this quantity should be the same for differentk
's. And we observe a mismatching value withk
= 1001.On top of that, I'm also observing different (and indeterminate) results for
k
= 1001 if the number of threads is increased beyond 1. For example, the following are two different runs' results with-t16
:I wonder if you might have any insights on this behavior?
Regards!