Closed mkellom closed 2 years ago
Hello,
thanks for using KMC and reporting this issue. I tried to reproduce this on some random fastq file I have, but I get all 136 tetramers:
../KMC3.1.1.linux/kmc_tools -t60 -v transform o dump o.dump
in1: 100%
wc -l o.dump
136 o.dump
Maybe it is somehow data-specific. Would it be possible to share your input data (or, ideally, the smallest part of it causing the issue)?
Sure, but I don't know what part of the input is causing the issue. Here is a download link for the fastq.gz input file as well as the outputs: https://drive.google.com/drive/folders/1VQ0vx05Dx34W5n9J1Qfv9nlKhLUjajnR?usp=sharing
Thank you!
Hi,
thanks for the file. The last commit should fix the issue. In fact, if you had used -cx4294967295 it would also have worked; of course, it should work for -cx1000000000000000 as well, and now it does.
I would like to keep this issue open because it is related to a more complex design issue.
I will write info for myself when I have time to fix it more elegantly.
For small k the counter may be larger than 4 bytes. kmc_tools currently does not support that. Although such huge counters will rarely be needed in practice, it may be worth redesigning the types used for cutoffs and for counters. Currently there is a not-very-nice workaround for cutoff_max: in the kmc_pre file, cutoff_max is divided into hi and lo parts. The lo part is stored as described in the documentation; the hi part may be set to non-zero values only when the small-k optimization is on, in which case the first 4 bytes of uint32_t tmp[6] are used to store the hi part (this is not documented!). A redesign here would probably require changing the database format (turning uint32_t types into uint64_t). In the implementation, it may be worth using templates for the counter type to avoid risking performance differences.
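The hi/lo workaround described above can be sketched roughly as follows. This is a hypothetical illustration, not KMC's actual header code; the struct, field, and function names are invented, and the real on-disk layout of the kmc_pre header may differ:

```cpp
#include <cstdint>

// Hypothetical sketch: a 64-bit cutoff_max split into two 32-bit halves,
// as in the workaround described above. The lo half goes into the documented
// cutoff_max field; the hi half is stashed in the first 4 bytes of
// uint32_t tmp[6] when the small-k optimization is on.
struct CutoffFields {
    uint32_t lo;  // documented 32-bit cutoff_max field
    uint32_t hi;  // undocumented extension, stored in tmp[0]
};

CutoffFields split_cutoff(uint64_t cutoff_max) {
    return { static_cast<uint32_t>(cutoff_max & 0xFFFFFFFFu),
             static_cast<uint32_t>(cutoff_max >> 32) };
}

uint64_t join_cutoff(CutoffFields f) {
    return (static_cast<uint64_t>(f.hi) << 32) | f.lo;
}
```

A reader that only looks at the lo field (as kmc_tools did) recovers the wrong cutoff whenever hi is non-zero, which is exactly the failure mode discussed in this thread.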
Thank you! If there was an issue with counters for small k, does that mean the counts in runs where all 136 tetramers were reported may also be incorrect?
That's unlikely. The problem was not with KMC itself, but with kmc_tools, which was reading the cutoff as 2,764,472,320 instead of 1,000,000,000,000,000. Since the AAAA k-mer occurs 3,482,315,130 times, it was ignored by kmc_tools when reading the KMC database.
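The mechanism behind that wrong cutoff is plain unsigned narrowing: storing a 64-bit value in a 32-bit field keeps only the value modulo 2^32. A minimal illustration (not KMC's actual code; the function name is invented):

```cpp
#include <cstdint>

// Hypothetical illustration of the bug described above: a 64-bit cutoff
// stored in a 32-bit field is silently reduced modulo 2^32.
uint32_t truncate_cutoff(uint64_t requested) {
    // 1000000000000000 mod 4294967296 == 2764472320
    return static_cast<uint32_t>(requested);
}
```

With this, truncate_cutoff(1000000000000000ULL) yields 2764472320, and since the AAAA count of 3,482,315,130 exceeds that truncated cutoff, the k-mer is skipped as "above max. threshold" during the dump.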
Ah I see. Thank you!
Do the binaries at http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=kmc&subpage=download contain the high cutoff update?
Nope. In general, you should rather use the GitHub releases page: https://github.com/refresh-bio/KMC/releases/ but it does not contain the updated version either. For now, it would be best to compile from sources. I will create a new release, but there are other improvements that I'm planning for the (I hope) near future, which is why I'm not creating a new release right now. If you cannot compile on your own, let me know and I will attach a compiled version here.
Got it, thank you again.
I wanted to confirm for you that the fix does indeed output all 136 tetramers for the ~50 metagenomes that were originally dumping less.
Great! Thanks!
Hi, I'm using KMC to get tetramer counts from metagenomes. After using kmc_tools dump, I seem to be missing some tetramers. These are huge filtered read files from metagenomes, so each of the 136 canonical tetramers should be present, but I occasionally get only 135, 134, etc. I have increased the -cs and -cx flags to well above what the counts should be. I'm using these commands:
kmc -v -k4 -m120 -t61 -cs1000000000000000 -cx1000000000000000 INPUT.fastq.gz OUTPUT_FILENAME WORKING_DIR
kmc_tools -t60 -v transform INPUT_FILE dump OUTPUT_FILE
This is the output I get from the kmc command:
Info: Small k optimization on!
configuration for small k mode:
No. of input files            : 1
Output file name              : OUTPUT_FILENAME
Input format                  : FASTQ
k-mer length                  : 4
Max. k-mer length             : 256
Min. count threshold          : 2
Max. count threshold          : 1000000000000000
Max. counter value            : 1000000000000000
Both strands                  : true
Input buffer size             : 33554432
No. of readers                : 1
No. of splitters              : 60
Max. mem. size                : 120000MB
Max. mem. for PMM (FASTQ)     : 5278MB
Part. mem. for PMM (FASTQ)    : 33MB
Max. mem. for PMM (reads)     : 94MB
Part. mem. for PMM (reads)    : 0MB
Max. mem. for PMM (b. reader) : 402MB
Part. mem. for PMM (b. reader): 134MB
Stage 1: 100%
1st stage: 947.269s
2nd stage: 0.049529s
Total    : 947.318s
Tmp size : 0MB
Stats:
No. of k-mers below min. threshold : 0
No. of k-mers above max. threshold : 0
No. of unique k-mers               : 136
No. of unique counted k-mers       : 136
Total no. of k-mers                : 147308145862
Total no. of reads                 : 1002712746
Total no. of super-k-mers          : 0
It seems that KMC is finding 136 unique tetramers, but they aren't all getting dumped. I'm not sure why. Any ideas?
Thank you!