refresh-bio / KMC

Fast and frugal disk based k-mer counter

Missing 4mers from metagenomes #170

Closed mkellom closed 2 years ago

mkellom commented 2 years ago

Hi, I'm using KMC to get tetramer counts from metagenomes. After using kmc_tools dump, I seem to be missing some tetramers. These are huge filtered read files from metagenomes so each of the 136 canonical tetramers should be present but I occasionally get only 135, 134, etc. I have increased the -cs and -cx flags to well above what the counts should be. I'm using these commands:

kmc -v -k4 -m120 -t61 -cs1000000000000000 -cx1000000000000000 INPUT.fastq.gz OUTPUT_FILENAME WORKING_DIR
kmc_tools -t60 -v transform INPUT_FILE dump OUTPUT_FILE
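As a sanity check of the 136 figure: with both strands counted (the default), the 4^4 = 256 tetramers collapse into 136 canonical ones (120 reverse-complement pairs plus 16 palindromes). A quick sketch in Python (not part of KMC, just an illustration):

```python
from itertools import product

COMP = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMP)[::-1]
    return min(kmer, rc)

# All 256 tetramers map to 136 canonical tetramers:
# 120 reverse-complement pairs + 16 self-complementary palindromes.
canon = {canonical("".join(p)) for p in product("ACGT", repeat=4)}
print(len(canon))  # 136
```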

This is the output I get from the kmc command:

Info: Small k optimization on!

configuration for small k mode:
No. of input files : 1
Output file name : OUTPUT_FILENAME
Input format : FASTQ

k-mer length : 4
Max. k-mer length : 256
Min. count threshold : 2
Max. count threshold : 1000000000000000
Max. counter value : 1000000000000000
Both strands : true
Input buffer size : 33554432

No. of readers : 1
No. of splitters : 60

Max. mem. size : 120000MB

Max. mem. for PMM (FASTQ) : 5278MB
Part. mem. for PMM (FASTQ) : 33MB
Max. mem. for PMM (reads) : 94MB
Part. mem. for PMM (reads) : 0MB
Max. mem. for PMM (b. reader) : 402MB
Part. mem. for PMM (b. reader) : 134MB

Stage 1: 100%
1st stage: 947.269s
2nd stage: 0.049529s
Total : 947.318s
Tmp size : 0MB

Stats:
No. of k-mers below min. threshold : 0
No. of k-mers above max. threshold : 0
No. of unique k-mers : 136
No. of unique counted k-mers : 136
Total no. of k-mers : 147308145862
Total no. of reads : 1002712746
Total no. of super-k-mers : 0

It seems that KMC is finding 136 unique tetramers, but not all of them are getting dumped. I'm not sure why. Any ideas?

Thank you!

marekkokot commented 2 years ago

Hello,

thanks for using KMC and reporting this issue. I tried to reproduce this on some random fastq file I have, but I get all 136 tetramers:

../KMC3.1.1.linux/kmc_tools -t60 -v transform o dump o.dump
in1: 100%
wc -l o.dump
136 o.dump

Maybe it is somehow data-specific. Would it be possible to share your input data (or, ideally, the smallest part of it causing the issue)?

mkellom commented 2 years ago

Sure, but I don't know what part of the input is causing the issue. Here is a download link for the fastq.gz input file as well as the outputs: https://drive.google.com/drive/folders/1VQ0vx05Dx34W5n9J1Qfv9nlKhLUjajnR?usp=sharing

Thank you!

marekkokot commented 2 years ago

Hi,

thanks for the file. The last commit should fix the issue. In fact, if you used -cx4294967295 it would also have worked, although, of course, it should also work for -cx1000000000000000. Now it does.

I would like to keep this issue open because it is related to a more complex design issue; I will leave some notes for myself for when I have time to fix it more elegantly. For small k, the counter may be larger than 4 bytes, which kmc_tools currently does not support. Although counters this huge will rarely be a practical case, it may be worth redesigning the types used for cutoffs and counters. Currently there is a not-very-nice workaround for cutoff_max: in the kmc_pre file, cutoff_max is divided into hi and lo parts. The lo part is stored as in the documentation; the hi part may be set to values other than zero only when small k optimization is on, in which case the first 4 bytes of uint32_t tmp[6] are used to store the hi part (this is not documented!). A redesign here would probably require changing the database format (turning the uint32_t types into uint64_t). In the implementation, it may be worth using templates for the counter type to avoid risking performance regressions.
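The hi/lo split described above can be sketched like this (a simplified illustration in Python, not KMC's actual C++ code; the function names are hypothetical):

```python
MASK32 = 0xFFFFFFFF  # a uint32_t field holds at most 2**32 - 1

def split_cutoff(cutoff_max):
    """Split a 64-bit cutoff into the two 32-bit halves stored on disk."""
    lo = cutoff_max & MASK32          # documented lo part
    hi = (cutoff_max >> 32) & MASK32  # hi part, tucked into tmp[0]
    return hi, lo

def join_cutoff(hi, lo):
    """Recombine the halves; readers that ignore `hi` see only `lo`."""
    return (hi << 32) | lo

hi, lo = split_cutoff(1_000_000_000_000_000)
print(hi, lo)               # 232830 2764472320
print(join_cutoff(hi, lo))  # 1000000000000000
```

A reader that only looks at the lo half (as kmc_tools did) recovers 2,764,472,320 instead of the intended 10^15, which is exactly the truncation behind this issue.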

mkellom commented 2 years ago

Thank you! If there was an issue with counters for small k, does that mean that my counts where 136 tetramers were reported may be incorrect?

marekkokot commented 2 years ago

That's unlikely. The problem was not with KMC itself but with kmc_tools, which, instead of reading the cutoff as 1000000000000000, was reading it as 2,764,472,320 (the low 32 bits of that value). Since the AAAA k-mer occurs 3,482,315,130 times, it was ignored when kmc_tools read the KMC database.
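The effect is easy to reproduce in a few lines (a sketch of the truncation, not kmc_tools code):

```python
cutoff_max = 1_000_000_000_000_000
truncated = cutoff_max & 0xFFFFFFFF  # what the old kmc_tools effectively saw
print(truncated)                     # 2764472320

# AAAA occurred 3,482,315,130 times, which exceeds the truncated
# cutoff, so the k-mer was silently skipped during dump:
aaaa_count = 3_482_315_130
print(aaaa_count > truncated)        # True -> dropped from the dump
```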

mkellom commented 2 years ago

Ah I see. Thank you!

mkellom commented 2 years ago

Do the binaries at http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=kmc&subpage=download contain the high cutoff update?

marekkokot commented 2 years ago

Nope. In general, you should rather use the GitHub releases page, https://github.com/refresh-bio/KMC/releases/, but it also does not contain an updated version. For now, it would be best to compile from sources. I will create a new release, but there are other improvements that I'm planning for the (I hope) near future, which is why I am not creating one right now. If you cannot compile on your own, let me know and I will attach a compiled version here.

mkellom commented 2 years ago

Got it, thank you again.

mkellom commented 2 years ago

I wanted to confirm for you that the fix does indeed output all 136 tetramers for the ~50 metagenomes that were originally dumping fewer.

marekkokot commented 2 years ago

Great! Thanks!