refresh-bio / KMC

Fast and frugal disk based k-mer counter

canonical k mers #163

Open augustkx opened 3 years ago

augustkx commented 3 years ago

For canonical k-mers, which of the two k-mers is dropped? For example, with AAA and TTT, does KMC add all TTT counts to AAA, or vice versa? What's the rule? Thanks!

marekkokot commented 3 years ago

In KMC, canonical means the lexicographically smaller of a k-mer and its reverse complement, so in your example, AAA and TTT are both treated as AAA.
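The rule above can be sketched in a few lines of Python. This is only an illustration of the "lexicographically smaller of a k-mer and its reverse complement" definition, not KMC's actual implementation; the function names `reverse_complement` and `canonical` are made up for this example, and it assumes uppercase input over the alphabet A, C, G, T.

```python
# Illustrative sketch only; not KMC's internal code.
# Assumes uppercase k-mers containing only A, C, G, T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of the k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


if __name__ == "__main__":
    print(canonical("TTT"))  # AAA
    print(canonical("AAA"))  # AAA
    print(canonical("CGT"))  # ACG
```

With this definition, AAA and TTT map to the same representative, so their occurrences are counted together under AAA.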

augustkx commented 3 years ago

Thanks for your quick response! Can I ask another question? When I use something like:

kmc -k8 -m24 -fm -ci0 -cs1677215 1280.29965.fna NA.res cano8mer
kmc_dump -ci0 -cs1677215 NA.res merge_8mers_1280.29965.txt

The -ci0 setting does not seem to work, as the output does not include k-mers with a count of 0.

marekkokot commented 3 years ago

Well, the rationale is that KMC counts only k-mers that are present in the input. One may argue that with -ci0, absent k-mers should also appear in the output with count 0. That would work for small k (as in your case, where k-mer counting is not computationally costly at all; in fact, KMC has a special procedure for small k).

In most practical applications, however, k is much greater (typically 20 or 30, but there are also applications with k around 60). The number of all possible k-mers is 4^k (or roughly half of that for canonical k-mers), which becomes a very large number for higher k. For example, for k=30 there are 1152921504606846976 possible k-mers, roughly 1 exa. In a dump each letter occupies one byte, so listing all k-mers would take about 30 exabytes on disk (in fact even more, because the counts must be stored too), not to mention the time required to generate such a file.

So let me reverse your question: why do you need zero counts? What's your use case?
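The arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and the size estimate follows the same simplification as the comment: one byte per letter, ignoring counts, separators, and newlines. The exact canonical count used here is a standard combinatorial fact rather than something stated in this thread: for odd k it is exactly 4^k / 2, and for even k it is (4^k + 4^(k/2)) / 2, because 4^(k/2) k-mers equal their own reverse complement.

```python
# Back-of-the-envelope sketch; names are illustrative, not part of KMC.

def total_kmers(k: int) -> int:
    """Number of all possible k-mers over the alphabet A, C, G, T."""
    return 4 ** k


def canonical_kmers(k: int) -> int:
    """Number of distinct canonical k-mers.

    For odd k no k-mer is its own reverse complement, so the count is 4^k / 2.
    For even k there are 4^(k/2) self-reverse-complement k-mers.
    """
    if k % 2 == 1:
        return 4 ** k // 2
    return (4 ** k + 4 ** (k // 2)) // 2


def dump_letter_bytes(k: int, n_kmers: int) -> int:
    """Lower bound on text dump size: one byte per letter, counts not included."""
    return n_kmers * k


if __name__ == "__main__":
    for k in (8, 30):
        n_all = total_kmers(k)
        n_canon = canonical_kmers(k)
        print(f"k={k}: all={n_all}, canonical={n_canon}, "
              f"letters only, all k-mers: {dump_letter_bytes(k, n_all)} bytes")
```

For k=8 the full space is only 65536 k-mers (32896 canonical), which is why listing zero counts is feasible there, while for k=30 the letters alone would take more than 3 × 10^19 bytes.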