refresh-bio / KMC

Fast and frugal disk based k-mer counter

add option to count each kmer at most once per read #74

Open notestaff opened 6 years ago

notestaff commented 6 years ago

Currently, a kmer that occurs twice in the same read gets a count of 2, the same as a kmer that occurs once in each of two different reads. But the latter kmer is more trustworthy, since it was observed in two independent reads. Likewise, if the kmers come not from reads but from assembled sequences of a given taxon, a kmer occurring in multiple sequences indicates conservation, while multiple occurrences of a kmer in one sequence indicate a repeat.
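
To make the difference concrete, here is a minimal Python sketch of the two counting modes. It is only an illustration of the requested semantics, not how KMC works internally: the function name `count_kmers` and the `once_per_read` flag are hypothetical, everything is held in memory, and reverse complements are ignored.

```python
from collections import Counter


def count_kmers(reads, k, once_per_read=False):
    """Count k-mers across reads (hypothetical helper, not part of KMC).

    With once_per_read=True, each read contributes at most 1 to the
    count of any given k-mer, so the result is the number of reads
    that contain the k-mer rather than its total number of occurrences.
    """
    counts = Counter()
    for read in reads:
        # All k-length windows of the read; empty if the read is shorter than k.
        kmers = (read[i:i + k] for i in range(len(read) - k + 1))
        if once_per_read:
            # Collapse repeated occurrences within this read.
            kmers = set(kmers)
        counts.update(kmers)
    return counts


# ACG occurs twice in the first read; TTG occurs once in each of the other two.
reads = ["ACGACG", "TTGCA", "GGTTG"]
total = count_kmers(reads, 3)
per_read = count_kmers(reads, 3, once_per_read=True)
print(total["ACG"], total["TTG"])        # 2 2
print(per_read["ACG"], per_read["TTG"])  # 1 2
```

With the current behaviour both kmers get a count of 2; with the proposed option only TTG, seen in two independent reads, keeps a count of 2.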

marekkokot commented 6 years ago

Such functionality was implemented in KMC1.0 (2014-03-28, http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=kmc&subpage=news) and was later dropped. The algorithm was fundamentally changed in KMC2, and this functionality was not easy to implement in the new version; we were also not sure whether it was really needed. If you need this now, you may use KMC1.0 (its database format is compatible, or almost compatible, with newer versions of KMC), but KMC1 is much slower and requires much more disk space (especially for larger k values). We may reconsider adding this functionality in KMC3, but there are some technical problems, and I am not sure when we will be able to do it.

Edit: BTW, do you know of any k-mer counter that supports such functionality? When I implement it, I would like to compare performance.