refresh-bio / KMC

Fast and frugal disk based k-mer counter
non-canonical forward kmer count is different than kmer count of forward+reverse #89

Closed tolot27 closed 5 years ago

tolot27 commented 6 years ago

Why is the non-canonical k-mer count (parameter -b) of a sequence different from the canonical k-mer count of the sequence plus its reverse complement? According to #46, the reverse complement is calculated for each k-mer.

marekkokot commented 6 years ago

Hi,

Since I don't know your input or the exact command line you used, I will try to explain what may cause this in general.

Possible reason 1: By default, KMC drops k-mers that occur fewer than twice (this may be overridden with the -ci parameter). Let's assume the input contains the following k-mers:

ACCA
ACCA
TGGT

If you run KMC without -b option it will count

ACCA 3

If -b is used it would produce

ACCA 2

(Notice that TGGT 1 will be absent, since it occurs only once in the input.)

Possible reason 2: By default, KMC stores k-mer counters in one byte, which means the maximum value of a counter is 255 (this may be overridden with the -cs parameter). Now consider an input file containing 1000 k-mers AAAA and 700 k-mers TTTT. If you run KMC without -b, the result is:

AAAA 255

If you run KMC with -b parameter the result will be:

AAAA 255
TTTT 255
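The two effects above can be reproduced with a small toy model. This is only a sketch of the counting rules as described in this comment, not KMC's implementation; the function names and the `min_count` / `max_count` parameters are made up for illustration and merely stand in for -ci and -cs.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement each base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def count_kmers(kmers, canonical_mode=True, min_count=2, max_count=255):
    """Toy counter: min_count mimics -ci, max_count mimics -cs (assumed names)."""
    counts = Counter(canonical(k) if canonical_mode else k for k in kmers)
    # Drop k-mers below the cutoff and saturate counters at the maximum.
    return {k: min(c, max_count) for k, c in sorted(counts.items()) if c >= min_count}


# Reason 1: the lower cutoff removes TGGT when k-mers are not merged.
reason1 = ["ACCA", "ACCA", "TGGT"]
print(count_kmers(reason1, canonical_mode=True))   # {'ACCA': 3}
print(count_kmers(reason1, canonical_mode=False))  # {'ACCA': 2}

# Reason 2: counters saturate at 255 in both modes.
reason2 = ["AAAA"] * 1000 + ["TTTT"] * 700
print(count_kmers(reason2, canonical_mode=True))   # {'AAAA': 255}
print(count_kmers(reason2, canonical_mode=False))  # {'AAAA': 255, 'TTTT': 255}
```

Calling the toy counter with `min_count=1` and a large `max_count` makes the two modes agree again on these inputs (3 = 2 + 1 and 1700 = 1000 + 700), which mirrors the -ci1 / high -cs check suggested in this comment.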

There may be other reasons, but I can't think of any right now. If you run KMC with -ci1 and -cs set to some high value and it still behaves the way you described, it may in fact be a bug. In that case, could you please provide an example input file and the command line that causes this?

notestaff commented 6 years ago

Kmers that are reverse complements of themselves are counted twice with -b but once without, right?

marekkokot commented 6 years ago

@notestaff No, they are not. Should they? Why?

marekkokot commented 5 years ago

Due to lack of activity, I am closing this issue. If needed, please reopen it and supply some more information.

albanmathieu-pro commented 1 year ago

Hi @marekkokot, just reopening because I have a general question about the canonical form of a k-mer. I want to compare two KMC result datasets: does KMC always keep the same canonical form? Or is there a risk that dataset 1 has one form of the k-mer and the other dataset has the second form?

Thanks

albanmathieu-pro commented 1 year ago

OK, I found it in the first publication: "Usually we should not distinguish between a k-mer and its reversed complement, and by the “canonical k-mer” we will mean the lexicographically smaller of the two."


marekkokot commented 1 year ago

In the current release, the canonical k-mer is always the lexicographically smaller of the k-mer and its reverse complement, so yes, both datasets will use the same form. This may change in future releases.
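For anyone comparing two result sets outside of KMC, here is a minimal sketch of re-canonicalizing the keys before comparing, so the comparison does not depend on which orientation each dataset happens to store. It assumes the counts have already been loaded into plain Python dicts; `normalize`, `dataset1` and `dataset2` are hypothetical names, and the dicts are not KMC's database format.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement each base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def normalize(counts: dict) -> dict:
    """Re-key a {k-mer: count} dict by canonical form, summing merged entries."""
    out = {}
    for kmer, count in counts.items():
        key = canonical(kmer)
        out[key] = out.get(key, 0) + count
    return out


# Hypothetical example data: the two dicts store opposite orientations.
dataset1 = {"ACCA": 3, "AAAC": 5}
dataset2 = {"TGGT": 4, "GTTT": 1}

n1, n2 = normalize(dataset1), normalize(dataset2)
shared = sorted(n1.keys() & n2.keys())
print(shared)  # ['AAAC', 'ACCA']
```

If both datasets already use the lexicographically smaller form, `normalize` leaves them unchanged, so applying it is harmless and guards against a different convention in a future release.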