Closed tolot27 closed 5 years ago
Hi,
since I don't know your input or the exact command line you used, I will try to explain what may cause this in general.
Possible reason 1:
By default, KMC drops k-mers occurring fewer than twice (this may be overridden with the -ci parameter). Let's assume the input contains the following k-mers:
ACCA
ACCA
TGGT
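If you run KMC without the -b option, it will count:
ACCA 3
(TGGT is the reverse complement of ACCA, so both are counted under the same canonical k-mer.)
If -b is used, it will produce:
ACCA 2
(Notice that TGGT 1 will be absent, since it occurs only once in the input.)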
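The effect described above can be reproduced with a small Python sketch. This is only an illustrative model, not KMC's implementation: the function `count_kmers` and its parameters `canonical_mode` (standing in for running without -b) and `min_count` (standing in for -ci) are hypothetical names introduced here.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer: str) -> str:
    """Complement each base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]

def canonical(kmer: str) -> str:
    """Lexicographically smaller of the k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))

def count_kmers(kmers, canonical_mode=True, min_count=2):
    """Toy model: count k-mers, then drop those seen fewer than min_count times.

    canonical_mode=True models a run without -b,
    canonical_mode=False models a run with -b (assumption for illustration).
    """
    counts = Counter(canonical(k) if canonical_mode else k for k in kmers)
    return {k: c for k, c in counts.items() if c >= min_count}

kmers = ["ACCA", "ACCA", "TGGT"]
print(count_kmers(kmers, canonical_mode=True))   # {'ACCA': 3}
print(count_kmers(kmers, canonical_mode=False))  # {'ACCA': 2}
print(count_kmers(kmers, canonical_mode=False, min_count=1))
```

With `min_count=1` (the analogue of -ci1), the non-canonical run keeps TGGT as well, and the two totals agree again.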
Possible reason 2:
By default, KMC stores each k-mer counter in one byte, which means the maximum value of a counter is 255 (this may be overridden with the -cs parameter).
Now consider an input file containing 1000 k-mers AAAA and 700 k-mers TTTT.
If you run KMC without -b, the result is:
AAAA 255
If you run KMC with the -b parameter, the result will be:
AAAA 255
TTTT 255
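The saturation effect can be sketched the same way. Again, this is a toy model under stated assumptions rather than KMC's code: `count_saturating` and its `max_count` parameter (standing in for -cs) are hypothetical names, and the cap is simply applied to the final total.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer: str) -> str:
    """Lexicographically smaller of the k-mer and its reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])

def count_saturating(kmers, canonical_mode=True, min_count=2, max_count=255):
    """Toy model: count k-mers, drop rare ones, and cap counters at max_count."""
    counts = Counter(canonical(k) if canonical_mode else k for k in kmers)
    return {k: min(c, max_count) for k, c in counts.items() if c >= min_count}

kmers = ["AAAA"] * 1000 + ["TTTT"] * 700

# Without -b: 1700 canonical occurrences, capped at 255.
print(count_saturating(kmers, canonical_mode=True))   # {'AAAA': 255}
# With -b: 1000 and 700 occurrences, each capped at 255.
print(count_saturating(kmers, canonical_mode=False))  # {'AAAA': 255, 'TTTT': 255}
# With a large enough cap the canonical count equals the sum of both strands.
print(count_saturating(kmers, canonical_mode=True, max_count=10**6))  # {'AAAA': 1700}
```

Once the cap is high enough, the canonical count (1700) equals the sum of the two non-canonical counts (1000 + 700), which is the relationship the original question expected.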
Maybe there are some other reasons, but I don't know them right now. If you run KMC with -ci1 and with -cs set to some high value and it still behaves the way you described, it may in fact be a bug. In that case, could you please provide an example input file and the command line that cause this?
Kmers that are reverse complements of themselves are counted twice with -b but once without, right?
@notestaff No, they are not. Should they be? Why?
Due to lack of activity, I am closing this issue. If needed, please reopen it and supply some more info.
Hi @marekkokot, just reopening because I have a general question about the canonical form of a k-mer. I want to compare two KMC result datasets: does KMC always keep the same canonical form? Or is there a risk that one dataset has one form of a k-mer and the other dataset has the second form?
Thanks
OK, found it in the first publication: Usually we should not distinguish between a k-mer and its reversed complement, and by the “canonical k-mer” we will mean the lexicographically smaller of the two.
In the current release, the canonical k-mer is always the lexicographically lowest of itself and its reverse complement, so yes. This may change in future releases.
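A minimal sketch of why this makes two result sets comparable: if both tables key each k-mer by the lexicographically smaller of the k-mer and its reverse complement, the same k-mer gets the same key regardless of which strand it was read from. The tables `dataset_1` and `dataset_2` below are made-up examples, not real KMC output, and `canonical` is a hypothetical helper.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer: str) -> str:
    """Lexicographically smaller of the k-mer and its reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])

# Both orientations map to the same key.
print(canonical("TGGT"), canonical("ACCA"))  # ACCA ACCA

# Hypothetical count tables, assumed to be keyed by canonical k-mers already.
dataset_1 = {"ACCA": 3, "AACG": 5}
dataset_2 = {"ACCA": 7, "CCCC": 2}

# Sanity check: every key is already in canonical form.
assert all(canonical(k) == k for k in list(dataset_1) + list(dataset_2))

# K-mers present in both tables, with their pair of counts.
shared = {k: (dataset_1[k], dataset_2[k]) for k in dataset_1.keys() & dataset_2.keys()}
print(shared)  # {'ACCA': (3, 7)}
```

If a future release changed the canonical form, a check like the `assert` above would be one way to notice that two tables no longer use the same convention.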
Why is the non-canonical k-mer count (parameter -b) of a sequence different from the canonical k-mer count of the sequence plus its reverse complement? According to #46, the reverse complement is calculated for each k-mer.