refresh-bio / KMC

Fast and frugal disk based k-mer counter

Garbage kmer results for short sequences #130

Status: Closed (closed by tseemann 4 years ago)

tseemann commented 4 years ago

I am getting false kmer counts with simple examples like the one below.

  1. AAA=255, but should be AAA=1 ?
  2. Total kmers should be 10 not 32659 ?
  3. Total no. of reads should be 1 not 2 ?

If I change to -ci0 then I get 32525 fake k-mers. If I change to -p5 I get 32706 fakes. Might be related to #103.

$ cat seq3.fa
>seq3
AAACCCAACCAC

$ kmc -ci1 -k3 -fa -seq3.fa seq3 /tmp/
Stage 1: 100%
1st stage: 0.002712s
2nd stage: 0.00102s
Total    : 0.003732s
Tmp size : 0MB

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :            8
   No. of unique counted k-mers       :            8
   Total no. of k-mers                :        32659
   Total no. of reads                 :            2
   Total no. of super-k-mers          :            0

$ kmc_dump seq3 /dev/stdout
AAA     255
AAC     2
ACA     1
ACC     2
CAA     2
CAC     1
CCA     2
CCC     1
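
For reference, the dump above can be checked against a naive count. This is only a sketch in plain Python, not KMC's algorithm: `count_kmers` is a hypothetical helper that slides a window over the sequence without canonicalization. That should not matter here, since the sequence contains only A and C, so every 3-mer sorts before its reverse complement.

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every overlapping k-mer in seq, with no canonicalization."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# kmc_dump output copied from the report above
reported = {"AAA": 255, "AAC": 2, "ACA": 1, "ACC": 2,
            "CAA": 2, "CAC": 1, "CCA": 2, "CCC": 1}

expected = count_kmers("AAACCCAACCAC", 3)

# k-mers whose reported count differs from the naive count (0 if absent)
mismatches = {kmer: (reported[kmer], expected.get(kmer, 0))
              for kmer in sorted(reported)
              if reported[kmer] != expected.get(kmer, 0)}
print(mismatches)
# → {'AAA': (255, 1), 'ACA': (1, 0), 'CAA': (2, 1)}
```

So three entries look off: AAA is inflated, ACA does not occur in the sequence at all, and CAA occurs once rather than twice.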

tseemann commented 4 years ago

UPDATE:

If I change it to -fm (multi-fasta, even though only 1 sequence) it seems to do better, but still wrong.

   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :            7
   No. of unique counted k-mers       :            7
   Total no. of k-mers                :           10
   Total no. of sequences             :            1
   Total no. of super-k-mers          :            0

marekkokot commented 4 years ago

Thanks for reporting this. How many cores does your machine have? On my machine the results are also incorrect, but not exactly the same as what you reported.
I have also checked with the -fm switch and get the same results as you. Why do you think they are incorrect? At first glance they look OK to me. Am I missing something?
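
A quick way to sanity-check the -fm stats is to count the 3-mers of AAACCCAACCAC by hand. The sketch below is plain Python and independent of KMC; `canonical` and `kmer_stats` are hypothetical helpers, and it is an assumption that the run counts k-mers either as-is or in canonical form (the lexicographically smaller of a k-mer and its reverse complement).

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the smaller of kmer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def kmer_stats(seq, k, use_canonical):
    """Return (number of distinct k-mers, total number of k-mers)."""
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    counts = Counter(canonical(x) if use_canonical else x for x in kmers)
    return len(counts), sum(counts.values())

for use_canonical in (False, True):
    print(use_canonical, kmer_stats("AAACCCAACCAC", 3, use_canonical))
# → False (7, 10)
# → True (7, 10)
```

Under either assumption the naive count gives 7 unique k-mers and 10 in total, which agrees with the -fm stats quoted above.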

marekkokot commented 4 years ago

Hi, I think 85ad76956d890aa24fc8525eee5653078ed86ace should fix the issue. Could you please verify it in your environment and let me know?

tseemann commented 4 years ago

I have 72 cores.

I can only test tagged versions.

git tag v3.3.2
git push --tags

?

marekkokot commented 4 years ago

I have tagged this version.

tseemann commented 4 years ago

Thank you!