refresh-bio / KMC

Fast and frugal disk based k-mer counter

New k-mers that do not exist in the sequence #168

Closed: gorliver closed this issue 3 years ago

gorliver commented 3 years ago

Hi, I have a FASTA file containing just one read:

>t
GAACACATATGAATCATCAAATTAACAACCAATATT

I ran

kmc -k35 -t20 -m32 -ci1 -b -fa t.fa t_k35 tmp/

and here is the stdout:

**
Stage 1: 100%
Stage 2: 100%
1st stage: 0.267759s
2nd stage: 0.093326s
Total    : 0.361085s
Tmp size : 0MB

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :           37
   No. of unique counted k-mers       :           37
   Total no. of k-mers                :        11132
   Total no. of reads                 :            1
   Total no. of super-k-mers          :           46

The output of dump is:

AACACATATGAATCATCAAATTAACAACCAATATT     1
GAACACATATGAATCATCAAATTAACAACCAATAT     1
AAATTAACAACCAATATTAAAAAAAAAAAAAAAAA     1
AACAACCAATATTAAAAAAAAAAAAAAAAAAAAAA     1
AACCAATATTAAAAAAAAAAAAAAAAAAAAAAAAA     1
AATCATCAAATTAACAACCAATATTAAAAAAAAAA     1
AATTAACAACCAATATTAAAAAAAAAAAAAAAAAA     1
ACAACCAATATTAAAAAAAAAAAAAAAAAAAAAAA     1
ACACATATGAATCATCAAATTAACAACCAATATTA     1
ACATATGAATCATCAAATTAACAACCAATATTAAA     1
ACCAATATTAAAAAAAAAAAAAAAAAAAAAAAAAA     1
ATATGAATCATCAAATTAACAACCAATATTAAAAA     1
ATCAAATTAACAACCAATATTAAAAAAAAAAAAAA     1
ATCATCAAATTAACAACCAATATTAAAAAAAAAAA     1
ATGAATCATCAAATTAACAACCAATATTAAAAAAA     1
ATTAACAACCAATATTAAAAAAAAAAAAAAAAAAA     1
CAAATTAACAACCAATATTAAAAAAAAAAAAAAAA     1
CAACCAATATTAAAAAAAAAAAAAAAAAAAAAAAA     1
CACATATGAATCATCAAATTAACAACCAATATTAA     1
CATATGAATCATCAAATTAACAACCAATATTAAAA     1
CATCAAATTAACAACCAATATTAAAAAAAAAAAAA     1
GAATCATCAAATTAACAACCAATATTAAAAAAAAA     1
TAACAACCAATATTAAAAAAAAAAAAAAAAAAAAA     1
TATGAATCATCAAATTAACAACCAATATTAAAAAA     1
TCAAATTAACAACCAATATTAAAAAAAAAAAAAAA     1
TCATCAAATTAACAACCAATATTAAAAAAAAAAAA     1
TGAATCATCAAATTAACAACCAATATTAAAAAAAA     1
TTAACAACCAATATTAAAAAAAAAAAAAAAAAAAA     1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA     255
AATATTAAAAAAAAAAAAAAAAAAAAAAAAAAAAA     1
ATATTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA     1
ATTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA     1
CAATATTAAAAAAAAAAAAAAAAAAAAAAAAAAAA     1
CCAATATTAAAAAAAAAAAAAAAAAAAAAAAAAAA     1
TAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA     1
TATTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA     1
TTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA     1

I got many k-mers that do not exist in the read.

How can I get rid of the unrelated k-mers?

Thank you, Gorliver

marekkokot commented 3 years ago

Hi, this is not good :(

Which version of KMC are you using? Is it one of the releases, or did you compile the recent source code? I tried with the recent source code and the result is:

bin/kmc -k35 -t20 -m32 -ci1 -b -fa test.fa t_k35 .
**
Stage 1: 100%
Stage 2: 100%
1st stage: 1.34564s
2nd stage: 0.161782s
Total    : 1.50742s
Tmp size : 0MB

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :            2
   No. of unique counted k-mers       :            2
   Total no. of k-mers                :            2
   Total no. of reads                 :            1
   Total no. of super-k-mers          :            1
AACACATATGAATCATCAAATTAACAACCAATATT 1
GAACACATATGAATCATCAAATTAACAACCAATAT 1 

so it seems to be correct.
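As a sanity check that does not depend on KMC, a 36 bp read can contain only 36 - 35 + 1 = 2 k-mers of length 35. The following is a minimal Python sketch that enumerates them; the helper name `kmers` is made up for illustration, and it assumes non-canonical counting (as with the `-b` switch used above), so no reverse complements are taken.

```python
def kmers(seq: str, k: int) -> list[str]:
    """Return every length-k substring of seq, left to right (no canonical form)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


read = "GAACACATATGAATCATCAAATTAACAACCAATATT"  # the 36 bp read from the issue
expected = kmers(read, 35)

print(len(read), "bp ->", len(expected), "k-mers")
for kmer in expected:
    print(kmer)
```

The two strings printed are the same two k-mers reported by the recent source build, which supports the reading that the 37 k-mers from the 3.1.1 release are spurious.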

One more question: what operating system are you using? And yet another question: do you have an end-of-line character ('\n') after the sequence? There was a bug related to files without this character at the end of the file, which, as it seems, was only partially fixed in the last release: the missing EOL is detected, but the results are wrong. The current source code seems to work fine. Do you have the possibility to just compile KMC? I think I will create a new release soon because of the recent extensions to the code, but I am not sure when.
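Whether the input file ends with '\n' can be checked without opening it in an editor. The sketch below is one way to do that in Python; the function names are hypothetical, and it only inspects or patches the file. It does not by itself work around the bug in an affected KMC release.

```python
import os
import tempfile


def ends_with_newline(path: str) -> bool:
    """True if the last byte of the file is a newline; an empty file gives False."""
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        if fh.tell() == 0:
            return False
        fh.seek(-1, os.SEEK_END)
        return fh.read(1) == b"\n"


def ensure_trailing_newline(path: str) -> bool:
    """Append a newline if it is missing; return True if the file was modified."""
    if ends_with_newline(path):
        return False
    with open(path, "ab") as fh:
        fh.write(b"\n")
    return True


# Demo on a throwaway copy of the one-read FASTA, written without a final newline.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "t.fa")
    with open(demo, "wb") as fh:
        fh.write(b">t\nGAACACATATGAATCATCAAATTAACAACCAATATT")
    print("had newline:", ends_with_newline(demo))
    print("modified:", ensure_trailing_newline(demo))
```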

gorliver commented 3 years ago

I use the precompiled release, the version is:

K-Mer Counter (KMC) ver. 3.1.1 (2019-05-19)

I run KMC on an HPCC, so the precompiled release is the most convenient option for me. The system is CentOS.

There was no '\n' after the sequence. I added the '\n' and the result is the same. When I add another sequence, KMC generates the correct k-mers for the first sequence, but the second sequence is skipped:

**
Stage 1: 100%
Stage 2: 100%
1st stage: 0.304549s
2nd stage: 0.081882s
Total    : 0.386431s
Tmp size : 0MB

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :            2
   No. of unique counted k-mers       :            2
   Total no. of k-mers                :            2
   Total no. of reads                 :            1
   Total no. of super-k-mers          :            1

The fasta file is:

>t
GAACACATATGAATCATCAAATTAACAACCAATATT
>t2
TTCCTCCATTATTTTATGGAACATGGGTAACCTCTA

The k-mers I got are (from the dump command):

AACACATATGAATCATCAAATTAACAACCAATATT     1
GAACACATATGAATCATCAAATTAACAACCAATAT     1

I also tried a FASTA file containing three 36 bp sequences; KMC generated the correct k-mers for the first two reads but skipped the third read.

I will try to compile the latest KMC on the HPCC, but a precompiled release would be highly appreciated.
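Skipped reads like this can be spotted by comparing the dump against the k-mers expected from the FASTA file. The following Python sketch does that for the two-read example above. All function names are hypothetical, and it assumes non-canonical counting (`-b`), sequences containing only A/C/G/T, and a dump with one whitespace-separated `k-mer count` pair per line.

```python
def read_fasta(text: str) -> list[str]:
    """Return the sequences of a FASTA text, joining multi-line records."""
    seqs: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            seqs.append("")            # start a new record
        elif seqs:
            seqs[-1] += line.upper()   # continue the current record
    return seqs


def expected_kmers(seqs: list[str], k: int) -> set[str]:
    """All distinct length-k substrings over every sequence (no canonical form)."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def dumped_kmers(dump_text: str) -> set[str]:
    """First column of each non-empty dump line."""
    return {line.split()[0] for line in dump_text.splitlines() if line.strip()}


fasta = """>t
GAACACATATGAATCATCAAATTAACAACCAATATT
>t2
TTCCTCCATTATTTTATGGAACATGGGTAACCTCTA
"""
dump = """AACACATATGAATCATCAAATTAACAACCAATATT     1
GAACACATATGAATCATCAAATTAACAACCAATAT     1
"""

expected = expected_kmers(read_fasta(fasta), 35)
found = dumped_kmers(dump)
missing = sorted(expected - found)      # expected but absent from the dump
unexpected = sorted(found - expected)   # in the dump but not in any read

print("missing:", missing)
print("unexpected:", unexpected)
```

For this input the `missing` list holds the two 35-mers of `t2`, while `unexpected` is empty. Run against the dump from the single-read case, the same comparison would instead flag the poly-A k-mers as unexpected.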

gorliver commented 3 years ago

I managed to compile the latest version and all the issues are gone and it works great now. Many thanks for your help!