refresh-bio / KMC

Fast and frugal disk based k-mer counter
277 stars 72 forks source link

same dataset, but different results using KMC and Jellyfish #190

Closed Youpu-Chen closed 2 years ago

Youpu-Chen commented 2 years ago

Hi. there. I am running some tests using the genome of E. coli. The sequence used in the analysis was downloaded from NCBI using command:

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz

gunzip -c GCF_000005845.2_ASM584v2_genomic.fna.gz > Ecoli.fasta

The KMC command I used is below:

kmc -k15 -m12 -t16 -b -fm -ci0 Ecoli.fasta kmc.Ecoli.15mer kmc_tmp
kmc_tools transform kmc.Ecoli.15mer histogram kmc.Ecoli.15mer.histo

The Jellyfish command I used is below as well:

jellyfish count -m 15 -o jf.Ecoli.15mer -c 3 -s 1G -t 16 Ecoli.fasta 
jellyfish histo jf.Ecoli.15mer > jf.Ecoli.15mer.histo

It seems to me that the meaning of the command above are the same, but I got different results. For KMC, the histos is: 1 4443345 2 52456 3 11610 4 2700 5 5668 6 223 7 136 8 760 9 486 10 32

But for Jellyfish, the histo is: 1 4443336 2 52451 3 11609 4 2699 5 5669 6 223 7 136 8 761 9 485 10 32

I just get more and more confused by the test. I'll really appreciate if you could give me a hint. Thanks.

marekkokot commented 2 years ago

Hi,

what version of jellyfish are you using, becasue I cannot reproduce your results. This is what I get for jellyfish

jellyfish count -m 15 -o jf.Ecoli.15mer -c 3 -s 16G -t 16 Ecoli.fasta
jellyfish histo jf.Ecoli.15mer > jf.Ecoli.15mer.histo
head jf.Ecoli.15mer.histo
1 4443345
2 52456
3 11610
4 2700
5 5668
6 223
7 136
8 760
9 486
10 32
jellyfish --version
jellyfish 2.3.0

Installed with conda install kmer-jellyfish

Edit: also a really nicely reported issue, easy to repeat your steps, thanks!

Youpu-Chen commented 2 years ago

The jellyfish I am using is 2.3.0 too. I used jellyfish-2.3.0.tar.gz to install the software. And the KMC version is 3.2.1.

marekkokot commented 2 years ago

Hmm, OK now I see I used -s 16G for Jellyfish (anyway it runs much faster for example if you use -s 100M). Anyway for -s 1G I also got:

(base) mkokot@KOKI-PC-RYZEN:/mnt/c/nauka/DEVELOPMENT/kmc/issue-190$ jellyfish count -m 15 -o jf.Ecoli.15mer -c 3 -s 1G -t 16 Ecoli.fasta
(base) mkokot@KOKI-PC-RYZEN:/mnt/c/nauka/DEVELOPMENT/kmc/issue-190$ jellyfish histo jf.Ecoli.15mer > jf.Ecoli.15mer.histo
(base) mkokot@KOKI-PC-RYZEN:/mnt/c/nauka/DEVELOPMENT/kmc/issue-190$ head jf.Ecoli.15mer.histo
0 2
1 4443336
2 52451
3 11609
4 2699
5 5669
6 223
7 136
8 761
9 485

It seems to be a jellyfish bug, as for 1G the results are consistent with KMC. Furthermore, I have a trivial implementation of k-mer counter, available in KMC repo (tests/kmc_CLI/trivial-k-mer-counter). I counted k-mers with it and get the histo with simple c++ program:

#include <fstream>
#include <iostream>
#include <vector>
#include <string>

using namespace std;

int main(int argc, char**argv) {
    ifstream in(argv[1]);
    if(!in) {
        std::cerr << "Error: cannot open input file\n";
        return 1;
    }

    vector<uint64_t> histo;

    std::string kmer;
    uint64_t c;

    while(in >> kmer >> c) {
        if (histo.size() <= c)
            histo.resize(c+1);
        ++histo[c];
    }

    for(size_t i = 0 ; i < histo.size() ; ++i) {
        std::cout << i << "\t" << histo[i] << "\n";
    }
}
make 
bin/counter -k 15 -ci 0 -b ../../../../Ecoli.fasta x.txt
g++ -O3 histo.cpp -o histo
./histo x.txt > histo.txt
head histo.txt 

The result again is

0       0
1       4443345
2       52456
3       11610
4       2700
5       5668
6       223
7       136
8       760
9       486
Youpu-Chen commented 2 years ago

Thanks, it solve my problems!