refresh-bio / kmer-db

Kmer-db is a fast and memory-efficient tool for large-scale k-mer analyses (indexing, querying, estimating evolutionary relationships, etc.).
GNU General Public License v3.0

kmer-db build failed #2

Closed dgg32 closed 2 years ago

dgg32 commented 6 years ago

Hello developers, I have built kmer-db, but the first kmer-db command I tried failed on all genomes.

My command:

./kmer-db-1.1 build -k 20 fna_list genome_db

In "fna_list", I tried both absolute paths and just the file names, and I put this file in the same directory as the fna files,

txid1129897_refseq_genomic.fna
txid1131272_refseq_genomic.fna
txid1134406_refseq_genomic.fna
txid1193806_refseq_genomic.fna
txid1193807_refseq_genomic.fna
...

neither was working:

Processing samples...
failed:txid1129897_refseq_genomic.fna
failed:txid1131272_refseq_genomic.fna
failed:txid1134406_refseq_genomic.fna
failed:txid1193806_refseq_genomic.fna
failed:txid1193807_refseq_genomic.fna
failed:txid1204385_genbank_genomic.fna

Please help! Thank you!

agudys commented 6 years ago

Hello, it seems there is a bug in the README. Currently Kmer-db supports only gzipped genome files (you need to specify each filename without the gz extension, as it is added automatically). Please try running it on gzipped samples. I'll try to add support for raw FASTA files as soon as possible and let you know. Adam

dgg32 commented 6 years ago

Hi agudys. It still fails. I have gzipped the files and tried both fna and fna.gz in the fna_list (screenshots); all failed.

agudys commented 6 years ago

This message is shown when a sample file has not been found. Is your working directory the same as the directory where fna_list.txt is placed? If not, absolute paths must be specified in the file (you mentioned in the first post that you had tried this, but maybe that was before you removed the gz extensions). In the meantime, I have fixed the filename issues, so Kmer-db now automatically tries to add the following extensions to the sample names specified in a list file: .fna, .fna.gz, and .gz.

dgg32 commented 6 years ago

Hi, agudys. I took the 1.11 version and now put the absolute paths in the fna_list.txt file. The program now works through the initial step. So thank you for your effort.

But it now dies with a segmentation fault, without giving any hint about how to fix it:

Kmer-db version 1.11
S. Deorowicz, A. Gudys, M. Dlugosz, M. Kokot, and A. Danek (c) 2018

Database building mode (fasta genomes)
Processing samples... 40/57...
Segmentation fault (core dumped)

Any idea please?

agudys commented 6 years ago

Could you send me the data? I'll try to debug it by myself and let you know.

yhg926 commented 5 years ago

Hi, my version is Kmer-db version 1.53 (19.04.2019), and I ran into the same problem:

failed:/data/hgyi/work/test2/cp.5.Streptomyces_violaceusniger_Tu_4113.fna.gz
failed:/data/hgyi/work/test2/cp.5.Streptomyces_xiamenensis_318.fna.gz
failed:/data/hgyi/work/test2/cp.5.Streptomyces_xinghaiensis_S187.fna.gz
failed:/data/hgyi/work/test2/cp.5.Streptosporangium_roseum_DSM_43021.fna.gz
failed:/data/hgyi/work/test2/cp.5.synthetic_Escherichia_coli_C321_deltaA_CP006698.LargeContigs.fna.gz
failed:/data/hgyi/work/test2/cp.5.synthetic_Escherichia_coli_C321_deltaA_CP010455.LargeContigs.fna.gz
failed:/data/hgyi/work/test2/cp.5.synthetic_Escherichia_coli_C321_deltaA_CP010456.LargeContigs.fna.gz
Analysis finished at Sat Apr 20 16:40:11 2019

EXECUTION TIMES
Total: 3.28242
Processing time: 6.91445e-310
Hashatable processing (parallel): 0 imbalance: -nan
Resize: 0
Find'n'add: 0
Sort time (parallel): 0
Pattern extension time (parallel): 0

STATISTICS
Number of samples: 0
Number of patterns: 1 (0 B)
Number of k-mers: 0
K-mer length: 0
Minhash fraction: 0
Workers count: 12

Serializing database...OK (0.339099 seconds)
Releasing memory...OK (0.000421392 seconds)

agudys commented 5 years ago

Hello,

Try removing the extensions from the sample names in your list file. Currently, Kmer-db automatically tries the .fasta, .fasta.gz, .fna, and .fna.gz extensions when genome input is used.

Regards, Adam
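
Since the "failed:" message only says that a sample file was not found, it can help to check a list file before starting a build. The following is a minimal sketch of such a check, not Kmer-db's own code. It assumes that each extension is appended to the sample name exactly as written in the list file and that relative names are resolved against the working directory; the function names `resolve_sample` and `check_sample_list` are hypothetical.

```python
from pathlib import Path

# Extensions named in the comment above. The lookup order used inside
# Kmer-db is an assumption of this sketch.
EXTENSIONS = (".fasta", ".fasta.gz", ".fna", ".fna.gz")


def resolve_sample(name, base_dir=".", extensions=EXTENSIONS):
    """Return the first existing file name+ext, or None if nothing matches."""
    base = Path(base_dir)
    for ext in extensions:
        # Joining with an absolute name discards base, so absolute
        # sample paths work as well as relative ones.
        candidate = base / (name + ext)
        if candidate.is_file():
            return candidate
    return None


def check_sample_list(list_file, base_dir="."):
    """Map each non-empty line of the list file to its resolved path or None."""
    results = {}
    for line in Path(list_file).read_text().splitlines():
        name = line.strip()
        if name:
            results[name] = resolve_sample(name, base_dir)
    return results
```

Calling `check_sample_list("fna_list.txt")` from the directory where the build will run returns a dictionary; any entry whose value is `None` would likely be reported as failed under the assumptions above, for example a name that still carries its `.fna.gz` extension.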

yhg926 commented 5 years ago

Hi, I have successfully run kmer-db. Thank you for your help. However, it takes ~96 min to process my test dataset of ~100k genomes on my 12-core machine using 0.04% of the k-mers. Is this normal performance, or do I need to optimize my options?

Here are my commands:

hgyi@sustc-HG:/data/hgyi/work$ /usr/bin/time -v kmer-db build -f 0.0004 -t 12 -k 16 kmerdb.list.2 db

Kmer-db version 1.52 (16.04.2019)
S. Deorowicz, A. Gudys, M. Dlugosz, M. Kokot, and A. Danek (c) 2018

Database building mode (from fasta genomes)
Analysis started at Sun Apr 21 11:58:59 2019

Processing samples... 99750/99750...
Analysis finished at Sun Apr 21 12:43:57 2019

EXECUTION TIMES
Total: 2697.55
Processing time: -nan
Hashatable processing (parallel): 8.83383 imbalance: 96
Resize: 0.0512286
Find'n'add: 0.340432
Sort time (parallel): 20.9353
Pattern extension time (parallel): 14.332

STATISTICS
Number of samples: 99,750
Number of patterns: 402,204 (133,784,336 B)
Number of k-mers: 533,193
K-mer length: 16
Minhash fraction: 0.0004
Workers count: 12

Serializing database...OK (0.476834 seconds)
Releasing memory...OK (0.0787295 seconds)

Command being timed: "kmer-db build -f 0.0004 -t 12 -k 16 kmerdb.list.2 db"
User time (seconds): 6086.98
System time (seconds): 180.21
Percent of CPU this job got: 232%
Elapsed (wall clock) time (h:mm:ss or m:ss): 44:58.23
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1583548
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 3
Minor (reclaiming a frame) page faults: 414504
Voluntary context switches: 18680790
Involuntary context switches: 99488
Swaps: 0
File system inputs: 282312160
File system outputs: 269416
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

hgyi@sustc-HG:/data/hgyi/work$ /usr/bin/time -v kmer-db -t 12 new2all db ./kmerdb.list.2 table

Kmer-db version 1.52 (16.04.2019)
S. Deorowicz, A. Gudys, M. Dlugosz, M. Kokot, and A. Danek (c) 2018

Set of new samples (from fasta genomes) versus entire database comparison
Loading k-mer database db...
Loading general info...
Loading kmer hashtables... 1/1...
Loading patterns... 402000/402204...
OK (0.426714 seconds)
Number of samples: 99,750
Number of patterns: 402,204 (0 B)
Number of k-mers: 0
K-mer length: 16
Minhash fraction: 0.0004
Workers count: 12

Storing matrix of common k-mers in table...Loading queries...Processing queries... 99750...

EXECUTION TIMES
Total: 5754.78
Loading k-mers: 0
Processing time: 4325.93

Command being timed: "kmer-db -t 12 new2all db ./kmerdb.list.2 table"
User time (seconds): 49912.50
System time (seconds): 135.41
Percent of CPU this job got: 869%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:35:55
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1229320
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 16
Minor (reclaiming a frame) page faults: 1929282
Voluntary context switches: 1698016
Involuntary context switches: 1997418
Swaps: 0
File system inputs: 282471440
File system outputs: 61120040
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

agudys commented 5 years ago

@yhg926 Hello, as for the build time, according to your report it took 45 minutes.

Elapsed (wall clock) time (h:mm:ss or m:ss): 44:58.23

I ran an experiment with the same parameters as yours on ~150k microbial genomes from NCBI (180 GB) and it took 35 minutes, so I got about twice the throughput you obtained (assuming a similar input size per genome). The difference is not dramatic and may be due to hardware configuration (including disk performance). If you provide me with some details on your hardware and the total size of the dataset, I'll take a closer look at that.
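
The "about twice the throughput" estimate can be checked with a few lines of arithmetic. This sketch uses only the figures quoted in the thread and, like the comment above, assumes that the average genome size is similar in both datasets; the variable names are illustrative.

```python
# Figures quoted in the thread.
reporter_genomes = 99_750             # "Processing samples... 99750/99750"
reporter_minutes = 44 + 58.23 / 60    # wall clock time 44:58.23
maintainer_genomes = 150_000          # "~150k microbial genomes"
maintainer_minutes = 35


def genomes_per_minute(genomes, minutes):
    """Build throughput expressed as genomes processed per minute."""
    return genomes / minutes


reporter_rate = genomes_per_minute(reporter_genomes, reporter_minutes)
maintainer_rate = genomes_per_minute(maintainer_genomes, maintainer_minutes)
ratio = maintainer_rate / reporter_rate

print(f"reporter:   {reporter_rate:.0f} genomes/min")
print(f"maintainer: {maintainer_rate:.0f} genomes/min")
print(f"ratio:      {ratio:.2f}")
```

The ratio comes out a little below 2, which matches the estimate in the comment.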

As for the distance calculation, your machine indeed needed 96 minutes.

Elapsed (wall clock) time (h:mm:ss or m:ss): 1:35:55

However, I can see that you were using new2all mode with the same list of genomes that was used to construct the database. For this purpose you should use all2all mode instead, which compares the samples already in the database (and is specifically optimized for this). new2all was designed for comparing new samples against a database. Please try running all2all and let me know.

Regards, Adam

agudys commented 5 years ago

@yhg926 Please also note that the performance of the build mode with such a small filter value (0.0004) is probably limited by the disk. Tens of megabytes per second is a typical transfer rate when reading a lot of ~1 MB files (unless you have an SSD). Therefore, I suggest experimenting with higher filter values (even ten times larger), because you can probably get much better quality at the same speed.
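
The disk-bound explanation can be cross-checked against the `/usr/bin/time -v` report for the build run. The sketch below is a rough estimate only: it assumes that "File system inputs" counts 512-byte blocks (the usual unit on Linux, but not stated in the thread) and that every counted read actually reached the disk.

```python
# Values copied from the `/usr/bin/time -v` report of the build command.
fs_inputs = 282_312_160           # "File system inputs"
wall_seconds = 44 * 60 + 58.23    # "Elapsed (wall clock) time": 44:58.23
cpu_percent = 232                 # "Percent of CPU this job got"
threads = 12                      # -t 12
samples = 99_750                  # "Number of samples"

BLOCK_BYTES = 512  # ASSUMPTION: unit of "File system inputs"

bytes_read = fs_inputs * BLOCK_BYTES
read_mb_per_s = bytes_read / wall_seconds / 1e6   # average read rate
mean_file_mb = bytes_read / samples / 1e6         # average input per sample
cpu_utilization = cpu_percent / (100 * threads)   # share of 12 threads kept busy

print(f"read rate:       {read_mb_per_s:.1f} MB/s")
print(f"mean file size:  {mean_file_mb:.2f} MB")
print(f"CPU utilization: {cpu_utilization:.0%}")
```

Under these assumptions the build read roughly 50 MB/s from files averaging about 1.4 MB each while keeping less than a fifth of the 12 threads busy, which is consistent with a run that mostly waits on the disk.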