muellan / metacache

memory efficient, fast & precise taxonomic classification system for metagenomic read mapping
GNU General Public License v3.0
metacache-build-refseq gets killed at 46% every time i try to build the database #23

Closed clanzett closed 2 years ago

clanzett commented 2 years ago

When I try to build the RefSeq database, the process gets killed at 46%. This happens every time, at exactly the same percentage. There is enough storage available. Hardware specs: 64 GB RAM, 16 cores.

Building new database 'refseq' from reference sequences.
Max locations per feature set to 254
Reading taxon names ... done.
Reading taxonomic node mergers ... done.
Reading taxonomic tree ... 2429955 taxa read.
Taxonomy applied to database.
------------------------------------------------
MetaCache version    2.0.1 (20210305)
database version     20200820
------------------------------------------------
sequence type        mc::char_sequence
target id type       unsigned short int 16 bits
target limit         65535
------------------------------------------------
window id type       unsigned int 32 bits
window limit         4294967295
window length        127
window stride        112
------------------------------------------------
sketcher type        mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type         unsigned int 32 bits
feature hash         mc::same_size_hash<unsigned int>
kmer size            16
kmer limit           16
sketch size          16
------------------------------------------------
bucket size type     unsigned char 8 bits
max. locations       254
location limit       254
------------------------------------------------
Reading sequence to taxon mappings from genomes/refseq/archaea/assembly_summary.txt
Reading sequence to taxon mappings from genomes/refseq/bacteria/assembly_summary.txt
Reading sequence to taxon mappings from genomes/refseq/viral/assembly_summary.txt
Reading sequence to taxon mappings from genomes/taxonomy/assembly_summary_refseq.txt
Reading sequence to taxon mappings from genomes/taxonomy/assembly_summary_refseq_historical.txt
Processing reference sequences.
[=================>                                                        ] 24%
[=================>                                                        ] 24%

[==================================>                                       ] 46%Killed

muellan commented 2 years ago

I'm afraid the reason might be that the latest RefSeq release has already become too large to build a complete database (with default settings) within 64 GB of RAM.
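To get a feel for why the reference collection size matters, here is a back-of-the-envelope sketch in Python. It uses only the window stride, sketch size, and id widths printed in the build log, and it assumes (this is not confirmed by the thread) that each window contributes up to `sketch size` features and that each stored location costs one target id plus one window id. It ignores hash table overhead and the 254-locations-per-feature cap, so it is not MetaCache's actual memory model. The input sizes are hypothetical, not measured RefSeq figures.

```python
# Values copied from the build log in this issue.
WINDOW_STRIDE = 112    # "window stride"
SKETCH_SIZE = 16       # "sketch size" (features kept per window)
TARGET_ID_BYTES = 2    # "target id type ... 16 bits"
WINDOW_ID_BYTES = 4    # "window id type ... 32 bits"


def estimate_location_bytes(total_bases: int) -> int:
    """Rough estimate of the bytes needed to store all (target, window) locations.

    Assumption: every window yields SKETCH_SIZE features and every feature
    occurrence is stored as one target id plus one window id.
    """
    windows = -(-total_bases // WINDOW_STRIDE)  # ceiling division
    locations = windows * SKETCH_SIZE
    return locations * (TARGET_ID_BYTES + WINDOW_ID_BYTES)


if __name__ == "__main__":
    # Hypothetical collection sizes in bases, for illustration only.
    for total_bases in (10**9, 10**11, 10**12):
        gib = estimate_location_bytes(total_bases) / 2**30
        print(f"{total_bases:.0e} bases -> ~{gib:.1f} GiB of locations")
```

Under these assumptions the estimate grows linearly with the number of reference bases, which is consistent with a build that fits in memory for a small subset but not for the full collection.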

As I see it there are three ways around this:

Oh, and you should of course make sure that the reference genome files were downloaded properly. And maybe you should also do a quick test with just the virus genomes to make sure that everything works in principle.
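A quick way to check that the reference files were downloaded properly is to read each one to the end and confirm it looks like FASTA. The sketch below is a minimal Python example under stated assumptions: the file patterns `*.fna` and `*.fna.gz` are guesses about the local layout, the function names are illustrative, and the directory is the one shown in the build log. It is not part of MetaCache.

```python
import gzip
import zlib
from pathlib import Path


def open_maybe_gzip(path: Path):
    """Open a file as text, transparently handling .gz compression."""
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def check_fasta(path: Path) -> str:
    """Return 'ok' or a short description of the first problem found."""
    try:
        with open_maybe_gzip(path) as handle:
            first = handle.readline()
            if not first.startswith(">"):
                return "does not start with a FASTA header"
            # Read to the end so that a truncated gzip stream raises an error.
            has_sequence = False
            for line in handle:
                if line.strip() and not line.startswith(">"):
                    has_sequence = True
            return "ok" if has_sequence else "no sequence data"
    except (OSError, EOFError, UnicodeDecodeError, zlib.error) as exc:
        return f"unreadable: {exc}"


def check_directory(root: Path, patterns=("*.fna", "*.fna.gz")) -> dict:
    """Map each problematic file under root to its problem description."""
    problems = {}
    for pattern in patterns:  # assumed file extensions, adjust as needed
        for path in sorted(root.rglob(pattern)):
            result = check_fasta(path)
            if result != "ok":
                problems[path] = result
    return problems


if __name__ == "__main__":
    # Directory taken from the build log; adjust to the local layout.
    for path, problem in check_directory(Path("genomes/refseq/viral")).items():
        print(f"{path}: {problem}")
```

Running it on the viral directory first keeps the check fast, in line with the suggestion to test with just the virus genomes.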

clanzett commented 2 years ago

Yes, it does indeed seem to be a memory problem. I increased the memory to 128 GB and the build has now gotten past the 46% mark. We will see whether it completes this time. I will keep you posted.

If not, I will try to build a viral-only database and check whether there is a general problem with the build process.

Anyway, thanks for your quick reply.
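For anyone hitting a similar bare "Killed" message, the following Python sketch shows one way to see which signal ended a process and how much memory it used at its peak. It assumes a Linux or other Unix-like system; `run_and_report` is an illustrative helper, not a MetaCache tool. A SIGKILL together with a peak close to the machine's RAM is only a hint of an out-of-memory kill. The kernel log is where such kills are usually recorded.

```python
import resource
import signal
import subprocess
import sys


def run_and_report(command: list) -> dict:
    """Run a command and report how it ended and the peak memory of child processes."""
    completed = subprocess.run(command)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    killed_by = None
    if completed.returncode < 0:
        # subprocess reports death by signal as the negated signal number.
        killed_by = signal.Signals(-completed.returncode).name
    return {
        "returncode": completed.returncode,
        "killed_by": killed_by,
        # ru_maxrss is the largest resident set size among waited-for
        # children: kilobytes on Linux, bytes on macOS.
        "peak_rss": usage.ru_maxrss,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python run_and_report.py <command> [args...]
    report = run_and_report(sys.argv[1:])
    print(f"return code : {report['returncode']}")
    print(f"killed by   : {report['killed_by']}")
    print(f"peak RSS    : {report['peak_rss']}")
```

The script name in the usage comment is hypothetical. The command to wrap would be whatever build command is being run locally.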