muellan / metacache

memory efficient, fast & precise taxonomic classification system for metagenomic read mapping
GNU General Public License v3.0

Problem building DB with metacache-build-refseq #24

Closed by donovan-parks 2 years ago

donovan-parks commented 2 years ago

Hi,

I'm experiencing some issues building a RefSeq reference database for MetaCache v2.1.1.

The first issue is that the NCBI assembly_summary.txt files now have a few entries where the ftp_path field is set to na. This breaks the metacache-build-refseq script. It is easy enough to work around once you know the problem, but it takes a bit of exploring to figure out.
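A minimal sketch of one possible workaround, assuming the summary file is tab-separated and has a `#`-prefixed header line that names an `ftp_path` column. The function name `usable_ftp_paths` and the sample rows are hypothetical, and this is not what metacache-build-refseq itself does; it only illustrates skipping entries whose ftp_path is na before anything is downloaded.

```python
def usable_ftp_paths(lines):
    """Yield ftp_path values, skipping rows where the field is empty or 'na'.

    Assumes a tab-separated file whose '#'-prefixed header line contains
    a column called 'ftp_path' (layout assumed, not taken from the script).
    """
    col = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Header or comment line: look for the column names.
            fields = [f.strip() for f in line.lstrip("#").strip().split("\t")]
            if "ftp_path" in fields:
                col = fields.index("ftp_path")
            continue
        if col is None or not line:
            continue  # no header seen yet, or blank line
        fields = line.split("\t")
        if len(fields) <= col:
            continue  # truncated row
        path = fields[col].strip()
        if path and path.lower() != "na":
            yield path


# Hypothetical sample rows, only to show the filtering.
sample = [
    "# comment line without column names\n",
    "# assembly_accession\ttaxid\tftp_path\n",
    "ACC_1\t1\thttps://example.org/genome_a\n",
    "ACC_2\t2\tna\n",
    "ACC_3\t3\thttps://example.org/genome_b\n",
]
for path in usable_ftp_paths(sample):
    print(path)
```

Looking the column up by name rather than by a fixed position keeps the sketch independent of the exact column order in the file.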

The more serious issue is that the metacache build command has been stuck at 98% for several hours now and all the metacache processes are in a sleep state. I am running this on a 16 CPU machine with 126 GB of RAM. Only 82 GB of RAM is used at the moment, so this doesn't appear to be a memory issue. I can also verify I have plenty of disk space.

Have you experienced this problem? Can anyone verify that they have recently been able to build a MetaCache DB from complete RefSeq genomes?

Thanks, Donovan

muellan commented 2 years ago

Hi,

thanks for pointing out the issue with the assembly_summary.txt files. We'll update our scripts.

Regarding the other problem - I think we have to try to reproduce it in order to diagnose what's wrong. Could you maybe try to build again, but with the option -verbose? This will list the currently processed genomes, which might indicate at what point / processed sequence it goes wrong. Other than that, we'll also (try to) build the latest RefSeq.

muellan commented 2 years ago

We also ran into the same problem when trying to build the latest RefSeq. I don't think it has anything to do with the input files. We'll investigate and let you know as soon as we find something.

donovan-parks commented 2 years ago

Great - thank you for the quick response and for looking into the issue.

muellan commented 2 years ago

Turns out the problem is rather mundane: the latest RefSeq releases contain so many complete genomes that our default data type for storing reference sequence ids is not sufficient anymore. The current defaults only support up to 65536 reference sequences. This should have triggered an error message during the build, but the error handling for this case seems to be broken and the build just paused.

We'll change the default from 16-bit to 32-bit and fix the error handling. In the meantime, you can compile with:

make MACROS="-DMC_TARGET_ID_TYPE=uint32_t"

Note that this will increase the memory footprint of the databases slightly.
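The numbers behind this can be checked with a small sketch. It uses Python's ctypes fixed-width integers purely as a stand-in for the C++ id type; the helper names `id_capacity` and `store_id` are hypothetical, and nothing here is MetaCache code. The silent wraparound shown is simply what an unchecked 16-bit integer does, not a claim about what happened inside the stalled build.

```python
import ctypes


def id_capacity(id_type):
    """Number of distinct values a fixed-width unsigned ctypes integer can hold."""
    return 2 ** (8 * ctypes.sizeof(id_type))


def store_id(value, id_type):
    """Store value in id_type; ctypes does no overflow checking, so it wraps."""
    return id_type(value).value


print(id_capacity(ctypes.c_uint16))      # 65536
print(id_capacity(ctypes.c_uint32))      # 4294967296
print(store_id(65535, ctypes.c_uint16))  # 65535, the largest 16-bit id
print(store_id(65536, ctypes.c_uint16))  # 0, one past the limit wraps around

# Each stored id grows from 2 to 4 bytes with the wider type.
print(ctypes.sizeof(ctypes.c_uint16), ctypes.sizeof(ctypes.c_uint32))  # 2 4
```

Doubling the width of every stored id is the source of the larger footprint; how much the database grows overall depends on how many ids it stores relative to its other data.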

muellan commented 2 years ago

The latest release contains a fixed download script, the new data type defaults and updated documentation. https://github.com/muellan/metacache/releases/tag/v2.2.0

donovan-parks commented 2 years ago

Excellent - thank you for the quick response and fix.