leylabmpi / Struo2

Scalable creating/updating of metagenome profiling databases
MIT License

Kraken2_build step stalling #27

Open dgolden96 opened 2 years ago

dgolden96 commented 2 years ago

Hi there,

I'm continuing to troubleshoot the db-update process for a kraken2 database, and I've hit a wall at the kraken2_build step. The pipeline doesn't throw any errors; it just continues to run indefinitely (12+ hours without failure or completion). It seems similar to the problem described here: https://github.com/DerrickWood/kraken2/issues/428

So far, I've tried the workaround mentioned in the comments of that linked issue, which is to add the --fast-build flag to the kraken2 call in the db-update snakefile, but it doesn't seem to have solved the problem. Any chance you've seen this before and/or have any thoughts on what might be causing it? I definitely have enough RAM: I'm using 28 cores with 16 GB per core.

Thanks!

nick-youngblut commented 2 years ago

I've (thankfully) never experienced that issue. How many genomes are included in the build?

dgolden96 commented 2 years ago

The database to be updated is the full GTDB_release207, and the sample TSV I'm trying to add includes ~4,000 genomes

zoey-rw commented 2 years ago

A related question: if we instead passed the reads that were unclassified against GTDB to a second database (db-create with only the non-GTDB genomes), should that give results similar to a single database built via the db-update workflow? There are methods for combining outputs for the same sample from different databases, though I imagine there could be downstream effects on Bracken estimates.

nick-youngblut commented 2 years ago

The downside of a 2-step classification approach versus a 1-step is that there is no direct "competition" during classification across the 2 steps. So, some reads could be falsely classified in the 1st step when they would actually be classified as something in the 2nd step if the 2 reference databases were combined.
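
The lack of "competition" described above can be illustrated with a toy sketch. This is not how Kraken2 classifies reads (Kraken2 uses minimizers and LCA assignment on a taxonomy); it simply counts shared k-mers, and the sequences, database names, and function names are all made up for illustration.

```python
# Toy model only: a read goes to the reference sharing the most k-mers.
# All sequences and names here are hypothetical.

def kmers(seq, k=4):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify(read, db, k=4):
    """Return the reference with the most shared k-mers, or None if no hits."""
    read_kmers = kmers(read, k)
    best, best_hits = None, 0
    for name, ref in db.items():
        hits = len(read_kmers & kmers(ref, k))
        if hits > best_hits:
            best, best_hits = name, hits
    return best

def two_step(read, db1, db2):
    """Classify against db1; only reads left unclassified go to db2."""
    first = classify(read, db1)
    return first if first is not None else classify(read, db2)

def one_step(read, db1, db2):
    """Classify against the combined database, so all references compete."""
    return classify(read, {**db1, **db2})

GTDB_DB = {"gtdb_genome": "ACGTACGTTTGACCA"}
CUSTOM_DB = {"custom_genome": "ACGTTTGACCATGGC"}
READ = "TTGACCATGG"

print(two_step(READ, GTDB_DB, CUSTOM_DB))  # gtdb_genome (partial match wins by default)
print(one_step(READ, GTDB_DB, CUSTOM_DB))  # custom_genome (better match in combined db)
```

In this toy case the read shares 4 of its 7 k-mers with the first database and all 7 with the second, so the 2-step approach never gives the better reference a chance to compete.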

MixalisSn commented 11 months ago

Same problem here. I ran the kraken2 database building using 40 cores (7 GB each), and after 24 hours the process stalled at this point:

Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map already present, skipping map creation.
Estimating required capacity (step 2)...
Estimated hash table requirement: 75566900660 bytes
Capacity estimation complete. [37m21.355s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 16 bits reserved for taxid.

nick-youngblut commented 11 months ago

@MixalisSn do you think that the stalling could be due to limited memory?
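
One rough way to approach the memory question is to compare the "Estimated hash table requirement" from the log above with the memory available to the job. The sketch below uses decimal gigabytes and a `headroom` factor that is purely an assumption (it is not a Kraken2 parameter); the build also needs memory beyond the hash table itself, so passing this check does not rule out a memory problem.

```python
# Rough sanity check, not a Kraken2 feature. The headroom factor is an assumed
# safety margin; the real peak memory of a build may be higher.

GB = 1000 ** 3  # decimal gigabytes; a scheduler may count in GiB instead

def hash_table_gb(estimated_bytes):
    """Convert the byte count reported in the build log to decimal GB."""
    return estimated_bytes / GB

def fits_in_memory(estimated_bytes, available_gb, headroom=1.2):
    """True if the hash table, scaled by the assumed headroom, fits in available_gb."""
    return hash_table_gb(estimated_bytes) * headroom <= available_gb

ESTIMATED_BYTES = 75566900660  # "Estimated hash table requirement" from the log

print(round(hash_table_gb(ESTIMATED_BYTES), 1))  # 75.6
print(fits_in_memory(ESTIMATED_BYTES, 120))      # True
```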

MixalisSn commented 11 months ago

@nick-youngblut I thought 120 GB would be enough. Anyway, I added the --fast-build flag, using the same resources, and the build completed successfully.
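
For readers who want to try the same workaround outside the Struo2 snakefile, here is a minimal sketch of how the build command might be assembled. The database path and thread count are placeholders, and the helper name is hypothetical. The exact call in the Struo2 snakefile may differ. Note that, per the Kraken2 documentation, --fast-build makes multithreaded builds non-deterministic.

```python
import shlex

def kraken2_build_cmd(db_dir, threads, fast_build=True):
    """Assemble a kraken2-build command as an argument list (hypothetical helper)."""
    cmd = ["kraken2-build", "--build", "--db", db_dir, "--threads", str(threads)]
    if fast_build:
        # Trades a deterministic build for speed when using multiple threads.
        cmd.append("--fast-build")
    return cmd

# "kraken2_db" is a placeholder path; 40 matches the core count mentioned above.
cmd = kraken2_build_cmd("kraken2_db", 40)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```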