muellan / metacache

Memory-efficient, fast & precise taxonomic classification system for metagenomic read mapping
GNU General Public License v3.0

Database load times highly and unexpectedly non-linear #22

Closed jdwinkler-lanzatech closed 3 years ago

jdwinkler-lanzatech commented 3 years ago

Hi,

Thanks again for your work on MetaCache. With the new version (MetaCache 2.0), database loading takes much longer than I would expect, with no signs of I/O contention in iotop. The problem seems especially bad once the loading progress exceeds 90%: going from 0 to 90% accounts for only ~20% of the load time, and the remaining 10% of the database takes the other ~80%. I am basing these percentages on the stderr log; see the attached file for a representative log.

The actual analysis seems fast once it gets going. As it stands, I think about 75% of the time per sample is spent just reading the database, but I am not sure why. Do you have any suggestions for improving sample throughput? I know completion percentages are often inaccurate, but I'm not sure how to debug further.

metacache_stderr.log
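
One way to put numbers on the non-linearity described above is to record (elapsed seconds, percent loaded) pairs while the database loads and compute how much wall time falls before and after a chosen progress threshold. The sketch below is only an illustration: the sample values are made up to mirror the roughly 20%/80% split reported here, and the function name and input format are hypothetical, not something MetaCache provides.

```python
from typing import List, Tuple


def time_share_by_progress(
    samples: List[Tuple[float, float]], split: float = 90.0
) -> Tuple[float, float]:
    """Return the fraction of total load time spent before and after `split`.

    `samples` is a list of (elapsed_seconds, percent_loaded) pairs sorted by
    time. This input format is an assumption; adapt the parsing to whatever
    the real progress log looks like.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")

    start, end = samples[0][0], samples[-1][0]
    total = end - start
    if total <= 0:
        raise ValueError("samples must span a positive amount of time")

    # First timestamp at which the reported progress reached the threshold.
    # If the threshold is never reached, all time counts as "before".
    t_split = end
    for elapsed, percent in samples:
        if percent >= split:
            t_split = elapsed
            break

    before = (t_split - start) / total
    return before, 1.0 - before


# Made-up samples that mimic the pattern described in this report.
example = [(0.0, 0.0), (10.0, 45.0), (20.0, 90.0), (100.0, 100.0)]
before, after = time_share_by_progress(example, split=90.0)
print(f"time before 90%: {before:.0%}, time after 90%: {after:.0%}")
```

Comparing these shares across runs (for example with and without a changed setting) gives a simple before/after measure that does not depend on the progress percentages being evenly spaced.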

jdwinkler-lanzatech commented 3 years ago

In case it helps, I am using a database built in July 2021 from RefSeq with archaeal/bacterial genomes.

Funatiq commented 3 years ago

Hi! I will investigate the issue tomorrow. Could you test whether reducing the load factor (e.g. "-max-load-fac 0.7") improves the loading time?

jdwinkler-lanzatech commented 3 years ago

Sure, I'll give that a shot.

jdwinkler-lanzatech commented 3 years ago

The speed appears to be about the same when using the -max-load-fac 0.7 flag versus default. Same memory usage as well, which is not what I expected from the flag description.

Funatiq commented 3 years ago

I found a bug in our code where the wrong load factor was set in query mode. It is fixed in the new release.
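
As general background on why a load factor can affect timing so strongly: in a textbook open-addressing hash table with linear probing, the expected number of probes per insertion grows roughly as (1 + 1/(1 - a)^2) / 2 for load factor a, so the cost rises sharply as the table fills. The sketch below evaluates that classic approximation. It is not MetaCache's actual hash table, and whether this effect explains the bug mentioned above is an assumption.

```python
def expected_insert_probes(load_factor: float) -> float:
    """Approximate expected probes for an insertion (unsuccessful search)
    in a linear-probing hash table at the given load factor.

    Uses the classic approximation 0.5 * (1 + 1 / (1 - a)^2).
    This models a generic textbook table, not MetaCache's implementation.
    """
    if not 0.0 <= load_factor < 1.0:
        raise ValueError("load factor must be in [0, 1)")
    return 0.5 * (1.0 + 1.0 / (1.0 - load_factor) ** 2)


# Compare a moderate load factor with progressively fuller tables.
for a in (0.5, 0.7, 0.8, 0.9, 0.95):
    print(f"load factor {a:.2f}: ~{expected_insert_probes(a):.1f} probes per insert")
```

Under this model, the last part of a fill is far more expensive per element than the first part, which is one plausible way a mis-set load factor could produce a slow tail during loading.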

jdwinkler-lanzatech commented 3 years ago

Great, thanks.

jdwinkler-lanzatech commented 3 years ago

For posterity, I just wanted to confirm that the fix works for me. Thanks again!