Keep DB in memory between runs

pirovc / ganon

ganon2 classifies genomic sequences against large sets of references efficiently, with integrated download and update of databases (refseq/genbank), taxonomic profiling (ncbi/gtdb), binning and hierarchical classification, customized reporting and more

https://pirovc.github.io/ganon/

MIT License


Keep DB in memory between runs #185

Closed donovan-h-parks closed 2 years ago

donovan-h-parks commented 2 years ago

Hi.

Is there a way to keep the Ganon DB in memory between running the classify method on different samples? At least for my use case, the majority of time is spent loading the DB into memory. I appreciate I could combine all my samples into a single file, but this makes for a rather awkward workflow and a lot of extra post-processing of results.

Thanks.

pirovc commented 2 years ago

Hi @dparks1134

Not directly, unfortunately. One could use a ramfs to speed up the database loading (#92), something like:

mkdir ~/memorymap
sudo mount -t ramfs none ~/memorymap
cp ganon_database.* ~/memorymap/
ganon classify --db-prefix ~/memorymap/ganon_database ...

This should speed up loading times, and it works with the current version of ganon (v1.0.0).

I have already heard about this problem from other users and think it would be nice to have it integrated, so I am marking this as an enhancement. However, it may be non-trivial to implement when hierarchical databases are used.
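Since ramfs keeps its files in RAM, has no size limit and is never swapped out, it may be worth checking that the database files fit into available memory before copying them. A minimal sketch, assuming a Linux system with /proc/meminfo; the function names db_size_bytes, mem_available_kb and db_fits_in_memory are hypothetical helpers and not part of ganon:

```shell
# Hypothetical helpers (not part of ganon); assumes Linux with /proc/meminfo.

# Print the total size in bytes of all regular files matching "<prefix>.*".
db_size_bytes() {
    local prefix total f size
    prefix="$1"
    total=0
    for f in "$prefix".*; do
        [ -f "$f" ] || continue          # skip unmatched globs and directories
        size=$(wc -c < "$f")
        total=$((total + size))
    done
    echo "$total"
}

# Print the kernel's estimate of available memory in kB.
mem_available_kb() {
    awk '/^MemAvailable:/ { print $2 }' /proc/meminfo
}

# Succeed only if the database files are smaller than the available memory.
db_fits_in_memory() {
    local need avail_kb
    need=$(db_size_bytes "$1")
    avail_kb=$(mem_available_kb)
    [ "$need" -lt $((avail_kb * 1024)) ]
}
```

It could be used as `db_fits_in_memory ganon_database && cp ganon_database.* ~/memorymap/`. The check is only a rough guard: it does not account for the memory ganon classify itself needs while running.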

donovan-h-parks commented 2 years ago

Thanks - I'll give it a try. Feel free to close this unless it is helpful to keep it as an open enhancement.

rjsorr commented 2 years ago

Hi @pirovc, I gave this a try and got Error code: -9 using v1.1.1:

conda activate python3.7_environment
sudo mkdir /memorymap
sudo mount -t ramfs none /memorymap
sudo cp /media/ubuntu/Elements/reference_genomes/ganon/ARC_refseq_ALL_db/ARC_refseq_ALL_db. /memorymap/ &
sudo cp /media/ubuntu/Elements/reference_genomes/ganon/BAC_refseq_ALL_db/BAC_refseq_ALL_db. /memorymap/ &
sudo cp /media/ubuntu/Elements/reference_genomes/ganon/EUK_refseq_CG_db/EUK_refseq_CG_db. /memorymap/ &
sudo cp /media/ubuntu/Elements/reference_genomes/ganon/VIRAL_refseq_ALL_db/VIRAL_refseq_ALL_db. /memorymap/ &
for i in *_1_val_1.fq.gz; do
    b=${i%%_1_val_1.fq.gz}
    ganon classify -d /memorymap/ARC_refseq_ALL_db \
        /memorymap/BAC_refseq_ALL_db \
        /memorymap/EUK_refseq_CG_db \
        /memorymap/VIRAL_refseq_ALL_db \
        -p "$b"_1_val_1.fq.gz "$b"_2_val_2.fq.gz \
        -o "$b"_ganon_results --output-lca --output-unclassified -t 28
done &
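In the script above, the cp commands are sent to the background with & and the classification loop starts right away, so ganon may begin reading database files that are still being copied. Whether that caused the error is not established in this thread. A return code of -9 usually means the process was killed with signal 9 (SIGKILL), which on Linux is often the out-of-memory killer, but that is an assumption here as well. The sketch below shows one way to copy in parallel and still block until every copy has finished; copy_dbs_and_wait is a hypothetical helper, not a ganon command:

```shell
# Hypothetical helper (not part of ganon).
# Usage: copy_dbs_and_wait DEST_DIR PREFIX [PREFIX ...]
# Copies all files matching "<PREFIX>.*" into DEST_DIR, one background cp
# per prefix, and returns only after every background copy has exited.
copy_dbs_and_wait() {
    local dest prefix
    dest="$1"
    shift
    for prefix in "$@"; do
        cp "$prefix".* "$dest"/ &
    done
    wait    # with no arguments, waits for all background jobs of this shell
}
```

Calling the helper before the for loop ensures the loop only starts once the databases are complete in the target directory. Note that a bare `wait` does not report whether an individual cp failed; collecting the PIDs and waiting on each one would be needed for that.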

pirovc commented 2 years ago

Hi @rjsorr. Does the same command work with the database files on a "normal" disk, without using ramfs? I tested ganon classify with multiple databases in the ramfs and it works just fine for me.

rjsorr commented 2 years ago

This works, if that is what you mean:

cd /media/ubuntu/Elements/NEWPIPELINE_MetaAIR/RAW_DATA/neg_pos/TG_out
conda activate python3.7_environment
for i in *_1_val_1.fq.gz; do
    b=${i%%_1_val_1.fq.gz}
    ganon classify -d /media/ubuntu/Elements/reference_genomes/ganon/ARC_refseq_ALL_db/ARC_refseq_ALL_db \
        /media/ubuntu/Elements/reference_genomes/ganon/BAC_refseq_ALL_db/BAC_refseq_ALL_db \
        /media/ubuntu/Elements/reference_genomes/ganon/EUK_refseq_CG_db/EUK_refseq_CG_db \
        /media/ubuntu/Elements/reference_genomes/ganon/VIRAL_refseq_ALL_db/VIRAL_refseq_ALL_db \
        -p "$b"_1_val_1.fq.gz "$b"_2_val_2.fq.gz \
        -o "$b"_ganon_results --output-lca -t 28 --verbose > "$b"_ganon_classify.log 2>&1
done &

pirovc commented 2 years ago

Unfortunately, this appears to be a problem on your side. I re-created the scenario here with several databases in the ramfs and the same parameters (with version 1.1.1) and it just works. Sequential execution of ganon commands with the same set of databases is automatically faster on modern systems due to caching, but loading does still take some time. In any case, the ramfs approach is only meant as a workaround. An integrated batch execution function for ganon classify is planned, but I cannot guarantee when it will be available.
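The caching effect mentioned above can also be triggered explicitly by reading the database files once before the first run, so that the operating system may keep them in its page cache. A minimal sketch, assuming the files fit into free memory; warm_page_cache is a hypothetical helper and not part of ganon:

```shell
# Hypothetical helper (not part of ganon).
# Usage: warm_page_cache PREFIX [PREFIX ...]
# Reads every regular file matching "<PREFIX>.*" once and discards the data,
# then prints the number of files that were read.
warm_page_cache() {
    local prefix f count
    count=0
    for prefix in "$@"; do
        for f in "$prefix".*; do
            if [ -f "$f" ]; then
                cat "$f" > /dev/null
                count=$((count + 1))
            fi
        done
    done
    echo "$count"
}
```

Whether the files stay cached until ganon classify runs depends on how much memory is free and on what else the machine is doing, so any speed-up is not guaranteed.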