Closed PawelSzczerbiak closed 7 months ago
Currently, createdb does not offer the functionality to select a subset of databases to be generated. The performance of createdb is largely determined by the efficiency of your I/O system. Foldseek operates by accessing all files concurrently in a multi-threaded manner by default. If your files are located on network storage, access times may be adversely affected. To mitigate this, consider transferring your PDB files to a local drive equipped with NVMe or SSD technology, which may help speed up the process.
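A minimal sketch of that suggestion, assuming the structures currently sit on a network mount and a local NVMe/SSD scratch directory is available. All paths and the database name are placeholders, not values from this thread. The commands are only printed here, so they can be checked before anything is run:

```shell
# Placeholder locations: adjust to your own setup.
SRC=/mnt/network/pdbs      # hypothetical network storage holding the PDB files
LOCAL=/scratch/pdbs        # hypothetical directory on a local NVMe/SSD drive
DB=/scratch/hclust30DB     # hypothetical name for the resulting database

# Step 1: copy the structures to the local drive.
COPY_CMD="rsync -a $SRC/ $LOCAL/"

# Step 2: build the database from the local copy.
CREATEDB_CMD="foldseek createdb $LOCAL $DB"

# Print both commands for review; run each with: eval "$COPY_CMD" etc.
echo "$COPY_CMD"
echo "$CREATEDB_CMD"
```

Building the database as a separate step like this also leaves it on disk independently of any later clustering run.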
Thanks for your answer! Indeed, changing the I/O system improved the performance.
As for database reuse, I'd suggest changing the default behaviour so that temporary files are not removed, or at least leaving the choice to the user by adding a relevant argument to the easy-cluster command.
Anyway, cool software!
There is a --remove-tmp-files parameter that controls deleting the intermediate files. (However, there are also a lot of parameters… 😅)
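A rough sketch of how that could look on the command line. It assumes the flag takes a 0/1 value, with 0 meaning "keep the intermediate files"; please confirm with `foldseek easy-cluster -h` for your version. The input, output and tmp names are placeholders:

```shell
# Placeholder names: adjust to your own setup.
INPUT=hclust30_pdbs   # hypothetical directory with the input structures
OUT=hclust30_clu      # hypothetical output prefix
TMP=tmp               # working directory where intermediate files are written

# Assumption: passing 0 disables deletion of the intermediate files.
CMD="foldseek easy-cluster $INPUT $OUT $TMP --remove-tmp-files 0"

# Print the command for review; run it with: eval "$CMD"
echo "$CMD"
```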
Hi, I'm trying to cluster the hclust30 database with easy-cluster using different sets of input parameters. It turns out that the most time-consuming step, taking 815h (!) on my 24 CPU core machine, is DB creation (createdb); clustering itself (i.e. all the subsequent steps) lasts "only" ~35h, which is OK. I want to rerun the computations (i.e. clustering only, without recreating the database) but, unfortunately, the generated DB files are not complete.
The files that I already have:
The files that are missing (treated as temporary and removed):
Is there a chance to modify the script responsible for the createdb command to regenerate only the missing files?