Rerun clustering with different parameters

PawelSzczerbiak commented 11 months ago

Hi, I'm trying to cluster the hclust30 database with easy-cluster using different sets of input parameters. It turns out that the most time-consuming step, taking 815h (!) on my 24-core machine, is database creation (createdb); the clustering itself (i.e. all the subsequent steps) takes "only" ~35h, which is OK. I want to rerun the computation (i.e. clustering only, without recreating the database) but, unfortunately, the DB files that were generated are incomplete. The files that I already have:

input_ca 
input_ca.dbtype 
input_ca.index 
input_ss 
input_ss.dbtype 
input_ss.index

The files that are missing (treated as temporary and removed):

input 
input.dbtype 
input.lookup
input.index 
input.source
input_h 
input_h.dbtype  
input_h.index
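A quick way to confirm which files are present is to test each expected name against the database prefix. The sketch below is only an illustration: the suffix list is copied from the two file lists in this comment and is not an authoritative description of what createdb writes, and the helper name `missing_db_files` is hypothetical.

```python
from pathlib import Path

# Suffixes copied from the file lists in this comment (an assumption, not an
# official specification). The empty suffix stands for the "input" file itself.
EXPECTED_SUFFIXES = [
    "", ".dbtype", ".index", ".lookup", ".source",
    "_h", "_h.dbtype", "_h.index",
    "_ca", "_ca.dbtype", "_ca.index",
    "_ss", "_ss.dbtype", "_ss.index",
]


def missing_db_files(prefix):
    """Return the expected database files that do not exist for this prefix."""
    prefix = Path(prefix)
    candidates = [prefix.with_name(prefix.name + s) for s in EXPECTED_SUFFIXES]
    return [str(p) for p in candidates if not p.exists()]


if __name__ == "__main__":
    import sys

    # Usage: python check_db.py /path/to/tmp/input
    for name in missing_db_files(sys.argv[1] if len(sys.argv) > 1 else "input"):
        print(name)
```

Run against the temporary directory described here, this would be expected to print the eight names from the second list.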

Is there a way to modify the script responsible for the createdb command so that it regenerates only the missing files?

martin-steinegger commented 11 months ago

Currently, createdb does not offer the functionality to select a subset of databases to be generated. The performance of createdb is largely determined by the efficiency of your I/O system. Foldseek operates by accessing all files concurrently in a multi-threaded manner by default. If your files are located on network storage, access times may be adversely affected. To mitigate this, consider transferring your PDB files to a local drive equipped with NVMe or SSD technology, which may help speed up the process.
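As a rough illustration of that advice, the sketch below copies the structure files from slow storage to a local directory and then assembles a createdb call that reads from the local copy. All paths and the helper names `stage_structures` and `createdb_command` are hypothetical; the command is only built as an argument list and is not executed.

```python
import shlex
import shutil
from pathlib import Path


def stage_structures(src_dir, local_dir):
    """Copy the structure files to local storage and return the new location."""
    dst = Path(local_dir) / Path(src_dir).name
    # dirs_exist_ok lets an interrupted copy be repeated (Python 3.8+).
    shutil.copytree(src_dir, dst, dirs_exist_ok=True)
    return dst


def createdb_command(structure_dir, db_path, threads=24):
    """Build the argument list for a createdb call (it is not run here)."""
    return ["foldseek", "createdb", str(structure_dir), str(db_path),
            "--threads", str(threads)]


if __name__ == "__main__":
    # Hypothetical locations: a network mount and a local NVMe scratch disk.
    local_copy = stage_structures("/mnt/network/pdbs", "/scratch/local")
    print(shlex.join(createdb_command(local_copy, "/scratch/local/db/input")))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would start the actual database build.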

PawelSzczerbiak commented 10 months ago

Thanks for your answer! Indeed, changing the I/O system improved the performance.

As for database reuse, I'd suggest changing the default behaviour so that temporary files are not removed, or at least leaving the choice to the user by adding a relevant argument to the easy-cluster command.

Anyway, cool software!

milot-mirdita commented 10 months ago

There is a --remove-tmp-files parameter that controls deleting the intermediate files. (However, there are also a lot of parameters… 😅)
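For readers who want to script the reruns, here is a minimal sketch of two ways to assemble the commands. The first keeps the intermediate files of an easy-cluster run by passing `--remove-tmp-files 0`. The second assumes a database was built once with createdb and calls the `cluster` module on it for each parameter set. The paths, the run names and the `-c` / `-e` values are illustrative only, and this thread does not confirm that the kept files cover every database file listed earlier.

```python
import shlex


def easy_cluster_command(structures, result_prefix, tmp_dir, keep_tmp=True, extra=()):
    """Build an easy-cluster call; keep_tmp=True asks Foldseek not to delete tmp files."""
    cmd = ["foldseek", "easy-cluster", str(structures), str(result_prefix), str(tmp_dir)]
    cmd += ["--remove-tmp-files", "0" if keep_tmp else "1"]
    cmd += list(extra)
    return cmd


def cluster_commands(db_path, out_dir, tmp_dir, param_sets):
    """Build one 'foldseek cluster' call per parameter set, all on the same database."""
    commands = []
    for name, params in param_sets.items():
        cmd = ["foldseek", "cluster", str(db_path), f"{out_dir}/{name}", str(tmp_dir)]
        for flag, value in params.items():
            cmd += [flag, str(value)]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Illustrative parameter sets; choose values that suit your data.
    sweeps = {"cov80": {"-c": 0.8}, "cov90_e001": {"-c": 0.9, "-e": 0.01}}
    print(shlex.join(easy_cluster_command("pdbs/", "res", "tmp")))
    for cmd in cluster_commands("db/input", "results", "tmp", sweeps):
        print(shlex.join(cmd))
```

Building the database once and reusing it avoids repeating the createdb step for every parameter set.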

steineggerlab / foldseek

Rerun clustering with different parameters #199