soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0

split database for the storage random access limitation problem #281

Open baoxingsong opened 2 years ago

baoxingsong commented 2 years ago

Dear hh-suite developers,

Thanks for creating such a wonderful tool. We are trying to align 750,000 protein sequences using hhblits on a compute cluster. However, we are running into the storage random-access limitation described in the documentation. Since the cluster is not under my control, I cannot install SSDs there, and no node of the cluster has enough RAM to hold the database.

We are wondering: is there any way to split the database into smaller databases, so that we could put each smaller database into /dev/shm, align the query sequences against each of the smaller databases separately, and merge the results together? Would that produce an identical result?
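For concreteness, the workflow being asked about would look something like the sketch below, assuming the database had already been split into shards; the shard names are hypothetical, and whether such a split can even be produced is exactly the open question:

```sh
# Hypothetical workflow: search each pre-made shard from /dev/shm in turn.
# Shard names (database_partN) and /path/to are placeholders.
for part in database_part1 database_part2 database_part3; do
    cp /path/to/"${part}"_*.ff{data,index} /dev/shm/
    hhblits -i query.fasta -d /dev/shm/"$part" -o "${part}.hhr"
    rm /dev/shm/"${part}"_*.ff{data,index}
done
# The per-shard .hhr hit lists would then have to be merged. Note that
# E-values depend on database size, so a merged result would not be
# identical to a single search against the full database.
```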

baoxingsong commented 2 years ago

If it is not possible to split the database randomly, could it perhaps be split according to clusters of similar sequences?

milot-mirdita commented 2 years ago

Splitting the database probably won't work well, for multiple reasons.

A better solution is probably to keep only the cs219 files on fast storage (a local non-NFS disk or even /dev/shm); that part of the database is comparatively tiny and the most important for performance. It would still be good if the a3m/hhm databases were on a local disk or SSD rather than network storage.

You can do something like:

cp path_to/database_cs219.ff{data,index} /dev/shm
cd /dev/shm
# note: path_to must be an absolute path here, or the symlinks
# created in /dev/shm below will dangle
ln -s path_to/database_a3m.ffdata
ln -s path_to/database_a3m.ffindex
ln -s path_to/database_hhm.ffdata
ln -s path_to/database_hhm.ffindex
hhblits -d /dev/shm/database ...
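On a shared cluster, this staging step might be wrapped in a job script along the following lines; a minimal sketch, in which the database prefix, query file, and thread count are placeholders, and the per-user /dev/shm subdirectory and trap cleanup are additions for running under a scheduler, not part of the suggestion above:

```sh
#!/bin/bash
set -euo pipefail

DB=/path/to/database          # placeholder: absolute path prefix of the database
SHM=/dev/shm/$USER            # per-user directory avoids collisions with other jobs

mkdir -p "$SHM"
trap 'rm -rf "$SHM"' EXIT     # free the RAM-backed files when the job ends

# Stage only the small cs219 prefilter files into RAM-backed storage
cp "${DB}_cs219.ffdata" "${DB}_cs219.ffindex" "$SHM/"

# Symlink the large a3m/hhm files so hhblits finds all parts under one prefix
for suffix in a3m.ffdata a3m.ffindex hhm.ffdata hhm.ffindex; do
    ln -s "${DB}_${suffix}" "$SHM/$(basename "$DB")_${suffix}"
done

hhblits -i query.fasta -d "$SHM/$(basename "$DB")" -o query.hhr -cpu 4
```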
baoxingsong commented 2 years ago

Thanks! Doing this made it significantly faster.