soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0

Clarification about running hhblits on a cluster against UniRef30_2020_02 database #228

Closed · gauravdiwan89 closed this issue 3 years ago

gauravdiwan89 commented 3 years ago

I am planning to run hhblits for thousands of sequences using the latest Uniclust30 release (UniRef30_2020_02) as my target database. I intend to set up the run on our computing cluster and was going through the wiki's suggestions on running hhblits efficiently on a computer cluster.

I understand that it is recommended to load the database files into the “/dev/shm” folder of the compute node on which the job is running. Although I have not tried this yet, I found that only a couple of our nodes have a virtual RAM disk (i.e. the “/dev/shm” folder) large enough to hold all the files of the Uniclust30 database (~200 GB in total). Can you please clarify whether I will only be able to run the jobs on the nodes that have 200 GB of available virtual RAM disk space, or am I missing something?

Many thanks!

milot-mirdita commented 3 years ago

I guess you can call that a luxury optimization; if you have machines with tons of memory you will see some benefit. One compromise would be to add only the _cs219.ff{data,index} files to shm and symlink the other files. That would ensure that at least the small context-state prefilter database never leaves memory. You could also use https://github.com/hoytech/vmtouch to achieve something similar.
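A minimal sketch of that compromise, assuming a hypothetical database location `/path/to/UniRef30_2020_02`, a query file `query.fasta`, and write access to `/dev/shm` on the compute node (none of these paths come from the thread): copy only the `*_cs219.ff{data,index}` files into the RAM disk, symlink the remaining database files next to them, and point `hhblits -d` at the staged prefix.

```python
import glob
import os
import shutil
import subprocess

# Hypothetical locations; adjust to your cluster's layout.
DB_DIR = "/path/to/UniRef30_2020_02_dir"   # database files on shared storage
DB_PREFIX = "UniRef30_2020_02"             # database name passed to hhblits -d
SHM_DIR = "/dev/shm/UniRef30_2020_02"      # per-node RAM disk staging area

os.makedirs(SHM_DIR, exist_ok=True)

for path in glob.glob(os.path.join(DB_DIR, DB_PREFIX + "*")):
    target = os.path.join(SHM_DIR, os.path.basename(path))
    if os.path.exists(target):
        continue  # already staged by an earlier job on this node
    if "_cs219.ff" in os.path.basename(path):
        # Keep the small column-state prefilter database resident in RAM.
        shutil.copy(path, target)
    else:
        # Leave the large a3m/hhm ffdata/ffindex files on shared storage;
        # hhblits follows the symlinks as usual.
        os.symlink(path, target)

# Run hhblits against the staged database prefix in /dev/shm.
subprocess.run(
    ["hhblits", "-i", "query.fasta", "-d", os.path.join(SHM_DIR, DB_PREFIX),
     "-o", "query.hhr", "-cpu", "4"],
    check=True,
)
```

As an alternative to copying, something like `vmtouch -t` on the `*_cs219.ff{data,index}` files should pull the same data into the page cache without consuming explicit /dev/shm space, at the cost of the kernel being free to evict it again under memory pressure.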

gauravdiwan89 commented 3 years ago

Thanks a lot for the prompt answer! That is precisely what I was asking: whether I could load only a subset of the files into shm. And thanks for the link to vmtouch; I will try it out in case the first solution isn't optimal.