stage-in of files to other file system in cluster jobs

cmeesters commented 2 years ago

Hi,

On many clusters the alphafold reference data needs to be stored on a parallel file system, which is particularly vulnerable to random access patterns, or is exposed via NFS to the GPU nodes where GPU jobs like alphafold run.

Alas, it is hardly possible to stage-in the entire reference database to a node-local file system or even a ramdisk, because it is too big. Permanently storing the entire database on the nodes where alphafold is supposed to run is not always an option either.

All this means severe performance penalties, because alphafold subtasks like hhblits issue many syscalls, each of which takes a long time to be answered on a busy file system.

Therefore, a nice new feature would be for alphafold, which knows which files each sub-task needs, to gain a flag designating a job-local path and to be able to copy those files from its ALPHAFOLD_DATA_DIR onto that job-local path.
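To illustrate the idea, here is a minimal sketch of such a stage-in step. It is not part of alphafold: the function name `stage_in` and its parameters are hypothetical, and it assumes the caller already knows which files a sub-task needs. If no scratch directory is given, the original paths are returned unchanged.

```python
import shutil
from pathlib import Path


def stage_in(files, data_dir, scratch_dir=None):
    """Copy `files` (all located below `data_dir`) to `scratch_dir`.

    Hypothetical helper, not an alphafold API. Returns the list of paths
    a sub-task should read from: the staged copies if `scratch_dir` is
    given, otherwise the original paths.
    """
    files = [Path(f) for f in files]
    if scratch_dir is None:
        # No job-local path requested: keep reading from the shared file system.
        return files

    data_dir = Path(data_dir)
    scratch_dir = Path(scratch_dir)
    staged = []
    for src in files:
        # Preserve the layout below data_dir so relative references still work.
        dst = scratch_dir / src.relative_to(data_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        # Skip the copy if a file of the same size is already staged.
        if not dst.exists() or dst.stat().st_size != src.stat().st_size:
            shutil.copy2(src, dst)
        staged.append(dst)
    return staged
```

Each file is read once, sequentially, from the shared file system; the random access of the search tool then hits the job-local copy.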

Kind regards,
Christian Meesters

AnyaP commented 2 years ago

Hi, thanks for this feature request.

One suggestion would be to disentangle the MSA search stage (which requires the large databases) from the modeling stage (which requires a GPU) and run them on workers with appropriate resources. There are some community efforts on this, e.g. https://parafold.sjtu.edu.cn/.

Unfortunately, given that it is related to a particular cluster / distributed file system setup, we are unlikely to work on this, so I am closing the issue for now.

cmeesters commented 2 years ago

Hi,

Well ...

> Unfortunately, given it is related to a particular cluster / distributed file system setup, we are unlikely to work on this, so I am closing the issue for now.

It would only mean providing a command-line option, e.g. --scratch-dir, that points to a directory which is only available in the job context. The option would be entirely optional and simply be ignored if not given. And the use case does not appear to be too special, considering reports like #437 and some comments in the HPC community.
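As a rough sketch of how small the user-facing change could be: the snippet below uses `argparse` only to stay self-contained, and the flag names `--data-dir` and `--scratch-dir` as well as the helper `effective_data_dir` are hypothetical, not existing alphafold options.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser with a hypothetical, optional --scratch-dir flag."""
    parser = argparse.ArgumentParser(description="stage-in sketch")
    parser.add_argument("--data-dir", type=Path, required=True,
                        help="directory holding the reference data")
    parser.add_argument("--scratch-dir", type=Path, default=None,
                        help="optional job-local directory for staged files")
    return parser


def effective_data_dir(args):
    """Return the directory the sub-tasks should read from."""
    if args.scratch_dir is None:
        # Flag not given: behave exactly as before.
        return args.data_dir
    if not args.scratch_dir.is_dir():
        # The scratch directory only exists inside the job context.
        raise ValueError(f"scratch dir not available: {args.scratch_dir}")
    return args.scratch_dir
```

Without the flag nothing changes for existing users; with it, a job script would pass whatever job-local directory the scheduler provides.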

I could try to work something out and open a pull request. This would take time, as the code is not easy to understand in this context, testing will be very time-consuming, and success is not guaranteed. So I will only attempt this if you consider it worth pursuing.

google-deepmind / alphafold

stage-in of files to other file system in cluster jobs #421