google-deepmind / alphafold

Open source code for AlphaFold.
Apache License 2.0

AF2 non-Docker in a cluster environment #339

Closed JuergenUniVie closed 2 years ago

JuergenUniVie commented 2 years ago

Hello,

Is there a way to run AlphaFold in a cluster environment with a job scheduling system (Slurm/OpenPBS)? I have several nodes with powerful GPUs available and would like to use them as well.

best wishes, Juergen

DelilahYM commented 2 years ago

We are running AlphaFold on our cluster. First, you need to set up AlphaFold on your cluster, including downloading all the databases. Second, allocate resources with a GPU through the scheduler. Third, run it (with the proper options, of course).
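For illustration only, a minimal Slurm job covering the second and third steps might look roughly like this (partition, module names and paths are placeholders that will differ on every cluster; the run_alphafold.sh flags follow the alphafold_non_docker README):

```bash
#!/bin/bash
#SBATCH --job-name=alphafold
#SBATCH --partition=gpu            # placeholder partition name
#SBATCH --gres=gpu:1               # ask the scheduler for one GPU
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=24:00:00

# Placeholder environment setup: whatever provides CUDA and the AF2 conda env on your cluster.
module load cuda/11.1
source "$HOME/miniconda3/etc/profile.d/conda.sh"
conda activate af2

# run_alphafold.sh must be started from inside the cloned alphafold repo.
cd /path/to/alphafold

bash /path/to/alphafold_non_docker/run_alphafold.sh \
  -d /path/to/AF2_DB \
  -o /path/to/output \
  -f /path/to/input/query.fasta \
  -t 2021-11-01
```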

JuergenUniVie commented 2 years ago

Dear DelilahYM,

Regarding your first point: all the databases are installed and AF2 runs without Docker. I use the AlphaFold script from https://github.com/kalininalab/alphafold_non_docker. Could you please describe how you got it to work with Slurm or OpenPBS? Did you write a script for this, or do you start it via run_alphafold.sh? What did you adapt?

many thanks and best wishes

DelilahYM commented 2 years ago

Funny thing, I am actually working on my own installation at the moment (the non-Docker version, of course) and just got it working. We use Slurm.

Install it as instructed at https://github.com/kalininalab/alphafold_non_docker and create your own conda env with the required packages. Make sure to check your CUDA version etc. so that the specific versions of the packages (jax, jaxlib) are supported. I downloaded the databases with https://github.com/kalininalab/alphafold_non_docker/blob/main/download_db.sh (do make sure you have enough storage). It takes a long time to download all the databases, so I suggest writing a Slurm script and submitting a job to run the download script.

Once you have everything downloaded, one thing I noticed is that alphafold/alphafold/common/stereo_chemical_props.txt is missing (it is supposed to be downloaded during the Docker build, if Docker is used). A previous version of AlphaFold had it in the git repository, but somehow the new version doesn't.

Once you have that, you can test with your own data. I used https://github.com/kalininalab/alphafold_non_docker/blob/main/run_alphafold.sh. To run this script you need to be in the alphafold folder that you pulled earlier during setup, so make sure you cd into it in your Slurm script if your submission script lives somewhere else. Each cluster is a little different, so the Slurm script will look different.
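As a rough sketch only (partition, walltime and paths are placeholders for your cluster; the wget URL is the one the official docker/Dockerfile uses to fetch stereo_chemical_props.txt), the download job could look like this:

```bash
#!/bin/bash
#SBATCH --job-name=af2_download
#SBATCH --partition=long           # placeholder: a CPU partition with a long walltime
#SBATCH --cpus-per-task=4
#SBATCH --time=72:00:00

set -euo pipefail

AF2_DIR=/path/to/alphafold         # placeholder: the cloned alphafold repo
DB_DIR=/path/to/AF2_DB             # placeholder: the full database set needs well over 2 TB unpacked

# Download all genetic databases and model parameters (this takes many hours).
bash /path/to/alphafold_non_docker/download_db.sh -d "${DB_DIR}"

# The non-Docker setup does not fetch stereo_chemical_props.txt; grab it the same
# way the official docker/Dockerfile does.
wget -q -P "${AF2_DIR}/alphafold/common/" \
  https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt
```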

Augustin-Zidek commented 2 years ago

We currently provide support only for running via Docker.

You can find more information about running under Singularity here: https://github.com/deepmind/alphafold/issues/10 and https://github.com/deepmind/alphafold/issues/24.
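For reference, one route discussed in those issues (a sketch only; the image tag, bind paths and the in-container path /app/alphafold are assumptions based on the official Dockerfile, and this is not an officially supported workflow) is to build the Docker image on a machine where Docker is available, convert it, and run it on the cluster with GPU support:

```bash
# On a machine where you do have Docker: build the official image, then convert it.
docker build -f docker/Dockerfile -t alphafold .
singularity build alphafold.sif docker-daemon://alphafold:latest

# On the cluster: --nv exposes the host GPU driver inside the container.
# run_alphafold.py still needs all the database-path flags that docker/run_docker.py
# normally fills in, so this only shows that the image starts; paths are placeholders.
singularity exec --nv --bind /path/to/AF2_DB:/data alphafold.sif \
  python /app/alphafold/run_alphafold.py --helpshort
```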

yijietseng commented 1 year ago

Hello,

We are trying to set up the non-Docker AF2 on our cluster using the scripts from https://github.com/kalininalab/alphafold_non_docker, but while testing we got the following error. We just want to see if any of you have suggestions on how to fix this problem.

```
I1006 07:12:34.027368 140369674590016 templates.py:857] Using precomputed obsolete pdbs ./AF2_DB/pdb_mmcif/obsolete.dat.
I1006 07:12:34.936928 140369674590016 tpu_client.py:54] Starting the local TPU driver.
I1006 07:12:34.949247 140369674590016 xla_bridge.py:212] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
I1006 07:12:35.068033 140369674590016 xla_bridge.py:212] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
I1006 07:12:41.648868 140369674590016 run_alphafold.py:376] Have 5 models: ['model_1_pred_0', 'model_2_pred_0', 'model_3_pred_0', 'model_4_pred_0', 'model_5_pred_0']
I1006 07:12:41.649096 140369674590016 run_alphafold.py:393] Using random seed 320158810403615912 for the data pipeline
I1006 07:12:41.649371 140369674590016 run_alphafold.py:161] Predicting 1TELWT3
I1006 07:12:41.685194 140369674590016 jackhmmer.py:133] Launching subprocess "/home/tseng3/miniconda3/envs/af2/bin/jackhmmer -o /dev/null -A /tmp/tmpyculok8/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 ./input/1TEL_WT3.fasta ./AF2_DB/uniref90/uniref90.fasta"
I1006 07:12:41.761307 140369674590016 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I1006 07:18:21.009910 140369674590016 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 339.248 seconds
I1006 07:18:21.044890 140369674590016 jackhmmer.py:133] Launching subprocess "/home/tseng3/miniconda3/envs/af2/bin/jackhmmer -o /dev/null -A /tmp/tmpnwqqqyqw/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 ./input/1TEL_WT3.fasta ./AF2_DB/mgnify/mgy_clusters_2018_12.fa"
I1006 07:18:21.138897 140369674590016 utils.py:36] Started Jackhmmer (mgy_clusters_2018_12.fa) query
I1006 07:24:25.972769 140369674590016 utils.py:40] Finished Jackhmmer (mgy_clusters_2018_12.fa) query in 364.833 seconds
I1006 07:24:26.036239 140369674590016 hhsearch.py:85] Launching subprocess "/home/tseng3/miniconda3/envs/af2/bin/hhsearch -i /tmp/tmpj_d_icnl/query.a3m -o /tmp/tmpj_d_icnl/output.hhr -maxseq 1000000 -d ./AF2_DB/pdb70/pdb70"
I1006 07:24:26.117509 140369674590016 utils.py:36] Started HHsearch query
I1006 07:28:41.636944 140369674590016 utils.py:40] Finished HHsearch query in 255.519 seconds
I1006 07:28:41.699008 140369674590016 hhblits.py:128] Launching subprocess "/home/tseng3/miniconda3/envs/af2/bin/hhblits -i ./input/1TEL_WT3.fasta -cpu 4 -oa3m /tmp/tmpgzc0tjn9/output.a3m -o /dev/null -n 3 -e 0.001 -maxseq 1000000 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -d ./AF2_DB/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt -d ./AF2_DB/uniclust30/uniclust30_2018_08/uniclust30_2018_08"
I1006 07:28:41.779823 140369674590016 utils.py:36] Started HHblits query
I1006 10:16:02.015240 140369674590016 utils.py:40] Finished HHblits query in 10040.235 seconds
I1006 10:16:02.124383 140369674590016 templates.py:878] Searching for template for: MGSSHHHHHHSIALPAHLRLQPIYWSRDDVAQWLKWAENEFSLSPIDSNTFEMNGKALLLLTKEDFRYRSPHSGDELYELLQHILGGGGG
I1006 10:16:02.544437 140369674590016 templates.py:267] Found an exact template match 2qar_B.
I1006 10:16:02.785476 140369674590016 templates.py:267] Found an exact template match 1sv0_B.
I1006 10:16:02.953306 140369674590016 templates.py:267] Found an exact template match 1sv4_B.
I1006 10:16:04.026005 140369674590016 templates.py:267] Found an exact template match 1sxd_A.
I1006 10:16:05.506565 140369674590016 templates.py:267] Found an exact template match 1x66_A.
I1006 10:16:07.745375 140369674590016 templates.py:267] Found an exact template match 2jv3_A.
I1006 10:16:10.091728 140369674590016 templates.py:267] Found an exact template match 2dkx_A.
I1006 10:16:11.120151 140369674590016 templates.py:267] Found an exact template match 1sxe_A.
I1006 10:16:11.294945 140369674590016 templates.py:267] Found an exact template match 1ji7_B.
I1006 10:16:11.606558 140369674590016 templates.py:267] Found an exact template match 4mhv_B.
I1006 10:16:11.765944 140369674590016 templates.py:267] Found an exact template match 2qb1_B.
I1006 10:16:12.136505 140369674590016 templates.py:267] Found an exact template match 2qb0_D.
I1006 10:16:12.348974 140369674590016 templates.py:267] Found an exact template match 5l0p_A.
I1006 10:16:14.605201 140369674590016 templates.py:267] Found an exact template match 2ytu_A.
I1006 10:16:14.614635 140369674590016 templates.py:267] Found an exact template match 5l0p_A.
I1006 10:16:15.052233 140369674590016 templates.py:267] Found an exact template match 1lky_C.
I1006 10:16:15.056710 140369674590016 templates.py:267] Found an exact template match 1sv0_C.
I1006 10:16:16.604707 140369674590016 templates.py:267] Found an exact template match 2e8p_A.
I1006 10:16:16.611046 140369674590016 templates.py:267] Found an exact template match 5l0p_A.
I1006 10:16:17.920321 140369674590016 templates.py:267] Found an exact template match 1wwu_A.
I1006 10:16:17.986045 140369674590016 pipeline.py:234] Uniref90 MSA size: 2031 sequences.
I1006 10:16:17.986166 140369674590016 pipeline.py:235] BFD MSA size: 1064 sequences.
I1006 10:16:17.986241 140369674590016 pipeline.py:236] MGnify MSA size: 24 sequences.
I1006 10:16:17.986308 140369674590016 pipeline.py:237] Final (deduplicated) MSA size: 2573 sequences.
I1006 10:16:17.986528 140369674590016 pipeline.py:239] Total number of templates (NB: this can include bad templates and is later filtered to top 4): 20.
I1006 10:16:20.240771 140369674590016 run_alphafold.py:190] Running model model_1_pred_0 on 1TEL_WT3
2022-10-06 10:16:23.663265: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /apps/cuda/10.1/lib64:/apps/cuda/10.1/nvvm/lib64:/apps/cuda/10.1/jre/lib
2022-10-06 10:16:23.687921: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices...
I1006 10:16:24.459771 140369674590016 model.py:165] Running predict with shape(feat) = {'aatype': (4, 90), 'residue_index': (4, 90), 'seq_length': (4,), 'template_aatype': (4, 4, 90), 'template_all_atom_masks': (4, 4, 90, 37), 'template_all_atom_positions': (4, 4, 90, 37, 3), 'template_sum_probs': (4, 4, 1), 'is_distillation': (4,), 'seq_mask': (4, 90), 'msa_mask': (4, 508, 90), 'msa_row_mask': (4, 508), 'random_crop_to_size_seed': (4, 2), 'template_mask': (4, 4), 'template_pseudo_beta': (4, 4, 90, 3), 'template_pseudo_beta_mask': (4, 4, 90), 'atom14_atom_exists': (4, 90, 14), 'residx_atom14_to_atom37': (4, 90, 14), 'residx_atom37_to_atom14': (4, 90, 37), 'atom37_atom_exists': (4, 90, 37), 'extra_msa': (4, 5120, 90), 'extra_msa_mask': (4, 5120, 90), 'extra_msa_row_mask': (4, 5120), 'bert_mask': (4, 508, 90), 'true_msa': (4, 508, 90), 'extra_has_deletion': (4, 5120, 90), 'extra_deletion_value': (4, 5120, 90), 'msa_feat': (4, 508, 90, 49), 'target_feat': (4, 90, 22)}
2022-10-06 10:17:03.914926: W external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /apps/cuda/10.1/lib64:/apps/cuda/10.1/nvvm/lib64:/apps/cuda/10.1/jre/lib
Traceback (most recent call last):
  File "/home/tseng3/compute/af2/alphafold-2.2.0/run_alphafold.py", line 422, in <module>
    app.run(main)
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/tseng3/compute/af2/alphafold-2.2.0/run_alphafold.py", line 398, in main
    predict_structure(
  File "/home/tseng3/compute/af2/alphafold-2.2.0/run_alphafold.py", line 198, in predict_structure
    prediction_result = model_runner.predict(processed_feature_dict,
  File "/nobackup/scratch/usr/tseng3/af2/alphafold-2.2.0/alphafold/model/model.py", line 167, in predict
    result = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 183, in reraise_with_filtered_traceback
    return fun(*args, kwargs)
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/_src/api.py", line 424, in cache_miss
    out_flat = xla.xla_call(
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/core.py", line 1560, in bind
    return call_bind(self, fun, *args, *params)
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/core.py", line 1551, in call_bind
    outs = primitive.process(top_trace, fun, tracers, params)
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/core.py", line 1563, in process
    return trace.process_call(self, fun, tracers, params)
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/core.py", line 606, in process_call
    return primitive.impl(f, tracers, params)
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/interpreters/xla.py", line 592, in _xla_call_impl
    compiled_fun = _xla_callable(fun, device, backend, name, donated_invars,
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/linear_util.py", line 262, in memoized_fun
    ans = call(fun, args)
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/interpreters/xla.py", line 723, in _xla_callable
    out_nodes = jaxpr_subcomp(
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/interpreters/xla.py", line 462, in jaxpr_subcomp
    ans = rule(c, axis_env, extend_name_stack(name_stack, eqn.primitive.name),
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/_src/lax/control_flow.py", line 350, in _while_loop_translation_rule
    new_z = xla.jaxpr_subcomp(body_c, body_jaxpr.jaxpr, backend, axis_env,
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/interpreters/xla.py", line 462, in jaxpr_subcomp
    ans = rule(c, axis_env, extend_name_stack(name_stack, eqn.primitive.name),
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/interpreters/xla.py", line 1040, in f
    outs = jaxpr_subcomp(c, jaxpr, backend, axis_env, _xla_consts(c, consts),
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/interpreters/xla.py", line 462, in jaxpr_subcomp
    ans = rule(c, axis_env, extend_name_stack(name_stack, eqn.primitive.name),
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/_src/lax/control_flow.py", line 350, in _while_loop_translation_rule
    new_z = xla.jaxpr_subcomp(body_c, body_jaxpr.jaxpr, backend, axis_env,
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/interpreters/xla.py", line 453, in jaxpr_subcomp
    ans = rule(c, in_nodes, **eqn.params)
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/_src/lax/linalg.py", line 500, in _eigh_cpu_gpu_translation_rule
    v, w, info = syevd_impl(c, operand, lower=lower)
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jaxlib/cusolver.py", line 281, in syevd
    lwork, opaque = cusolver_kernels.build_syevj_descriptor(
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: cuSolver internal error

The stack trace below excludes JAX-internal frames. The preceding is the original exception that occurred, unmodified.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/tseng3/compute/af2/alphafold-2.2.0/run_alphafold.py", line 422, in <module>
    app.run(main)
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/tseng3/compute/af2/alphafold-2.2.0/run_alphafold.py", line 398, in main
    predict_structure(
  File "/home/tseng3/compute/af2/alphafold-2.2.0/run_alphafold.py", line 198, in predict_structure
    prediction_result = model_runner.predict(processed_feature_dict,
  File "/nobackup/scratch/usr/tseng3/af2/alphafold-2.2.0/alphafold/model/model.py", line 167, in predict
    result = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
  File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jaxlib/cusolver.py", line 281, in syevd
    lwork, opaque = cusolver_kernels.build_syevj_descriptor(
RuntimeError: cuSolver internal error
```
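One observation on the log above, not an official fix: jaxlib is looking for libcusolver.so.11 (a CUDA 11 library) while LD_LIBRARY_PATH points at a CUDA 10.1 installation, which suggests a CUDA version mismatch and is consistent with the cuSolver internal error that follows. A quick sanity check inside the job environment (module names below are cluster-specific assumptions) could look like:

```bash
# Run inside the same Slurm job / conda env that runs AlphaFold:
echo "$LD_LIBRARY_PATH"                        # should contain a CUDA 11.x lib64, not 10.1
ldconfig -p | grep libcusolver                 # is libcusolver.so.11 visible at all?
python -c "import jaxlib; print(jaxlib.__version__)"
python -c "import jax; print(jax.devices())"   # should list GPU devices, not only CPU

# If CUDA comes from environment modules, swapping to a CUDA 11 module (name is
# cluster-specific) is one direction to try, e.g.:
# module unload cuda/10.1 && module load cuda/11.1
```

If the cluster really only exposes CUDA 10.1, the other direction would be installing a jaxlib build that matches the available CUDA/cuDNN versions.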

avapirev commented 1 year ago

Considering that shared HPC clusters prefer Singularity/Apptainer and other rootless container managers (to avoid granting users the root access that Docker requires), I do not see why there is no Singularity/Apptainer support. The requirements of AlphaFold are well beyond the compute capabilities of even small lab clusters, not to mention personal computers, which are where users might actually have root permissions.