kalininalab / alphafold_non_docker

AlphaFold2 non-docker setup

run run_alphafold.sh error message #8

Closed: ryao-mdanderson closed this issue 2 years ago

ryao-mdanderson commented 2 years ago

Dear author:

I followed the README file and ran the following command (a CPU-only run):

$ conda activate alphafold
(alphafold) [ryao@cdragon267 ryao]$ cd alphafold
(alphafold) [ryao@cdragon267 alphafold]$ bash run_alphafold.sh -d ./alphafold_data -o ./dummy_test/ -m model_1 -f ./alphafold_non_docker/example/query.fasta -t 2020-05-14 -g False
/risapps/rhel7/python/3.7.3/envs/alphafold/lib/python3.8/site-packages/absl/flags/_validators.py:203: UserWarning: Flag --preset has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
  warnings.warn(
I0810 15:31:03.155832 46912496434880 templates.py:836] Using precomputed obsolete pdbs ./alphafold_data/pdb_mmcif/obsolete.dat.
I0810 15:31:03.363498 46912496434880 tpu_client.py:54] Starting the local TPU driver.
I0810 15:31:03.373189 46912496434880 xla_bridge.py:231] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
2021-08-10 15:31:03.374934: W external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /cm/local/apps/gcc/7.2.0/lib:/cm/local/apps/gcc/7.2.0/lib64:/rissched/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
2021-08-10 15:31:03.374958: W external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
I0810 15:31:03.375049 46912496434880 xla_bridge.py:231] Unable to initialize backend 'gpu': Failed precondition: No visible GPU devices.
I0810 15:31:03.375171 46912496434880 xla_bridge.py:231] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
W0810 15:31:03.375225 46912496434880 xla_bridge.py:234] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
I0810 15:31:03.970467 46912496434880 run_alphafold.py:259] Have 1 models: ['model_1']
I0810 15:31:03.970602 46912496434880 run_alphafold.py:272] Using random seed 2888980253009115914 for the data pipeline
I0810 15:31:03.976739 46912496434880 jackhmmer.py:130] Launching subprocess "/risapps/rhel7/python/3.7.3/envs/alphafold/bin/jackhmmer -o /dev/null -A /tmp/tmpg1fput7i/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 ./alphafold_non_docker/example/query.fasta ./alphafold_data/uniref90/uniref90.fasta"
I0810 15:31:03.989789 46912496434880 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I0810 15:38:11.871857 46912496434880 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 427.882 seconds
I0810 15:38:11.872416 46912496434880 jackhmmer.py:130] Launching subprocess "/risapps/rhel7/python/3.7.3/envs/alphafold/bin/jackhmmer -o /dev/null -A /tmp/tmpslj920ny/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 ./alphafold_non_docker/example/query.fasta ./alphafold_data/mgnify/mgy_clusters.fa"
I0810 15:38:11.894569 46912496434880 utils.py:36] Started Jackhmmer (mgy_clusters.fa) query
I0810 15:47:25.491852 46912496434880 utils.py:40] Finished Jackhmmer (mgy_clusters.fa) query in 553.597 seconds
I0810 15:47:25.492514 46912496434880 hhsearch.py:76] Launching subprocess "/risapps/rhel7/python/3.7.3/envs/alphafold/bin/hhsearch -i /tmp/tmplmbbdtny/query.a3m -o /tmp/tmplmbbdtny/output.hhr -maxseq 1000000 -d ./alphafold_data/pdb70/pdb70"
I0810 15:47:25.510776 46912496434880 utils.py:36] Started HHsearch query
I0810 15:48:42.909016 46912496434880 utils.py:40] Finished HHsearch query in 77.398 seconds
I0810 15:48:42.939602 46912496434880 hhblits.py:128] Launching subprocess "/risapps/rhel7/python/3.7.3/envs/alphafold/bin/hhblits -i ./alphafold_non_docker/example/query.fasta -cpu 4 -oa3m /tmp/tmp5sk1ch3o/output.a3m -o /dev/null -n 3 -e 0.001 -maxseq 1000000 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -d ./alphafold_data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt -d ./alphafold_data/uniclust30/uniclust30_2018_08/uniclust30_2018_08"
I0810 15:48:42.958906 46912496434880 utils.py:36] Started HHblits query

(alphafold) [ryao@cdragon267 alphafold]$
Traceback (most recent call last):
  File "/rsrch3/home/itops/ryao/alphafold/run_alphafold.py", line 302, in <module>
    app.run(main)
  File "/risapps/rhel7/python/3.7.3/envs/alphafold/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/risapps/rhel7/python/3.7.3/envs/alphafold/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/rsrch3/home/itops/ryao/alphafold/run_alphafold.py", line 276, in main
    predict_structure(
  File "/rsrch3/home/itops/ryao/alphafold/run_alphafold.py", line 126, in predict_structure
    feature_dict = data_pipeline.process(
  File "/rsrch3/home/itops/ryao/alphafold/alphafold/data/pipeline.py", line 173, in process
    hhblits_bfd_uniclust_result = self.hhblits_bfd_uniclust_runner.query(
  File "/rsrch3/home/itops/ryao/alphafold/alphafold/data/tools/hhblits.py", line 133, in query
    stdout, stderr = process.communicate()
  File "/risapps/rhel7/python/3.7.3/envs/alphafold/lib/python3.8/subprocess.py", line 1024, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/risapps/rhel7/python/3.7.3/envs/alphafold/lib/python3.8/subprocess.py", line 1866, in _communicate
    ready = selector.select(timeout)
  File "/risapps/rhel7/python/3.7.3/envs/alphafold/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt

It exited. I ran this command in an HPC environment on a compute node. Could you suggest a possible cause for this situation?

Thanks!

sanjaysrikakulam commented 2 years ago

Hi @ryao-mdanderson

I am unable to find any error message here; all I can see is a KeyboardInterrupt (for example, from pressing Ctrl+C while the program is running). Can you please re-run it and see if it produces any error?

ryao-mdanderson commented 2 years ago

@sanjaysrikakulam 👍 Thank you very much for your help so far; these are all very useful tips. I really appreciate it!

My rerun hit the same error message. Since you said there should not be such a case, I realized the problem is on the HPC side: when I submitted the job to the compute node, I originally requested only 1 CPU core and 8 GB of memory, which was not enough for this test run.

Looking into the output, the pipeline invokes subprocesses (for example, jackhmmer is launched with --cpu 8), so at least 8 CPUs are needed. I did a simple test to verify this, and now the test case works. Thank you so much!
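In case it helps others, here is a minimal sketch of what a sufficient job request could look like on an LSF cluster like ours (the queue name, memory value, and memory units are assumptions; they depend on the site's LSF configuration):

```bash
# Hypothetical LSF submission sketch, not taken from the README.
# The pipeline launches jackhmmer with --cpu 8 and hhblits with -cpu 4,
# so requesting fewer than 8 cores starves the MSA stage.
# Memory units for rusage[] depend on the cluster's LSF settings.
bsub -q medium -n 8 -R "span[hosts=1] rusage[mem=8000]" -o af2.%J.log \
    bash run_alphafold.sh \
        -d ./alphafold_data \
        -o ./dummy_test/ \
        -m model_1 \
        -f ./alphafold_non_docker/example/query.fasta \
        -t 2020-05-14 \
        -g False
```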

zhoujingyu13687306871 commented 2 years ago

Dear author:

I submitted the script on a cluster and requested 2 GPUs:

sbatch --gpus=2 ./run_alphafold.sh -d /data/public/alphafold2 -o /data/home/scv0002/run/zhou/mutil2 -m model_1 -f ../INS_BOVIN.fasta -t 2020-05-14 -n 16 -a 0,1

but I found that only one GPU is used (even with -a 0,1); the other one stays idle. So I need your help: how should I set the GPU parameters so that both GPUs are used? @sanjaysrikakulam
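For reference, this is one way to watch per-GPU utilization on the compute node while the job runs (assuming nvidia-smi is available there):

```bash
# Print index, name, utilization and memory for every visible GPU every 5 s;
# with -a 0,1 both devices are listed, but only the busy one shows
# non-zero utilization.
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used --format=csv -l 5
```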

sanjaysrikakulam commented 2 years ago

Hi @zhoujingyu13687306871

Please refer to this ticket: https://github.com/kalininalab/alphafold_non_docker/issues/10

zhoujingyu13687306871 commented 2 years ago

> Hi @zhoujingyu13687306871
>
> Please refer to this ticket: #10

Thanks, but I didn't find an answer there.

sanjaysrikakulam commented 2 years ago

Our non-docker setup script is only a wrapper around AF2. Our script makes sure to present both GPUs to AF2, and it is up to AF2 whether to use them; we do not modify the AF2 codebase. Please raise a ticket in the AF2 GitHub repo or follow the discussions linked in the above ticket, as this has nothing to do with our non-docker setup. AF2 may or may not use multiple GPUs at once.
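Roughly speaking, all the wrapper does for GPU selection is make the requested devices visible to the process, along these lines (a simplified sketch, not the exact contents of run_alphafold.sh):

```bash
# Simplified sketch of the GPU handover (assumption: the actual script
# differs in detail). The devices passed via -a are exported so that
# JAX/TensorFlow inside AF2 can see them; whether AF2 spreads work across
# more than one visible GPU is decided by AF2 itself.
export CUDA_VISIBLE_DEVICES="0,1"   # value of the -a flag
python run_alphafold.py "$@"        # AF2 sees both GPUs but may use only one
```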

zhoujingyu13687306871 commented 2 years ago

> Our non-docker setup script is only a wrapper around AF2. Our script makes sure to present both GPUs to AF2, and it is up to AF2 whether to use them; we do not modify the AF2 codebase. Please raise a ticket in the AF2 GitHub repo or follow the discussions linked in the above ticket, as this has nothing to do with our non-docker setup. AF2 may or may not use multiple GPUs at once.

@sanjaysrikakulam OK, I got it. Thank you very much!