Hi @ryao-mdanderson
I am unable to find any error message here; all I can see is a KeyboardInterrupt (raised, for example, by pressing Ctrl+C while the program is running). Can you please re-run it once more and see if it produces any error?
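If it helps, you could capture the complete output of the next run in a file so the full log is preserved. A suggested command (the log file name is arbitrary):

```bash
# Keep a copy of stdout and stderr while still printing to the terminal
bash run_alphafold.sh -d ./alphafold_data -o ./dummy_test/ -m model_1 \
    -f ./alphafold_non_docker/example/query.fasta -t 2020-05-14 -g False \
    2>&1 | tee af2_run.log
```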
@sanjaysrikakulam 👍 Thank you very much for your help so far; these are all very useful tips and I really appreciate it!
My rerun hit the same error. Once you said there should be no such error, I realized the problem was on the HPC side: I had submitted the job to a compute node requesting only 1 CPU core and 8 GB of memory, which was not enough for this test run.
Looking into the log, the pipeline launches subprocesses that need at least 8 CPUs (jackhmmer, for instance, is started with --cpu 8). I did a simple test to verify this, and the test case now works. Thank you so much!
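For anyone hitting the same issue, here is a minimal sketch of a batch script requesting enough resources (SLURM shown, as sbatch comes up later in this thread; the job name, memory, and time limit are placeholders, not benchmarked values):

```bash
#!/bin/bash
#SBATCH --job-name=af2_test   # placeholder job name
#SBATCH --cpus-per-task=8     # jackhmmer is launched with --cpu 8
#SBATCH --mem=32G             # placeholder; 1 core / 8 GB was not enough here
#SBATCH --time=24:00:00       # placeholder time limit

# Assumes conda is initialised in the batch environment
conda activate alphafold
bash run_alphafold.sh -d ./alphafold_data -o ./dummy_test/ -m model_1 \
    -f ./alphafold_non_docker/example/query.fasta -t 2020-05-14 -g False
```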
Dear author:
I submitted the script on a cluster and requested 2 GPUs: sbatch --gpus=2 ./run_alphafold.sh -d /data/public/alphafold2 -o /data/home/scv0002/run/zhou/mutil2 -m model_1 -f ../INS_BOVIN.fasta -t 2020-05-14 -n 16 -a 0,1
However, only one GPU is being used (despite -a 0,1); the other one stays idle. I need your help: how should I set the GPU parameters so that both are used? @sanjaysrikakulam
Hi @zhoujingyu13687306871
Please refer to this ticket: https://github.com/kalininalab/alphafold_non_docker/issues/10
Thanks, but I didn't find an answer there.
Our non-docker setup script is only a wrapper around AF2. The script makes sure both GPUs are presented to AF2; whether they are actually used is up to AF2, since we do not modify the AF2 codebase. Please raise a ticket in the AF2 GitHub repo or follow the discussions linked in the ticket above, as this has nothing to do with our non-docker setup: AF2 may or may not use multiple GPUs at once.
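In other words, the -a flag only controls which devices are visible to AF2. Roughly, it amounts to the following (a paraphrased sketch, not an exact excerpt from run_alphafold.sh):

```bash
# Paraphrased sketch: the value passed via -a becomes the visible GPU set
gpu_devices="0,1"                           # from -a 0,1
export CUDA_VISIBLE_DEVICES="$gpu_devices"  # both GPUs are visible to AF2/JAX

# You can confirm inside the job that both devices are visible:
nvidia-smi
```

Whether both visible GPUs are actually exercised is decided inside AF2/JAX, not by the wrapper.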
@sanjaysrikakulam OK, I got it, thank you very much.
Dear author:
I followed the README file and ran the following command (a CPU-only run):
$ conda activate alphafold
(alphafold) [ryao@cdragon267 ryao]$ cd alphafold
(alphafold) [ryao@cdragon267 alphafold]$ bash run_alphafold.sh -d ./alphafold_data -o ./dummy_test/ -m model_1 -f ./alphafold_non_docker/example/query.fasta -t 2020-05-14 -g False
/risapps/rhel7/python/3.7.3/envs/alphafold/lib/python3.8/site-packages/absl/flags/_validators.py:203: UserWarning: Flag --preset has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
warnings.warn(
I0810 15:31:03.155832 46912496434880 templates.py:836] Using precomputed obsolete pdbs ./alphafold_data/pdb_mmcif/obsolete.dat.
I0810 15:31:03.363498 46912496434880 tpu_client.py:54] Starting the local TPU driver.
I0810 15:31:03.373189 46912496434880 xla_bridge.py:231] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
2021-08-10 15:31:03.374934: W external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /cm/local/apps/gcc/7.2.0/lib:/cm/local/apps/gcc/7.2.0/lib64:/rissched/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
2021-08-10 15:31:03.374958: W external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
I0810 15:31:03.375049 46912496434880 xla_bridge.py:231] Unable to initialize backend 'gpu': Failed precondition: No visible GPU devices.
I0810 15:31:03.375171 46912496434880 xla_bridge.py:231] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
W0810 15:31:03.375225 46912496434880 xla_bridge.py:234] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
I0810 15:31:03.970467 46912496434880 run_alphafold.py:259] Have 1 models: ['model_1']
I0810 15:31:03.970602 46912496434880 run_alphafold.py:272] Using random seed 2888980253009115914 for the data pipeline
I0810 15:31:03.976739 46912496434880 jackhmmer.py:130] Launching subprocess "/risapps/rhel7/python/3.7.3/envs/alphafold/bin/jackhmmer -o /dev/null -A /tmp/tmpg1fput7i/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 ./alphafold_non_docker/example/query.fasta ./alphafold_data/uniref90/uniref90.fasta"
I0810 15:31:03.989789 46912496434880 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I0810 15:38:11.871857 46912496434880 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 427.882 seconds
I0810 15:38:11.872416 46912496434880 jackhmmer.py:130] Launching subprocess "/risapps/rhel7/python/3.7.3/envs/alphafold/bin/jackhmmer -o /dev/null -A /tmp/tmpslj920ny/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 ./alphafold_non_docker/example/query.fasta ./alphafold_data/mgnify/mgy_clusters.fa"
I0810 15:38:11.894569 46912496434880 utils.py:36] Started Jackhmmer (mgy_clusters.fa) query
I0810 15:47:25.491852 46912496434880 utils.py:40] Finished Jackhmmer (mgy_clusters.fa) query in 553.597 seconds
I0810 15:47:25.492514 46912496434880 hhsearch.py:76] Launching subprocess "/risapps/rhel7/python/3.7.3/envs/alphafold/bin/hhsearch -i /tmp/tmplmbbdtny/query.a3m -o /tmp/tmplmbbdtny/output.hhr -maxseq 1000000 -d ./alphafold_data/pdb70/pdb70"
I0810 15:47:25.510776 46912496434880 utils.py:36] Started HHsearch query
I0810 15:48:42.909016 46912496434880 utils.py:40] Finished HHsearch query in 77.398 seconds
I0810 15:48:42.939602 46912496434880 hhblits.py:128] Launching subprocess "/risapps/rhel7/python/3.7.3/envs/alphafold/bin/hhblits -i ./alphafold_non_docker/example/query.fasta -cpu 4 -oa3m /tmp/tmp5sk1ch3o/output.a3m -o /dev/null -n 3 -e 0.001 -maxseq 1000000 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -d ./alphafold_data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt -d ./alphafold_data/uniclust30/uniclust30_2018_08/uniclust30_2018_08"
I0810 15:48:42.958906 46912496434880 utils.py:36] Started HHblits query
(alphafold) [ryao@cdragon267 alphafold]$
Traceback (most recent call last):
File "/rsrch3/home/itops/ryao/alphafold/run_alphafold.py", line 302, in <module>
app.run(main)
File "/risapps/rhel7/python/3.7.3/envs/alphafold/lib/python3.8/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/risapps/rhel7/python/3.7.3/envs/alphafold/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/rsrch3/home/itops/ryao/alphafold/run_alphafold.py", line 276, in main
predict_structure(
File "/rsrch3/home/itops/ryao/alphafold/run_alphafold.py", line 126, in predict_structure
feature_dict = data_pipeline.process(
File "/rsrch3/home/itops/ryao/alphafold/alphafold/data/pipeline.py", line 173, in process
hhblits_bfd_uniclust_result = self.hhblits_bfd_uniclust_runner.query(
File "/rsrch3/home/itops/ryao/alphafold/alphafold/data/tools/hhblits.py", line 133, in query
stdout, stderr = process.communicate()
File "/risapps/rhel7/python/3.7.3/envs/alphafold/lib/python3.8/subprocess.py", line 1024, in communicate
stdout, stderr = self._communicate(input, endtime, timeout)
File "/risapps/rhel7/python/3.7.3/envs/alphafold/lib/python3.8/subprocess.py", line 1866, in _communicate
ready = selector.select(timeout)
File "/risapps/rhel7/python/3.7.3/envs/alphafold/lib/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt
It exited there. I ran this command in an HPC environment on a compute node. Can you suggest a possible cause for this?
Thanks!