kalininalab / alphafold_non_docker

AlphaFold2 non-docker setup

CUDA runtime implicit initialization on GPU:0 failed. Status: unrecognized error code #33

Closed Feng-Zhang closed 2 months ago

Feng-Zhang commented 2 years ago

I followed the guide for everything except the jax installation, because that step raised ValueError: jaxlib is version 0.1.69, but this version of jax requires version 0.1.74. We therefore ran pip3 install --upgrade jax jaxlib>=0.1.69+cuda111 -f https://storage.googleapis.com/jax-releases/jax_releases.html to update it.
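
A minimal sketch for checking what that upgrade actually left installed. One caveat, assuming the command was typed exactly as shown and without quotes: a POSIX shell treats `>=0.1.69+cuda111` as an output redirection, so pip would only receive `jax jaxlib` and could pull a build without CUDA support. The helper names below are hypothetical, and the check for a `+cuda` suffix assumes the CUDA wheels from that release index carry such a local version tag.

```python
from importlib import metadata


def release_tuple(version):
    """Turn '0.1.69+cuda111' into (0, 1, 69); handles plain numeric releases only."""
    public = version.split("+", 1)[0]  # drop the local '+cuda111' tag
    return tuple(int(part) for part in public.split("."))


def meets_minimum(installed, required):
    """True if the installed release is at least the required one."""
    return release_tuple(installed) >= release_tuple(required)


def is_cuda_build(version):
    """Assumption: CUDA wheels are tagged with a '+cuda...' local version."""
    return "+cuda" in version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("jax", "jaxlib"):
        found = installed_version(name)
        cuda = is_cuda_build(found) if found else False
        print(f"{name}: {found} (CUDA build: {cuda})")
```

With the versions from the error message, `meets_minimum("0.1.69", "0.1.74")` is false, which matches the ValueError above.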

After that, python run_alphafold_test.py ran without problems, but bash run_alphafold.sh -d ../database-dir/ -o ../work-dir/ -f ../work-dir/T1050.fasta -t 2020-05-14 failed with the error shown below:

$ bash run_alphafold.sh -d ../database-dir/ -o ../work-dir/ -f ../work-dir/T1050.fasta -t 2020-05-14
2022-01-24 11:28:29.659453: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
I0124 11:28:31.304372 140384586483520 templates.py:857] Using precomputed obsolete pdbs ../database-dir//pdb_mmcif/obsolete.dat.
I0124 11:28:31.476207 140384586483520 xla_bridge.py:244] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker: 
I0124 11:28:31.641233 140384586483520 xla_bridge.py:244] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
I0124 11:28:37.127628 140384586483520 run_alphafold.py:384] Have 5 models: ['model_1', 'model_2', 'model_3', 'model_4', 'model_5']
I0124 11:28:37.127787 140384586483520 run_alphafold.py:397] Using random seed 324245886155445948 for the data pipeline
I0124 11:28:37.127987 140384586483520 run_alphafold.py:150] Predicting T1050
I0124 11:28:37.128321 140384586483520 jackhmmer.py:130] Launching subprocess "/home/aaron/bin/miniconda3/envs/alphafold_conda/bin/jackhmmer -o /dev/null -A /tmp/tmpomb2yx3m/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 ../work-dir/T1050.fasta ../database-dir//uniref90/uniref90.fasta"
I0124 11:28:37.304794 140384586483520 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I0124 11:33:27.010351 140384586483520 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 289.705 seconds
I0124 11:33:35.188940 140384586483520 jackhmmer.py:130] Launching subprocess "/home/aaron/bin/miniconda3/envs/alphafold_conda/bin/jackhmmer -o /dev/null -A /tmp/tmp_s62488h/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 ../work-dir/T1050.fasta ../database-dir//mgnify/mgy_clusters_2018_12.fa"
I0124 11:33:35.408562 140384586483520 utils.py:36] Started Jackhmmer (mgy_clusters_2018_12.fa) query
I0124 11:38:51.483255 140384586483520 utils.py:40] Finished Jackhmmer (mgy_clusters_2018_12.fa) query in 316.074 seconds
I0124 11:39:20.663236 140384586483520 hhsearch.py:85] Launching subprocess "/home/aaron/bin/miniconda3/envs/alphafold_conda/bin/hhsearch -i /tmp/tmp_t0wr8md/query.a3m -o /tmp/tmp_t0wr8md/output.hhr -maxseq 1000000 -d ../database-dir//pdb70/pdb70"
I0124 11:39:20.892679 140384586483520 utils.py:36] Started HHsearch query
I0124 11:40:41.805876 140384586483520 utils.py:40] Finished HHsearch query in 80.913 seconds
I0124 11:42:19.997202 140384586483520 hhblits.py:128] Launching subprocess "/home/aaron/bin/miniconda3/envs/alphafold_conda/bin/hhblits -i ../work-dir/T1050.fasta -cpu 4 -oa3m /tmp/tmpdlpm19oi/output.a3m -o /dev/null -n 3 -e 0.001 -maxseq 1000000 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -d ../database-dir//bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt -d ../database-dir//uniclust30/uniclust30_2018_08/uniclust30_2018_08"
I0124 11:42:20.263012 140384586483520 utils.py:36] Started HHblits query
I0124 12:37:39.983841 140384586483520 utils.py:40] Finished HHblits query in 3319.720 seconds
I0124 12:37:40.498109 140384586483520 templates.py:878] Searching for template for: MASQSYLFKHLEVSDGLSNNSVNTIYKDRDGFMWFGTTTGLNRYDGYTFKIYQHAENEPGSLPDNYITDIVEMPDGRFWINTARGYVLFDKERDYFITDVTGFMKNLESWGVPEQVFVDREGNTWLSVAGEGCYRYKEGGKRLFFSYTEHSLPEYGVTQMAECSDGILLIYNTGLLVCLDRATLAIKWQSDEIKKYIPGGKTIELSLFVDRDNCIWAYSLMGIWAYDCGTKSWRTDLTGIWSSRPDVIIHAVAQDIEGRIWVGKDYDGIDVLEKETGKVTSLVAHDDNGRSLPHNTIYDLYADRDGVMWVGTYKKGVSYYSESIFKFNMYEWGDITCIEQADEDRLWLGTNDHGILLWNRSTGKAEPFWRDAEGQLPNPVVSMLKSKDGKLWVGTFNGGLYCMNGSQVRSYKEGTGNALASNNVWALVEDDKGRIWIASLGGGLQCLEPLSGTFETYTSNNSALLENNVTSLCWVDDNTLFFGTASQGVGTMDMRTREIKKIQGQSDSMKLSNDAVNHVYKDSRGLVWIATREGLNVYDTRRHMFLDLFPVVEAKGNFIAAITEDQERNMWVSTSRKVIRVTVASDGKGSYLFDSRAYNSEDGLQNCDFNQRSIKTLHNGIIAIGGLYGVNIFAPDHIRYNKMLPNVMFTGLSLFDEAVKVGQSYGGRVLIEKELNDVENVEFDYKQNIFSVSFASDNYNLPEKTQYMYKLEGFNNDWLTLPVGVHNVTFTNLAPGKYVLRVKAINSDGYVGIKEATLGIVVNPPFKLAAALQHHHHHH
I0124 12:37:42.310647 140384586483520 templates.py:267] Found an exact template match 4a2m_B.
I0124 12:37:44.813723 140384586483520 templates.py:267] Found an exact template match 4a2l_F.
I0124 12:37:46.607977 140384586483520 templates.py:267] Found an exact template match 3v9f_B.
I0124 12:37:47.402505 140384586483520 templates.py:267] Found an exact template match 3va6_A.
I0124 12:37:48.522100 140384586483520 templates.py:267] Found an exact template match 3ott_B.
I0124 12:37:48.918042 140384586483520 templates.py:267] Found an exact template match 5m11_A.
I0124 12:37:48.945314 140384586483520 templates.py:267] Found an exact template match 4a2m_B.
I0124 12:37:48.974633 140384586483520 templates.py:267] Found an exact template match 4a2l_F.
I0124 12:37:49.003993 140384586483520 templates.py:267] Found an exact template match 4a2m_B.
I0124 12:37:49.033177 140384586483520 templates.py:267] Found an exact template match 4a2l_F.
I0124 12:37:49.062488 140384586483520 templates.py:267] Found an exact template match 5m11_A.
I0124 12:37:49.089267 140384586483520 templates.py:267] Found an exact template match 3v9f_B.
I0124 12:37:49.118885 140384586483520 templates.py:267] Found an exact template match 3ott_B.
I0124 12:37:49.148563 140384586483520 templates.py:267] Found an exact template match 3va6_A.
I0124 12:37:49.178556 140384586483520 templates.py:267] Found an exact template match 3ott_B.
I0124 12:37:49.207692 140384586483520 templates.py:267] Found an exact template match 3va6_A.
I0124 12:37:49.237412 140384586483520 templates.py:267] Found an exact template match 5m11_A.
I0124 12:37:49.264744 140384586483520 templates.py:267] Found an exact template match 4a2m_B.
I0124 12:37:49.293939 140384586483520 templates.py:267] Found an exact template match 4a2l_F.
I0124 12:37:49.322692 140384586483520 templates.py:267] Found an exact template match 3v9f_B.
I0124 12:37:51.470793 140384586483520 pipeline.py:221] Uniref90 MSA size: 10000 sequences.
I0124 12:37:51.470931 140384586483520 pipeline.py:222] BFD MSA size: 4966 sequences.
I0124 12:37:51.470967 140384586483520 pipeline.py:223] MGnify MSA size: 501 sequences.
I0124 12:37:51.471006 140384586483520 pipeline.py:224] Final (deduplicated) MSA size: 15406 sequences.
I0124 12:37:51.471178 140384586483520 pipeline.py:226] Total number of templates (NB: this can include bad templates and is later filtered to top 4): 20.
I0124 12:37:52.176468 140384586483520 run_alphafold.py:185] Running model model_1 on T1050
2022-01-24 12:37:54.502811: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-01-24 12:37:54.504019: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:02:00.0 name: RTX A6000 computeCapability: 8.6
coreClock: 1.8GHz coreCount: 84 deviceMemorySize: 47.54GiB deviceMemoryBandwidth: 715.34GiB/s
2022-01-24 12:37:54.504059: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-01-24 12:37:54.505758: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2022-01-24 12:37:54.505845: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2022-01-24 12:37:54.505870: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2022-01-24 12:37:54.506045: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2022-01-24 12:37:54.507774: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2022-01-24 12:37:54.508212: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2022-01-24 12:37:54.508344: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2022-01-24 12:37:54.510542: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2022-01-24 12:37:54.559168: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-24 12:37:54.563017: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:02:00.0 name: RTX A6000 computeCapability: 8.6
coreClock: 1.8GHz coreCount: 84 deviceMemorySize: 47.54GiB deviceMemoryBandwidth: 715.34GiB/s
2022-01-24 12:37:54.565146: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2022-01-24 12:37:54.565185: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2022-01-24 12:37:54.626365: E tensorflow/core/common_runtime/session.cc:91] Failed to create session: Internal: CUDA runtime implicit initialization on GPU:0 failed. Status: unrecognized error code
2022-01-24 12:37:54.626381: E tensorflow/c/c_api.cc:2193] Internal: CUDA runtime implicit initialization on GPU:0 failed. Status: unrecognized error code
Traceback (most recent call last):
  File "/mnt/disk4T/alphafold-project/alphafold_conda/run_alphafold.py", line 427, in <module>
    app.run(main)
  File "/home/aaron/bin/miniconda3/envs/alphafold_conda/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/aaron/bin/miniconda3/envs/alphafold_conda/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/mnt/disk4T/alphafold-project/alphafold_conda/run_alphafold.py", line 403, in main
    predict_structure(
  File "/mnt/disk4T/alphafold-project/alphafold_conda/run_alphafold.py", line 188, in predict_structure
    processed_feature_dict = model_runner.process_features(
  File "/mnt/disk4T/alphafold-project/alphafold_conda/alphafold/model/model.py", line 131, in process_features
    return features.np_example_to_features(
  File "/mnt/disk4T/alphafold-project/alphafold_conda/alphafold/model/features.py", line 101, in np_example_to_features
    with tf.Session(graph=tf_graph) as sess:
  File "/home/aaron/bin/miniconda3/envs/alphafold_conda/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1596, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/aaron/bin/miniconda3/envs/alphafold_conda/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 711, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: unrecognized error code
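
The traceback ends in `tf.Session(graph=tf_graph)` inside features.py, so one way to narrow things down is to open a TensorFlow session on a trivial graph outside AlphaFold. This is only a diagnostic sketch: the function name is hypothetical, and it assumes the same conda environment is active. If it fails with the same InternalError, the problem lies in the TensorFlow/CUDA setup rather than in the AlphaFold code path.

```python
def try_tf_session():
    """Open a TF1-style session on a tiny graph and report what happens."""
    try:
        # AlphaFold's feature code uses the TF1 compatibility API.
        import tensorflow.compat.v1 as tf
    except ImportError:
        return "tensorflow is not installed in this environment"

    graph = tf.Graph()
    with graph.as_default():
        total = tf.add(tf.constant(1), tf.constant(2))

    try:
        # Creating the session is the step that failed in the traceback.
        with tf.Session(graph=graph) as sess:
            return f"session ok, 1 + 2 = {sess.run(total)}"
    except tf.errors.InternalError as err:
        return f"session failed: {err}"


if __name__ == "__main__":
    print(try_tf_session())
```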

The code was run on Ubuntu 20.04, and the nvidia-smi output is:

Mon Jan 24 15:10:12 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.106.00   Driver Version: 460.106.00   CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  RTX A6000           On   | 00000000:02:00.0  On |                  Off |
| 30%   30C    P8    23W / 300W |    206MiB / 48676MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1778      G   /usr/lib/xorg/Xorg                147MiB |
|    0   N/A  N/A     13768      G   /usr/bin/gnome-shell               32MiB |
|    0   N/A  N/A    969147      G   ...nlogin/bin/sunloginclient        6MiB |
|    0   N/A  N/A   1048268      G   ...AAAAAAAAA= --shared-files       17MiB |
+-----------------------------------------------------------------------------+
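
To compare the driver against the CUDA build of jaxlib, the driver and CUDA versions can be pulled out of the nvidia-smi header line. A small sketch with a hypothetical helper name; as far as I know, the "CUDA Version" field reports the highest CUDA version the driver supports, not the toolkit that is installed.

```python
import re

# Matches the header line printed by nvidia-smi, as in the output above.
HEADER = re.compile(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)")


def parse_smi_header(text):
    """Return (driver_version, cuda_version) or None if no header is found."""
    match = HEADER.search(text)
    return match.groups() if match else None


if __name__ == "__main__":
    line = "| NVIDIA-SMI 460.106.00   Driver Version: 460.106.00   CUDA Version: 11.2     |"
    print(parse_smi_header(line))
```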

Old-Shatterhand commented 2 months ago

Hey @Feng-Zhang, since AlphaFold completes the database search and then starts inference (the TensorFlow code is called by AlphaFold), I assume this is an issue with AlphaFold and not with the docker-free installation. Were you able to train or use other TensorFlow-based neural networks with the jax installation described above?

I'll close the issue for now; if you believe it's an installation issue, you can reopen it.
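
One way to look at the jax side of that question separately from TensorFlow is to list the devices jax itself can see. A minimal sketch, with a hypothetical function name; it assumes it is run inside the same conda environment. If only `cpu` shows up, the installed jaxlib is probably not a CUDA build.

```python
def jax_platforms():
    """Return the platform name of each device jax sees, or None without jax."""
    try:
        import jax
    except ImportError:
        return None
    # Each device exposes a platform string such as 'cpu' or 'gpu'.
    return [device.platform for device in jax.devices()]


if __name__ == "__main__":
    platforms = jax_platforms()
    if platforms is None:
        print("jax is not installed in this environment")
    else:
        print("jax devices:", platforms)
```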