kalininalab / alphafold_non_docker

AlphaFold2 non-docker setup

No GPU/TPU found, falling back to CPU on GPU with 4Gb of RAM #13

Closed avilella closed 2 years ago

avilella commented 2 years ago

I am trying to run the non-docker version of AlphaFold2 from this repo. I succeeded on an AWS GPU instance whose GPU has 16 GB of memory, and for the proteins I am inputting, usage peaks at around 3 GB according to nvidia-smi while AlphaFold2 is running.
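A simple way to watch this while run_alphafold is running (a sketch, assuming the standard driver utilities are on the PATH):

$ watch -n 1 nvidia-smi    # refresh GPU memory and utilisation once per second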

I am now trying the same on a laptop that has an Nvidia GPU with 4 GB of memory (see info below), but so far the same run_alphafold command does not see the GPU. Any ideas?


I0820 11:22:14.030323 140191155447616 templates.py:836] Using precomputed obsolete pdbs /bfx_share1/quick_share/alphafold2/db/pdb_mmcif/obsolete.dat.
I0820 11:22:14.230584 140191155447616 xla_bridge.py:236] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: 
2021-08-20 11:22:14.253180: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
I0820 11:22:14.253384 140191155447616 xla_bridge.py:236] Unable to initialize backend 'gpu': Failed precondition: No visible GPU devices.
I0820 11:22:14.253819 140191155447616 xla_bridge.py:236] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
W0820 11:22:14.253926 140191155447616 xla_bridge.py:240] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
I0820 11:22:15.007403 140191155447616 run_alphafold.py:259] Have 1 models: ['model_1']
I0820 11:22:15.007551 140191155447616 run_alphafold.py:272] Using random seed 3180855101326110185 for the data pipeline
I0820 11:22:15.008080 140191155447616 jackhmmer.py:130] Launching subprocess "/home/user/miniconda3/envs/alphafold/bin/jackhmmer -o /dev/null -A /tmp/tmpjdujyngs/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /home/user/alphafold/CL-1384189538793.fasta /bfx_share1/quick_share/alphafold2/db/uniref90/uniref90.fasta"
I0820 11:22:15.019448 140191155447616 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I0820 11:22:16.779201 140191155447616 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 1.760 seconds
I0820 11:22:16.786322 140191155447616 jackhmmer.py:130] Launching subprocess "/home/user/miniconda3/envs/alphafold/bin/jackhmmer -o /dev/null -A /tmp/tmpvmikh78k/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /home/user/alphafold/CL-1384189538793.fasta /bfx_share1/quick_share/alphafold2/db/mgnify/mgy_clusters.fa"
I0820 11:22:16.797401 140191155447616 utils.py:36] Started Jackhmmer (mgy_clusters.fa) query

$ ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd00001F91sv000017AAsd00003A41bc03sc00i00
vendor   : NVIDIA Corporation
model    : TU117M [GeForce GTX 1650 Mobile / Max-Q]
driver   : nvidia-driver-460-server - distro non-free
driver   : nvidia-driver-450-server - distro non-free
driver   : nvidia-driver-470 - distro non-free recommended
driver   : nvidia-driver-460 - distro non-free
driver   : nvidia-driver-418-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin

$ nvidia-smi
Fri Aug 20 11:22:38 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   38C    P8     3W /  N/A |    148MiB /  3903MiB |      1%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1331      G   /usr/lib/xorg/Xorg                 55MiB |
|    0   N/A  N/A      1373      G   /usr/bin/sddm-greeter              88MiB |
+-----------------------------------------------------------------------------+
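One way to narrow this down, independently of run_alphafold, is to ask JAX directly which devices it can see from inside the conda environment. A minimal check, assuming the environment is called alphafold as in this repo's setup instructions:

$ conda activate alphafold
$ python -c "import jax; print(jax.devices())"
# A working GPU setup prints a GpuDevice entry; on this laptop it would
# fall back to [CpuDevice(id=0)], matching the cuInit error in the log above.

If the same cuInit error shows up here, the problem is in the driver/CUDA stack rather than in AlphaFold itself.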
sanjaysrikakulam commented 2 years ago

Hi @avilella

Can you reboot the system and then try running the command again? If that does not work, try uninstalling CUDA and cuDNN, rebooting, reinstalling them, rebooting once more, and then running AF2. Sometimes packages do not install properly.

This is the error TensorFlow raises when you try to run it: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error

This can have several causes, so reinstalling the required packages may fix the issue.
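For reference, on Ubuntu that cycle could look roughly like the following; the exact package names depend on how the driver and toolkit were installed originally (apt, runfile, or conda), so treat this as a sketch rather than exact commands:

$ sudo apt purge 'nvidia-*'            # remove the apt-installed driver packages
$ sudo reboot
$ sudo apt install nvidia-driver-470   # the driver ubuntu-drivers marks as recommended above
$ sudo reboot
$ nvidia-smi                           # confirm the driver loads before retrying AF2
# If CUDA/cuDNN were installed into the conda environment, they can be reinstalled there instead, e.g.:
$ conda install -y -c conda-forge cudatoolkit cudnn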

avilella commented 2 years ago

Here is what I tried:

I removed the Nvidia and CUDA related packages with sudo apt remove, then rebooted. I then installed the nvidia-driver-460 driver with sudo apt install, which is actually not the one recommended by ubuntu-drivers devices (in my case the recommendation was nvidia-driver-470).

I re-ran the run_alphafold script, and this time it recognised the GPU (it only reports that it cannot find a TPU, and there isn't one). It then went on to the HHsearch step, which takes a few minutes, but failed later at the prediction step. Googling the error suggested the CUDA toolkit needed updating, so I installed it with sudo apt install nvidia-cuda-toolkit, rebooted, and tried again.

After that it seems to work, and nvidia-smi shows GPU memory being used as the predict step progresses.
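Spelled out, the sequence above amounts to roughly the following (package names taken from the ubuntu-drivers listing earlier; other machines may need different versions):

$ sudo apt remove --purge 'nvidia-*'   # drop the previously installed Nvidia/CUDA packages
$ sudo reboot
$ sudo apt install nvidia-driver-460   # not the "recommended" 470, but this is what ended up working here
$ sudo reboot
$ sudo apt install nvidia-cuda-toolkit # added after the failure at the prediction step
$ sudo reboot
$ nvidia-smi                           # GPU memory now fills up during the predict step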

Thanks for the advice!