patrickbryant1 / SpeedPPI

Rapid protein-protein interaction network creation from multiple sequence alignments with Deep Learning

Core dumped Error from ptas #20

Closed KunFang93 closed 4 months ago

KunFang93 commented 4 months ago

Hi,

Thanks for providing this wonderful tool! I followed the instructions and everything ran smoothly until the prediction step:

(speed_ppi) [kfang@compgeno SpeedPPI]$ bash predict_single.sh /data/kfang/SLL/PPI/Q6N021.fasta /data/kfang/SLL/PPI/Q8CGY8.fasta ./hh-suite/bin/hhblits ./data/TET2_OGT/
MSAs exists...
Checking if all are present
Q6N021
./data/TET2_OGT//msas//Q6N021.a3m exists
Q8CGY8
./data/TET2_OGT//msas//Q8CGY8.a3m exists
Predicting...
/home/kfang/.conda/envs/mamba/envs/speed_ppi/lib/python3.12/site-packages/Bio/Data/SCOPData.py:18: BiopythonDeprecationWarning: The 'Bio.Data.SCOPData' module will be deprecated in a future release of Biopython in favor of 'Bio.Data.PDBData.
  warnings.warn(
2024-02-23 11:01:20.370198: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Evaluating pair Q6N021-Q8CGY8
2024-02-23 11:02:45.882470: F external/xla/xla/service/gpu/nvptx_compiler.cc:619] ptxas returned an error during compilation of ptx to sass: 'INTERNAL: ptxas 12.3.103 has a bug that we think can affect XLA. Please use a different version.'  If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
predict_single.sh: line 51: 553887 Aborted                 (core dumped) python3 ./src/run_alphafold_single.py --complex_id $COMPLEX_ID --msa1 $MSA1 --msa2 $MSA2 --data_dir $DATADIR --max_recycles $RECYCLES --output_dir $OUTDIR

I tested

(speed_ppi) [kfang@compgeno SpeedPPI]$ python3 ./src/test_gpu_avail.py
gpu

I searched online and found this thread mentioning the error. I tried the same commands from the thread and got the same error:

2024-02-23 11:04:04.314359: F external/xla/xla/service/gpu/nvptx_compiler.cc:619] ptxas returned an error during compilation of ptx to sass: 'INTERNAL: ptxas 12.3.103 has a bug that we think can affect XLA. Please use a different version.'  If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
Aborted (core dumped)

Could you shed some light on how to solve this issue? Please let me know if any extra information is needed. Thanks in advance!
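For anyone hitting the same abort: the message blames ptxas 12.3.103 specifically. As an illustrative sketch (not part of SpeedPPI), the build number can be parsed from the `V<major>.<minor>.<patch>` banner that `ptxas --version` and `nvcc -V` print, and compared against the known-bad build:

```python
import re

# The build that XLA rejects, taken from the error message above.
BUGGY_PTXAS = (12, 3, 103)

def parse_cuda_tool_version(banner):
    """Extract (major, minor, patch) from a ptxas/nvcc banner,
    e.g. 'Cuda compilation tools, release 12.3, V12.3.103'."""
    m = re.search(r"V(\d+)\.(\d+)\.(\d+)", banner)
    return tuple(int(x) for x in m.groups()) if m else None

banner = "Cuda compilation tools, release 12.3, V12.3.103"
print(parse_cuda_tool_version(banner) == BUGGY_PTXAS)  # True for the bad build
```

Any ptxas build other than 12.3.103 (for example 12.3.107, which ships with a newer toolkit) should not trip this particular XLA check.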

Best, Kun

patrickbryant1 commented 4 months ago

Hi, I think this is related to the CUDA installation. What GPU do you have and what is the result of 'nvidia-smi'?

Best,

Patrick

KunFang93 commented 4 months ago

Hi,

Thanks for your reply! We have A100s, and here is the `nvidia-smi` info:

(speed_ppi) [kfang@compgeno ~]$ nvidia-smi --query-gpu=name --format=csv,noheader
NVIDIA A100 80GB PCIe
NVIDIA A100 80GB PCIe
(speed_ppi) [kfang@compgeno ~]$ nvidia-smi
Sat Feb 24 09:50:48 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:22:00.0 Off |                    0 |
| N/A   41C    P0    47W / 300W |      0MiB / 81920MiB |      0%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:26:00.0 Off |                    0 |
| N/A   39C    P0    45W / 300W |      0MiB / 81920MiB |      0%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

However, when I run `nvcc -V`:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0

It looks like `nvcc` shows 12.3 but `nvidia-smi` shows 12.0. I am new to using GPUs, so I wondered if this could be the problem, and what I should do to fix it.
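For context (an illustrative note, not SpeedPPI code): the `CUDA Version` that `nvidia-smi` prints is the newest CUDA release the installed driver supports, while `nvcc -V` reports the installed toolkit. A toolkit newer than the driver's limit, as here (12.3 vs 12.0), is a plausible source of ptxas/XLA failures. A minimal sketch of the comparison:

```python
def cuda_version(v):
    """Parse '12.3' -> (12, 3) so versions compare numerically."""
    return tuple(int(x) for x in v.split("."))

def toolkit_exceeds_driver(toolkit, driver_limit):
    # driver_limit is the 'CUDA Version' nvidia-smi prints: the newest
    # CUDA the driver can run. A newer toolkit may misbehave.
    return cuda_version(toolkit) > cuda_version(driver_limit)

print(toolkit_exceeds_driver("12.3", "12.0"))  # True: the mismatch reported here
```

Upgrading the driver (so its supported CUDA version is at least the toolkit's) or downgrading the toolkit would remove the mismatch.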

Thank you so much for helping!

Best, Kun

KunFang93 commented 4 months ago

I just updated the CUDA driver to 12.4 and reinstalled the speed_ppi conda environment, but it still fails with the same error:

(speed_ppi) [kfang@compgeno SpeedPPI]$ nvidia-smi
Sat Feb 24 10:47:11 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          On  |   00000000:22:00.0 Off |                    0 |
| N/A   42C    P0             48W /  300W |       0MiB /  81920MiB |      0%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  |   00000000:26:00.0 Off |                    0 |
| N/A   41C    P0             45W /  300W |       0MiB /  81920MiB |      0%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Thanks in advance!

Best, Kun

KunFang93 commented 4 months ago

I found a workaround for this issue! I used

conda install jaxlib=*=*cuda* jax cuda-nvcc -c conda-forge -c nvidia

instead of

pip install --upgrade "jax[cuda12_local]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

The prediction is running smoothly so far; I will report back if anything goes wrong. Thanks!

KunFang93 commented 4 months ago

The above solution successfully generated the PDB file. Thanks!