hpcaitech / FastFold

Optimizing AlphaFold Training and Inference on GPU Clusters
Apache License 2.0

RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmEx(` #182

Open cdsnow opened 1 year ago

cdsnow commented 1 year ago

Greetings!

Following the instructions, I've completed an installation and everything seemed to work, including the generation of the MSA. Specifically, I've done the recommended conda installation, the pip installation of triton, and the local download/unpacking of the datasets. Per my reading, the remainder of the instructions (e.g. Docker) seemed optional, so I jumped directly to trying inference.sh.

However, I'm hitting a repeatable runtime CUDA error. Since the same error occurs when I try the benchmark run, I'll paste that output at the bottom. Keeping an eye on the VRAM, this does not seem to be an issue of the GPU (an RTX 3090) running out of memory.

```
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8 |
```

```
(fastfold) csnow@icestorm:~/code/FastFold/benchmark$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
```

Any advice!?

Best wishes,
-Chris

```
(fastfold) csnow@icestorm:~/code/FastFold/benchmark$ torchrun --nproc_per_node=1 perf.py --msa-length 128 --res-length 256
[08/25/23 10:33:06] INFO colossalai - colossalai - INFO: /home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[08/25/23 10:33:07] INFO colossalai - colossalai - INFO: /home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024, the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/initialize.py:116 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
Traceback (most recent call last):
  File "perf.py", line 187, in <module>
    main()
  File "perf.py", line 152, in main
    layer_inputs = attn_layers[lyr_idx].forward(layer_inputs, node_mask, pair_mask)
  File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/fastnn/evoformer.py", line 65, in forward
    m = self.msa(m, z, msa_mask)
  File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/fastnn/msa.py", line 143, in forward
    node = self.MSARowAttentionWithPairBias(node, pair, node_mask_row)
  File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/fastnn/msa.py", line 63, in forward
    b = F.linear(Z, self.linear_b_weights)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 429806) of binary: /home/csnow/anaconda3/envs/fastfold/bin/python
Traceback (most recent call last):
  File "/home/csnow/anaconda3/envs/fastfold/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

perf.py FAILED

Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-25_10:33:10
  host      : icestorm
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 429806)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
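In case it helps with triage: the failing line is an `F.linear` on bfloat16 tensors, so a minimal standalone sketch like the one below (not FastFold code; the shapes are arbitrary) should exercise the same bf16 `cublasGemmEx` path and help separate an environment problem from a model problem.

```python
# Minimal sketch (not FastFold code): a bfloat16 F.linear, which PyTorch lowers
# to the same cublasGemmEx(..., CUDA_R_16BF, ...) call as the failing msa.py line.
import torch
import torch.nn.functional as F

z = torch.randn(4, 16, device="cuda", dtype=torch.bfloat16)  # arbitrary input
w = torch.randn(8, 16, device="cuda", dtype=torch.bfloat16)  # arbitrary weight
print(F.linear(z, w).shape)  # torch.Size([4, 8]) on a working setup
```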
addsg commented 1 year ago

Hi, I hit the same problem and eventually solved it. You need to check whether your CUDA version matches this project: it pins torch 1.12.1, which means your CUDA toolkit must be one of 10.2, 11.3, or 11.6.
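A quick way to check (just a sketch, assuming the fastfold conda environment is active) is to print the CUDA version the installed torch wheel was built against and compare it with the toolkit your environment exposes:

```python
# Sketch: compare the CUDA build of the installed torch wheel with the runtime
# environment (assumes the fastfold conda env is active; nothing FastFold-specific).
import torch

print("torch version:", torch.__version__)          # expected 1.12.1 for this repo
print("torch built for CUDA:", torch.version.cuda)  # should be 10.2, 11.3, or 11.6
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```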

bj600800 commented 7 months ago

Find your cudatoolkit location with `which nvcc`, and make sure you are calling the cudatoolkit 11.3 installation.
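For example, a rough check (just a sketch; it assumes a conda setup like the one above) that the toolkit your shell resolves is the CUDA 11.3 one:

```python
# Rough sketch: report which nvcc/CUDA toolkit the current environment resolves.
# The reported release should be 11.3 to match torch==1.12.1 + cudatoolkit=11.3.
import os
import shutil
import subprocess

print("CUDA_HOME:", os.environ.get("CUDA_HOME"))
nvcc_path = shutil.which("nvcc")
print("nvcc:", nvcc_path)
if nvcc_path:
    out = subprocess.run([nvcc_path, "--version"], capture_output=True, text=True)
    print(out.stdout)
```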