hpcaitech / FastFold

Optimizing AlphaFold Training and Inference on GPU Clusters
Apache License 2.0

CUDA error #186

Open hhh12345678990 opened 5 months ago

hhh12345678990 commented 5 months ago

When I run:

python inference.py mypdb.fasta data/pdb_mmcif/mmcif_files/ \
  --use_precomputed_alignments ./alignments \
  --output_dir ./ \
  --gpus 4 \
  --model_preset multimer \
  --uniref90_database_path data/uniref90/uniref90.fasta \
  --mgnify_database_path data/mgnify/mgy_clusters_2022_05.fa \
  --pdb70_database_path data/pdb70/pdb70 \
  --uniref30_database_path data/uniref30/UniRef30_2021_03 \
  --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
  --uniprot_database_path data/uniprot/uniprot.fasta \
  --pdb_seqres_database_path data/pdb_seqres/pdb_seqres.txt \
  --param_path data/params/params_model_1_multimer_v3.npz \
  --model_name model_1_multimer_v3 \
  --jackhmmer_binary_path `which jackhmmer` \
  --hhblits_binary_path `which hhblits` \
  --hhsearch_binary_path `which hhsearch` \
  --kalign_binary_path `which kalign` \
  --enable_workflow \
  --inplace

it fails with the following error:

running in multimer mode...
[01/26/24 20:11:10] INFO colossalai - colossalai - INFO: /home/khuang/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[01/26/24 20:11:10] INFO colossalai - colossalai - INFO: /home/khuang/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
[01/26/24 20:11:10] INFO colossalai - colossalai - INFO: /home/khuang/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 3 is bound to device 3
[01/26/24 20:11:10] INFO colossalai - colossalai - INFO: /home/khuang/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 2 is bound to device 2
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
[01/26/24 20:11:12] INFO colossalai - colossalai - INFO: /home/khuang/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
[01/26/24 20:11:12] INFO colossalai - colossalai - INFO: /home/khuang/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
[01/26/24 20:11:12] INFO colossalai - colossalai - INFO: /home/khuang/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 2, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1026, the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 3, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1027, the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1025, the default parallel seed is ParallelMode.DATA.
[01/26/24 20:11:12] INFO colossalai - colossalai - INFO: /home/khuang/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024, the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/khuang/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/initialize.py:116 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 4
Traceback (most recent call last):
  File "inference.py", line 556, in <module>
    main(args)
  File "inference.py", line 164, in main
    inference_multimer_model(args)
  File "inference.py", line 293, in inference_multimer_model
    torch.multiprocessing.spawn(inference_model, nprocs=args.gpus, args=(args.gpus, result_q, batch, args))
  File "/home/khuang/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/khuang/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/khuang/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/khuang/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/khuang/video/FastFold-main/inference.py", line 151, in inference_model
    out = model(batch)
  File "/home/khuang/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/khuang/video/FastFold-main/fastfold/model/hub/alphafold.py", line 522, in forward
    outputs, m_1_prev, z_prev, x_prev = self.iteration(
  File "/home/khuang/video/FastFold-main/fastfold/model/hub/alphafold.py", line 209, in iteration
    else self.input_embedder(feats)
  File "/home/khuang/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/khuang/video/FastFold-main/fastfold/model/nn/embedders_multimer.py", line 141, in forward
    tf_emb_i = self.linear_tf_z_i(tf)
  File "/home/khuang/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/khuang/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
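For reference, a minimal standalone check, a sketch of my own and not FastFold code (the layer sizes and tensor shapes below are made up for illustration), that exercises the same F.linear / cublasSgemm path on every visible GPU. If this also fails with CUBLAS_STATUS_INVALID_VALUE, the problem is likely in the CUDA / cuBLAS / PyTorch combination rather than in FastFold itself:

# Standalone cuBLAS sanity check: run a plain float32 nn.Linear matmul
# on each visible GPU, similar in kind to the failing call above.
import torch

def check_gpu(device_index: int) -> None:
    device = torch.device(f"cuda:{device_index}")
    # Arbitrary illustrative shapes; F.linear dispatches to cublasSgemm in float32.
    linear = torch.nn.Linear(in_features=22, out_features=128).to(device)
    x = torch.randn(1, 256, 22, device=device)
    y = linear(x)
    torch.cuda.synchronize(device)
    print(f"cuda:{device_index} OK, output shape {tuple(y.shape)}")

if __name__ == "__main__":
    print("torch", torch.__version__, "cuda", torch.version.cuda)
    for i in range(torch.cuda.device_count()):
        check_gpu(i)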

Has anyone run into this before?