hpcaitech / FastFold

Optimizing AlphaFold Training and Inference on GPU Clusters
Apache License 2.0

About error of inference question #28

Open zhoujingyu13687306871 opened 2 years ago

zhoujingyu13687306871 commented 2 years ago

Hi! I installed the conda environment according to the README and then wrote a script to infer a protein structure. The content of the script is as follows. After submitting the script, the following error is reported. Please help me find out what caused it.

Content of the script:

#DSUB -q root.default
#DSUB -R 'cpu=12;gpu=2;mem=96000'
#DSUB -l wuhanG5500
#DSUB -N 1
#DSUB -e %J.out
#DSUB -o %J.out
module load anaconda/2020.11
module load cuda/11.5.0-gcc-4.8.5-atd
module load gcc/8.3.0-gcc-4.8.5-cpp
source activate fastfold
af2Root=/home/bingxing2/public/alphafold2.1.1
torchrun --nproc_per_node=2 ./inference.py multi.fasta $af2Root/pdb_mmcif/mmcif_files \
 --output_dir ./out \
 --model_name model_1 \
 --param_path $af2Root/params/params_model_1_multimer.npz \
 --cpus 2 \
 --uniref90_database_path $af2Root/uniref90/uniref90.fasta \
 --mgnify_database_path $af2Root/mgnify/mgy_clusters.fa \
 --pdb70_database_path $af2Root/pdb70/pdb70 \
 --uniclust30_database_path $af2Root/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
 --bfd_database_path $af2Root/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
 --jackhmmer_binary_path `which jackhmmer` \
 --hhblits_binary_path `which hhblits` \
 --hhsearch_binary_path `which hhsearch` \
 --kalign_binary_path `which kalign`

error:

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[06/29/2022 01:17:31 AM] INFO     colossalai - colossalai - INFO: /home/bingxing
                                  2/gpuuser001/.conda/envs/fastfold/lib/python3.
                                  8/site-packages/colossalai/context/parallel_co
                                  ntext.py:519 set_device
                         INFO     colossalai - colossalai - INFO: process rank 1
                                  is bound to device 1
[06/29/2022 01:17:31 AM] INFO     colossalai - colossalai - INFO: /home/bingxing
                                  2/gpuuser001/.conda/envs/fastfold/lib/python3.
                                  8/site-packages/colossalai/context/parallel_co
                                  ntext.py:519 set_device
                         INFO     colossalai - colossalai - INFO: process rank 0
                                  is bound to device 0
[06/29/2022 01:17:33 AM] INFO     colossalai - colossalai - INFO: /home/bingxing
                                  2/gpuuser001/.conda/envs/fastfold/lib/python3.
                                  8/site-packages/colossalai/context/parallel_co
                                  ntext.py:555 set_seed
                         INFO     colossalai - colossalai - INFO: initialized
                                  seed on rank 1, numpy: 1024, python random:
                                  1024, ParallelMode.DATA: 1024,
                                  ParallelMode.TENSOR: 1025,the default parallel
                                  seed is ParallelMode.DATA.
[06/29/2022 01:17:33 AM] INFO     colossalai - colossalai - INFO: /home/bingxing
                                  2/gpuuser001/.conda/envs/fastfold/lib/python3.
                                  8/site-packages/colossalai/context/parallel_co
                                  ntext.py:555 set_seed
                         INFO     colossalai - colossalai - INFO: initialized
                                  seed on rank 0, numpy: 1024, python random:
                                  1024, ParallelMode.DATA: 1024,
                                  ParallelMode.TENSOR: 1024,the default parallel
                                  seed is ParallelMode.DATA.
                         INFO     colossalai - colossalai - INFO: /home/bingxing
                                  2/gpuuser001/.conda/envs/fastfold/lib/python3.
                                  8/site-packages/colossalai/initialize.py:112
                                  launch
                         INFO     colossalai - colossalai - INFO: Distributed
                                  environment is initialized, data parallel
                                  size: 1, pipeline parallel size: 1, tensor
                                  parallel size: 2
Traceback (most recent call last):
  File "./inference.py", line 266, in <module>
    main(args)
  File "./inference.py", line 82, in main
    import_jax_weights_(model, args.param_path, version=args.model_name)
  File "/home/bingxing2/gpuuser001/zhou/FastFold/fastfold/utils/import_weights.py", line 445, in import_jax_weights_
    assert len(incorrect) == 0
AssertionError
(The same traceback is printed by both ranks; their output is interleaved in the original log.)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 13987) of binary: /home/bingxing2/gpuuser001/.conda/envs/fastfold/bin/python
Traceback (most recent call last):
  File "/home/bingxing2/gpuuser001/.conda/envs/fastfold/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/home/bingxing2/gpuuser001/.conda/envs/fastfold/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/bingxing2/gpuuser001/.conda/envs/fastfold/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/bingxing2/gpuuser001/.conda/envs/fastfold/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/bingxing2/gpuuser001/.conda/envs/fastfold/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/bingxing2/gpuuser001/.conda/envs/fastfold/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./inference.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-06-29_01:17:42
  host      : gpu09
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 13988)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-06-29_01:17:42
  host      : gpu09
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 13987)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Shenggan commented 2 years ago

Sorry, this version of FastFold does not support the multimer model.

zhoujingyu13687306871 commented 2 years ago

OK, I will try the monomer model.

zhoujingyu13687306871 commented 2 years ago

Sorry, this version of FastFold does not support the multimer model.

Hi, I replaced multi.fasta with mono.fasta, which has only one protein chain, but there are still errors, as follows:

                         INFO     colossalai - colossalai - INFO: Distributed
                                  environment is initialized, data parallel
                                  size: 1, pipeline parallel size: 1, tensor
                                  parallel size: 2
Traceback (most recent call last):
  File "./inference.py", line 266, in <module>
    main(args)
  File "./inference.py", line 82, in main
    import_jax_weights_(model, args.param_path, version=args.model_name)
  File "/home/bingxing2/gpuuser001/zhou/FastFold/fastfold/utils/import_weights.py", line 445, in import_jax_weights_
    assert len(incorrect) == 0
AssertionError
Traceback (most recent call last):
  File "./inference.py", line 266, in <module>
    main(args)
  File "./inference.py", line 82, in main
    import_jax_weights_(model, args.param_path, version=args.model_name)
  File "/home/bingxing2/gpuuser001/zhou/FastFold/fastfold/utils/import_weights.py", line 445, in import_jax_weights_
    assert len(incorrect) == 0
AssertionError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 51631) of binary: /home/bingxing2/gpuuser001/.conda/envs/fastfold/bin/python
Traceback (most recent call last):
  File "/home/bingxing2/gpuuser001/.conda/envs/fastfold/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/home/bingxing2/gpuuser001/.conda/envs/fastfold/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/bingxing2/gpuuser001/.conda/envs/fastfold/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/bingxing2/gpuuser001/.conda/envs/fastfold/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/bingxing2/gpuuser001/.conda/envs/fastfold/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/bingxing2/gpuuser001/.conda/envs/fastfold/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Shenggan commented 2 years ago

From the error message, I think there was an error when loading the AlphaFold weights. Can you provide the version of your weights and how you obtained them?

zhoujingyu13687306871 commented 2 years ago

From the error message, I think there was an error when loading the AlphaFold weights. Can you provide the version of your weights and how you obtained them?

Do you mean the path of the params in AlphaFold 2.1.1?

Shenggan commented 2 years ago

I suppose you need to set param_path to the path of a monomer model like params_model_1.npz for model_1.
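For example, in the script at the top of this issue the flag pair would become the following (an illustrative fragment, reusing the paths already shown there), with the rest of the command unchanged:

--model_name model_1 \
--param_path $af2Root/params/params_model_1.npz \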

zhoujingyu13687306871 commented 2 years ago

I suppose you need to set param_path to the path of a monomer model like params_model_1.npz for model_1.

Hi! I modified the script as you suggested and ran it once. I found that there are two PDB files and one alignments directory in the output directory. Is this normal? Are there any other output files?

Here is the output content:
alignments
T1078 Tsp1, Trichoderma virens, 138 residues|_model_1_relaxed.pdb
T1078 Tsp1, Trichoderma virens, 138 residues|_model_1_unrelaxed.pdb

Shenggan commented 2 years ago

These outputs are as expected. The two PDB files correspond to the structures before and after relaxation, respectively.

zhoujingyu13687306871 commented 2 years ago

These outputs are as expected. The two PDB files correspond to the structures before and after relaxation, respectively.

Is there no pLDDT file?

zhoujingyu13687306871 commented 2 years ago

--model_name model_1 \
--param_path $af2Root/params/params_model_1.npz \

How can I modify it to predict 5 models at once?

Shenggan commented 2 years ago

pLDDT is not saved for now, but you can print it yourself after https://github.com/hpcaitech/FastFold/blob/main/inference.py#L185-L186

And I think you may need to extend the inference script to predict multiple models at once, or you can just use a shell script to run the inference script multiple times.
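
A minimal sketch of the shell-loop option (added here as an illustration, not part of the original comment), assuming the five monomer weight files params_model_1.npz through params_model_5.npz are present under $af2Root/params and reusing the flags from the script earlier in this thread:

#!/bin/bash
# Hypothetical wrapper: run the same inference command once per AlphaFold monomer model.
af2Root=/home/bingxing2/public/alphafold2.1.1
for i in $(seq 1 5); do
  torchrun --nproc_per_node=2 ./inference.py mono.fasta $af2Root/pdb_mmcif/mmcif_files \
    --output_dir ./out \
    --model_name model_$i \
    --param_path $af2Root/params/params_model_$i.npz \
    --cpus 12 \
    --uniref90_database_path $af2Root/uniref90/uniref90.fasta \
    --mgnify_database_path $af2Root/mgnify/mgy_clusters.fa \
    --pdb70_database_path $af2Root/pdb70/pdb70 \
    --uniclust30_database_path $af2Root/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path $af2Root/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --jackhmmer_binary_path `which jackhmmer` \
    --hhblits_binary_path `which hhblits` \
    --hhsearch_binary_path `which hhsearch` \
    --kalign_binary_path `which kalign`
done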

zhoujingyu13687306871 commented 2 years ago

pLDDT is not saved for now, but you can print it yourself after https://github.com/hpcaitech/FastFold/blob/main/inference.py#L185-L186

And I think you may need to extend the inference script to predict multiple models at once, or you can just use a shell script to run the inference script multiple times.

OK, I will try it. Thank you!

zbuster05 commented 2 years ago

Hi! When using the AlphaFold parameters for models 3, 4, and 5 I get the

assert len(incorrect) == 0
AssertionError

error as well, but not for models 1 or 2. I was wondering if this might be related? I am running monomer models.

lzhangUT commented 2 years ago

@zhoujingyu13687306871 Hi, I am having the same issue as you had. Would you mind sharing what you changed in your script to make it work? 1) From your script, it seems that you downloaded AlphaFold first, with "af2Root=/home/bingxing2/public/alphafold2.1.1"? And 2) how did you change the param_path to make it work? I would really appreciate your help.

zhoujingyu13687306871 commented 2 years ago

@zhoujingyu13687306871 Hi, I am having the same issue as you had. Would you mind sharing what you changed in your script to make it work?

  1. From your script, it seems that you downloaded AlphaFold first, with "af2Root=/home/bingxing2/public/alphafold2.1.1"? And 2) how did you change the param_path to make it work? I would really appreciate your help.

Here is my script:

##################### AF2 computation section #####################
module load anaconda/2020.11
module load cuda/11.5.0-gcc-4.8.5-atd
module load gcc/8.3.0-gcc-4.8.5-cpp
source activate fastfold
af2Root=/home/bingxing2/public/alphafold2.1.1
torchrun --nproc_per_node=2 ./inference.py mono.fasta $af2Root/pdb_mmcif/mmcif_files \
 --output_dir ./out \
 --model_name model_1 \
 --param_path $af2Root/params/params_model_1.npz \
 --cpus 12 \
 --uniref90_database_path $af2Root/uniref90/uniref90.fasta \
 --mgnify_database_path $af2Root/mgnify/mgy_clusters.fa \
 --pdb70_database_path $af2Root/pdb70/pdb70 \
 --uniclust30_database_path $af2Root/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
 --bfd_database_path $af2Root/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
 --jackhmmer_binary_path `which jackhmmer` \
 --hhblits_binary_path `which hhblits` \
 --hhsearch_binary_path `which hhsearch` \
 --kalign_binary_path `which kalign`

lzhangUT commented 2 years ago

Thanks very much! @zhoujingyu13687306871 So you downloaded AlphaFold following the instructions here first? https://github.com/deepmind/alphafold/

lzhangUT commented 2 years ago

And as you asked:
--model_name model_1 \
--param_path $af2Root/params/params_model_1.npz \

How did you modify it to predict 5 models at once?

lzhangUT commented 2 years ago

Did you just repeat the 'torchrun' command, replacing the model name and model parameters?

zhoujingyu13687306871 commented 2 years ago

Thanks very much! @zhoujingyu13687306871 So you downloaded AlphaFold following the instructions here first? https://github.com/deepmind/alphafold/

yes

zhoujingyu13687306871 commented 2 years ago

And as you asked:
--model_name model_1 \
--param_path $af2Root/params/params_model_1.npz \

How did you modify it to predict 5 models at once?

I'm running the script now, but I'm not sure if it will work well. It will take a while, please hold on.

###############################################################
module load anaconda/2020.11
module load cuda/11.5.0-gcc-4.8.5-atd
module load gcc/8.3.0-gcc-4.8.5-cpp
source activate fastfold
for i in $(seq 1 5); do
af2Root=/home/bingxing2/public/alphafold2.1.1
torchrun --nproc_per_node=2 ./inference.py mono.fasta $af2Root/pdb_mmcif/mmcif_files \
 --output_dir ./out \
 --model_name model_$i \
 --param_path $af2Root/params/params_model_$i.npz \
 --cpus 12 \
 --uniref90_database_path $af2Root/uniref90/uniref90.fasta \
 --mgnify_database_path $af2Root/mgnify/mgy_clusters.fa \
 --pdb70_database_path $af2Root/pdb70/pdb70 \
 --uniclust30_database_path $af2Root/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
 --bfd_database_path $af2Root/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
 --jackhmmer_binary_path `which jackhmmer` \
 --hhblits_binary_path `which hhblits` \
 --hhsearch_binary_path `which hhsearch` \
 --kalign_binary_path `which kalign`
done

lzhangUT commented 2 years ago

Thank you very much! I really appreciate it. I will be running similar jobs soon.

zhoujingyu13687306871 commented 2 years ago

Thank you very much! I really appreciate it. I will be running similar jobs soon.

My pleasure. I will contact you when the job completes.

lzhangUT commented 2 years ago

Hi, any updates on your run? I am downloading the data from the AlphaFold GitHub; it is taking so long…

zhoujingyu13687306871 commented 2 years ago

It works, but the output directory only contains one of the five PDB models.
