hpcaitech / FastFold

Optimizing AlphaFold Training and Inference on GPU Clusters
Apache License 2.0

Issues with processing the templates #155

Closed s-kyungyong closed 1 year ago

s-kyungyong commented 1 year ago

Hi!

I am running a few test cases and encountered some problems that I think are related to processing the templates.

I have a working example that led to a pdb output:

python /global/scratch/users/skyungyong/Software/FastFold/inference.py --output_dir ./ --model_preset multimer --use_precomputed_alignments Alignments --enable_workflow --inplace --param_path /global/scratch/users/skyungyong/Software/FastFold/data/params/params_model_1_multimer_v3.npz --model_name model_1_multimer AT3G18790-AT3G18790.fasta /global/scratch/users/skyungyong/Software/alphafold-multimer-v2.2.2-080922/Database/pdb_mmcif/mmcif_files/

WARNING:root:Triton is not available, fallback to old kernel.
WARNING:root:Triton is not available, fallback to old kernel.
WARNING:root:Triton is not available, fallback to old kernel.
running in multimer mode...
WARNING:root:Triton is not available, fallback to old kernel.
WARNING:root:Triton is not available, fallback to old kernel.
WARNING:root:Triton is not available, fallback to old kernel.
[02/21/23 20:04:34] INFO     colossalai - colossalai - INFO: /global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
                    INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0
[02/21/23 20:04:35] INFO     colossalai - colossalai - INFO: /global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO: /global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/initialize.py:116 launch
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
Inference time: 144.00418956391513

These are some of the runs that produced error messages:

python /global/scratch/users/skyungyong/Software/FastFold/inference.py --output_dir ./ --model_preset multimer --use_precomputed_alignments Alignments --enable_workflow --inplace --param_path /global/scratch/users/skyungyong/Software/FastFold/data/params/params_model_1_multimer_v3.npz --model_name model_1_multimer AT1G23170-AT1G23170.fasta /global/scratch/users/skyungyong/Software/alphafold-multimer-v2.2.2-080922/Database/pdb_mmcif/mmcif_files/
WARNING:root:Triton is not available, fallback to old kernel.
WARNING:root:Triton is not available, fallback to old kernel.
WARNING:root:Triton is not available, fallback to old kernel.
running in multimer mode...
WARNING:root:Triton is not available, fallback to old kernel.
WARNING:root:Triton is not available, fallback to old kernel.
WARNING:root:Triton is not available, fallback to old kernel.
[02/21/23 20:16:03] INFO     colossalai - colossalai - INFO: /global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
                    INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0
[02/21/23 20:16:12] INFO     colossalai - colossalai - INFO: /global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO: /global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/initialize.py:116 launch
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
Traceback (most recent call last):
  File "/global/scratch/users/skyungyong/Software/FastFold/inference.py", line 548, in <module>
    main(args)
  File "/global/scratch/users/skyungyong/Software/FastFold/inference.py", line 164, in main
    inference_multimer_model(args)
  File "/global/scratch/users/skyungyong/Software/FastFold/inference.py", line 293, in inference_multimer_model
    torch.multiprocessing.spawn(inference_model, nprocs=args.gpus, args=(args.gpus, result_q, batch, args))
  File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/global/scratch/users/skyungyong/Software/FastFold/inference.py", line 151, in inference_model
    out = model(batch)
  File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/global/scratch/users/skyungyong/Software/FastFold/fastfold/model/hub/alphafold.py", line 522, in forward
    outputs, m_1_prev, z_prev, x_prev = self.iteration(
  File "/global/scratch/users/skyungyong/Software/FastFold/fastfold/model/hub/alphafold.py", line 270, in iteration
    template_embeds = self.template_embedder(
  File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/global/scratch/users/skyungyong/Software/FastFold/fastfold/model/fastnn/embedders_multimer.py", line 368, in forward
    self.template_single_embedder(
  File "/global/scratch/users/skyungyong/Software/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/global/scratch/users/skyungyong/Software/FastFold/fastfold/model/fastnn/embedders_multimer.py", line 238, in forward
    all_atom_multimer.compute_chi_angles(
  File "/global/scratch/users/skyungyong/Software/FastFold/fastfold/utils/all_atom_multimer.py", line 441, in compute_chi_angles
    chi_angle_atoms_mask = torch.prod(chi_angle_atoms_mask, dim=-1)
RuntimeError: CUDA driver error: invalid argument
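
For context on the failing line: a chi angle is only defined when all four of its defining atoms are resolved, so `compute_chi_angles` reduces a per-atom mask over its last axis with a product. The reduction itself is trivial, which is why a `CUDA driver error: invalid argument` on it points at the driver/toolkit stack rather than the input data. A CPU sketch with NumPy (shapes are illustrative, not FastFold's actual ones):

```python
import numpy as np

# Per-residue, per-chi mask over the 4 atoms that define each chi angle.
# The product over the last axis keeps a chi angle only if all 4 atoms exist.
chi_angle_atoms_mask = np.array([
    [[1, 1, 1, 1],   # residue 0, chi1: all four atoms present
     [1, 1, 0, 1]],  # residue 0, chi2: one atom missing
    [[0, 0, 0, 0],   # residue 1, chi1: no atoms resolved
     [1, 1, 1, 1]],  # residue 1, chi2: all present
])
chi_mask = np.prod(chi_angle_atoms_mask, axis=-1)
print(chi_mask.tolist())  # → [[1, 0], [0, 1]]
```
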
python /global/scratch/users/skyungyong/Software/FastFold/inference.py --output_dir ./ --model_preset multimer --use_precomputed_alignments Alignments --enable_workflow --inplace --param_path /global/scratch/users/skyungyong/Software/FastFold/data/params/params_model_1_multimer_v3.npz --model_name model_1_multimer AT1G13220-AT1G13220.fasta /global/scratch/users/skyungyong/Software/alphafold-multimer-v2.2.2-080922/Database/pdb_mmcif/mmcif_files/
WARNING:root:Triton is not available, fallback to old kernel.
WARNING:root:Triton is not available, fallback to old kernel.
WARNING:root:Triton is not available, fallback to old kernel.
running in multimer mode...
Traceback (most recent call last):
  File "/global/scratch/users/skyungyong/Software/FastFold/fastfold/data/templates.py", line 859, in _process_single_hit
    features, realign_warning = _extract_template_features(
  File "/global/scratch/users/skyungyong/Software/FastFold/fastfold/data/templates.py", line 651, in _extract_template_features
    raise TemplateAtomMaskAllZerosError(
fastfold.data.templates.TemplateAtomMaskAllZerosError: Template all atom mask was all zeros: 6zmi_CE. Residue range: 4-304

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/global/scratch/users/skyungyong/Software/FastFold/inference.py", line 548, in <module>
    main(args)
  File "/global/scratch/users/skyungyong/Software/FastFold/inference.py", line 164, in main
    inference_multimer_model(args)
  File "/global/scratch/users/skyungyong/Software/FastFold/inference.py", line 281, in inference_multimer_model
    feature_dict = data_processor.process_fasta(
  File "/global/scratch/users/skyungyong/Software/FastFold/fastfold/data/data_pipeline.py", line 1165, in process_fasta
    chain_features = self._process_single_chain(
  File "/global/scratch/users/skyungyong/Software/FastFold/fastfold/data/data_pipeline.py", line 1114, in _process_single_chain
    chain_features = self._monomer_data_pipeline.process_fasta(
  File "/global/scratch/users/skyungyong/Software/FastFold/fastfold/data/data_pipeline.py", line 942, in process_fasta
    template_features = make_template_features(
  File "/global/scratch/users/skyungyong/Software/FastFold/fastfold/data/data_pipeline.py", line 76, in make_template_features
    templates_result = template_featurizer.get_templates(
  File "/global/scratch/users/skyungyong/Software/FastFold/fastfold/data/templates.py", line 1166, in get_templates
    result = _process_single_hit(
  File "/global/scratch/users/skyungyong/Software/FastFold/fastfold/data/templates.py", line 888, in _process_single_hit
    "%s_%s (sum_probs: %.2f, rank: %d): feature extracting errors: "
TypeError: must be real number, not NoneType
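
The `TypeError` here is a secondary failure: `_process_single_hit` is building a warning message about the original `TemplateAtomMaskAllZerosError`, but the hit's `sum_probs` is `None`, which old-style `"%.2f"` formatting cannot handle. A minimal standalone reproduction (the values and the `nan` fallback are illustrative, not FastFold's actual fix):

```python
sum_probs = None  # hypothetical template hit with no sum_probs score

try:
    msg = "%s_%s (sum_probs: %.2f, rank: %d): feature extracting errors" % (
        "6zmi", "CE", sum_probs, 0)
except TypeError as e:
    print(e)  # → must be real number, not NoneType

# One defensive pattern: substitute a sentinel before formatting,
# so the real underlying error is reported instead of masked.
safe_probs = sum_probs if sum_probs is not None else float("nan")
msg = "%s_%s (sum_probs: %.2f, rank: %d): feature extracting errors" % (
    "6zmi", "CE", safe_probs, 0)
print(msg)  # → 6zmi_CE (sum_probs: nan, rank: 0): feature extracting errors
```
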

Could these errors stem from problems with the homologous templates themselves, or is there a fix for them?

Thank you!

double-vin commented 1 year ago

The last problem has already been solved; see https://github.com/hpcaitech/FastFold/issues/144. You can use the latest code to fix it.

s-kyungyong commented 1 year ago

I believe the first error was due to a CUDA incompatibility, and #144 solved the second issue!
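
For anyone hitting the first error, a quick way to inspect the CUDA stack before rerunning is to compare the toolkit PyTorch was built against with the installed driver. This is a hedged sketch, not a FastFold utility: `cuda_report` is a hypothetical helper, and every field may be `None` on machines without PyTorch or an NVIDIA driver.

```python
import subprocess

def cuda_report():
    """Collect whatever CUDA version info is available; fields may be None."""
    info = {"torch": None, "torch_cuda": None, "driver_smi": None}
    try:
        import torch  # optional: only if PyTorch is installed
        info["torch"] = torch.__version__
        info["torch_cuda"] = torch.version.cuda  # toolkit torch was built with
    except ImportError:
        pass
    try:
        out = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        if out.returncode == 0 and out.stdout:
            info["driver_smi"] = out.stdout.splitlines()[0]
    except FileNotFoundError:
        pass  # no NVIDIA driver / CLI on this machine
    return info

print(cuda_report())
```

A mismatch between `torch_cuda` and the driver's supported CUDA version is a common source of "CUDA driver error: invalid argument" on otherwise-valid kernels.
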