hpcaitech / FastFold

Optimizing AlphaFold Training and Inference on GPU Clusters
Apache License 2.0

Template all atom mask was all zeros error #106

Closed dominik-handler closed 1 year ago

dominik-handler commented 1 year ago

Hi,

I am running the latest release of FastFold using this command:

python inference.py input.fa database/pdb_mmcif/mmcif_files/ \
    --output_dir output/ \
    --gpus 1 \
    --model_preset multimer \
    --uniref90_database_path database/uniref90/uniref90.fasta \
    --mgnify_database_path database/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path database/pdb70/pdb70 \
    --uniclust30_database_path database/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path database/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --uniprot_database_path database/uniprot/uniprot_sprot.fasta \
    --pdb_seqres_database_path database/pdb_seqres/pdb_seqres.txt  \
    --param_path database/params/params_model_1_multimer_v2.npz \
    --model_name model_1_multimer_v2 \
    --jackhmmer_binary_path `which jackhmmer` \
    --hhblits_binary_path `which hhblits` \
    --hhsearch_binary_path `which hhsearch` \
    --kalign_binary_path `which kalign`

I came across a problem during the template generation. I get the following error message:

[11/03/22 14:41:37] INFO     colossalai - root - INFO: Invalid resolution format: ['.']
                    INFO     colossalai - root - INFO: Found an exact template match 6v8o_I.
Traceback (most recent call last):
  File "/scratch-cbe/users/handler/fastfold/FastFold/fastfold/data/templates.py", line 859, in _process_single_hit
    features, realign_warning = _extract_template_features(
  File "/scratch-cbe/users/handler/fastfold/FastFold/fastfold/data/templates.py", line 651, in _extract_template_features
    raise TemplateAtomMaskAllZerosError(
fastfold.data.templates.TemplateAtomMaskAllZerosError: Template all atom mask was all zeros: 6v8o_I. Residue range: 415-475

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "inference.py", line 513, in <module>
    main(args)
  File "inference.py", line 148, in main
    inference_multimer_model(args)
  File "inference.py", line 263, in inference_multimer_model
    feature_dict = data_processor.process_fasta(
  File "/scratch-cbe/users/handler/fastfold/FastFold/fastfold/data/data_pipeline.py", line 1165, in process_fasta
    chain_features = self._process_single_chain(
  File "/scratch-cbe/users/handler/fastfold/FastFold/fastfold/data/data_pipeline.py", line 1114, in _process_single_chain
    chain_features = self._monomer_data_pipeline.process_fasta(
  File "/scratch-cbe/users/handler/fastfold/FastFold/fastfold/data/data_pipeline.py", line 942, in process_fasta
    template_features = make_template_features(
  File "/scratch-cbe/users/handler/fastfold/FastFold/fastfold/data/data_pipeline.py", line 76, in make_template_features
    templates_result = template_featurizer.get_templates(
  File "/scratch-cbe/users/handler/fastfold/FastFold/fastfold/data/templates.py", line 1163, in get_templates
    result = _process_single_hit(
  File "/scratch-cbe/users/handler/fastfold/FastFold/fastfold/data/templates.py", line 885, in _process_single_hit
    "%s_%s (sum_probs: %.2f, rank: %d): feature extracting errors: "
TypeError: must be real number, not NoneType
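
Note that the second traceback is a separate failure from the first: the code that formats the template error uses `%.2f` for `sum_probs`, and the `TypeError` indicates that this value was `None` for the hit. Below is a minimal sketch of a message formatter that tolerates missing fields. The helper name `format_hit_error` and its parameters are hypothetical and only illustrate the idea; this is not FastFold's actual code.

```python
from typing import Optional


def format_hit_error(pdb_code: str, chain_id: str,
                     sum_probs: Optional[float], rank: Optional[int],
                     error: str) -> str:
    """Build a template-error message that tolerates missing hit fields.

    Hypothetical helper for illustration; not part of FastFold.
    """
    # "%.2f" % None raises TypeError, so format the number only when present.
    probs = "n/a" if sum_probs is None else "%.2f" % sum_probs
    # "%s" accepts None as well as integers, unlike "%d".
    return "%s_%s (sum_probs: %s, rank: %s): feature extracting errors: %s" % (
        pdb_code, chain_id, probs, rank, error)


if __name__ == "__main__":
    print(format_hit_error("6v8o", "I", None, 0,
                           "Template all atom mask was all zeros"))
```

With a formatter like this, the original `TemplateAtomMaskAllZerosError` would be reported for the hit instead of being masked by the `TypeError`.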

The lines in the hhblits output corresponding to the ID raising the error are:

hmm_output.sto:#=GS 6v8o_I/416-476  DE [subseq from] mol:protein length:557  Chromatin structure-remodeling complex protein RSC8
hmm_output.sto:6v8o_I/416-476          -----EISEKYIEESQAIIQEL.VKLTMEKLESKF.TKLCDLETQlEMEKLKYVKES..eK.M.lN...D....RLSLS-....--------......-.--------..--....------------------------------------------------------------------------------------
hmm_output.sto:#=GR 6v8o_I/416-476  PP .....56799************.************.**9998865378888876554..24.4.25...5....65555.........................................................................................................................
hmm_output.sto:6v8o_I/416-476          -----
hmm_output.sto:#=GR 6v8o_I/416-476  PP .....
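
The hit name `6v8o_I/416-476` and the "Residue range: 415-475" in the error message appear to describe the same span, once with 1-based and once with 0-based numbering. A small sketch for splitting such a hit name into its parts and converting the range, assuming the name always has the form `<pdb>_<chain>/<start>-<end>` (the `parse_hit_name` helper is hypothetical, not a FastFold function):

```python
import re
from typing import NamedTuple


class HitRange(NamedTuple):
    pdb_code: str
    chain_id: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, inclusive


def parse_hit_name(name: str) -> HitRange:
    """Parse a name such as '6v8o_I/416-476' (1-based range) into its parts.

    Hypothetical helper; the name layout is an assumption based on the
    Stockholm lines quoted in this issue.
    """
    match = re.fullmatch(r"([0-9A-Za-z]+)_([0-9A-Za-z]+)/(\d+)-(\d+)", name)
    if match is None:
        raise ValueError("Unexpected hit name: %r" % name)
    pdb_code, chain_id, start, end = match.groups()
    # Convert the 1-based range from the hit name to 0-based indices.
    return HitRange(pdb_code, chain_id, int(start) - 1, int(end) - 1)


if __name__ == "__main__":
    print(parse_hit_name("6v8o_I/416-476"))
```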

Is this a bug, or is the problem on my side?

Another question: is it correct to change the name of the model to params_model_1_multimer_v2.npz? Your README uses params_model_1_multimer.npz, but that file is not included in the downloaded parameter tar file.

All the best and thank you, Dominik

Gy-Lu commented 1 year ago

Hi, can you show us your input.fa file? We will try to reproduce the bug. Also, changing the name of the params file is OK; the v1 params may have been withdrawn after the v2 ones were released.

dominik-handler commented 1 year ago

Hi, thank you for looking into it.

Here is the input file: input.txt

Gy-Lu commented 1 year ago

Hi, I have reproduced it. It seems that one or both of the sequences in input.fa are invalid. This is not really our area, though, and we can't tell why the sequences are invalid. FastFold is a fast, memory-friendly implementation of AlphaFold2, and this run failed because AlphaFold2 does not support predicting these sequences.

Thanks for reporting it to us anyway :-)