sokrypton / ColabFold

Making Protein folding accessible to all!
MIT License

Generating template error #409

Open Abhishaike opened 1 year ago

Abhishaike commented 1 year ago

I'd like to get only the MSA + template, but am running into issues with the template features upon using get_msa_and_templates. It works perfectly fine when use_templates is False, but I get an hhsearch issue when it's set to True.

I'm also happy to move my template search away from ColabFold, since I get the feeling that templates are still being worked on here. Is there an alternative library I could use to generate .pdb template files?

Here's a minimal reproducible example.

bash:

mkdir fasta_vol
mkdir result

echo ">Sequence_1" >> fasta_vol/output1.fasta
echo "MSGMKK:LYEYTVTTLDEFL:EKLKEFILNTSKDKIYKLTITN" >> fasta_vol/output1.fasta
echo ">Sequence_2" >> fasta_vol/output2.fasta
echo "VKLPINGW:AVYVHRTLMSCPVGEAWSASACHDG" >> fasta_vol/output2.fasta

python:

import os
from colabfold.batch import get_msa_and_templates, get_queries, safe_filename, msa_to_str
from colabfold.utils import (DEFAULT_API_SERVER)
from pathlib import Path
import shutil

fasta_volume_path = 'fasta_vol'
a3m_volume_path = "a3m_vol"
msa_mode = 'mmseqs2_uniref_env'

queries, is_complex = get_queries(fasta_volume_path)
for job_number, (raw_jobname, query_sequence, _) in enumerate(queries):
    jobname = safe_filename(raw_jobname)
    (unpaired_msa, paired_msa, query_seqs_unique, query_seqs_cardinality, template_features) \
                  = get_msa_and_templates(jobname = jobname, 
                                          query_sequences = query_sequence, 
                                          result_dir = Path(a3m_volume_path), 
                                          msa_mode = msa_mode, 
                                          use_templates = True, 
                                          custom_template_path = None, 
                                          pair_mode = "unpaired_paired", 
                                          host_url = DEFAULT_API_SERVER)
    msa = msa_to_str(unpaired_msa, paired_msa, query_seqs_unique, query_seqs_cardinality)
    Path(a3m_volume_path).joinpath(f"{jobname}.a3m").write_text(msa)

Resulting error:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "venv/lib/python3.10/site-packages/colabfold/batch.py", line 780, in get_msa_and_templates
    template_feature = mk_template(
  File "venv/lib/python3.10/site-packages/colabfold/batch.py", line 172, in mk_template
    hhsearch_result = hhsearch_pdb70_runner.query(a3m_lines)
  File "venv/lib/python3.10/site-packages/alphafold/data/tools/hhsearch.py", line 86, in query
    process = subprocess.Popen(
  File "python/lib/python3.10/subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "python/lib/python3.10/subprocess.py", line 1847, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'hhsearch'

milot-mirdita commented 1 year ago

Please use the run_mmseqs2 function instead. That one doesn't require hhsearch and it will return the MSA in a3m format, the PDB-list in m8 format and the PDB files.
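A minimal sketch of what that call might look like, assuming the installed ColabFold version exposes `run_mmseqs2` in `colabfold.colabfold` with `use_env` and `use_templates` keyword arguments, and that it returns a pair (a3m lines, template paths) when `use_templates=True`. The helper names `split_chains` and `fetch_msa_and_templates` are hypothetical and only serve the illustration; check the signature in your installed version before relying on it.

```python
from typing import List


def split_chains(query: str) -> List[str]:
    """Split a colon-separated complex query ("SEQ1:SEQ2") into its chains."""
    return [chain for chain in query.upper().split(":") if chain]


def fetch_msa_and_templates(query: str, prefix: str):
    """Query the MSA server for MSAs and templates without calling hhsearch.

    Assumption: run_mmseqs2 returns (a3m_lines, template_paths) when
    use_templates=True. This performs network requests against the server.
    """
    # Imported lazily so this module loads even where ColabFold is not installed.
    from colabfold.colabfold import run_mmseqs2

    a3m_lines, template_paths = run_mmseqs2(
        split_chains(query),
        prefix,              # output prefix for the downloaded result files
        use_env=True,
        use_templates=True,
    )
    return a3m_lines, template_paths
```

Calling `fetch_msa_and_templates("MSGMKK:LYEYTVTTLDEFL:EKLKEFILNTSKDKIYKLTITN", "a3m_vol/output1")` would then contact the server; nothing in this path spawns the `hhsearch` binary.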

Abhishaike commented 1 year ago

That worked! But I'm now getting this error when trying to feed them into ColabFold: ValueError: PDB contains an insertion code at chain A and residue index 169. These are not supported.

Given that these are .cif files generated directly by colabfold, should I expect there to be insertion codes?

If important, this is the custom template directory I'm feeding in:

1ivf.cif  1w21.cif  3cl0.cif  3san.cif  4b7j.cif  4mju.cif  5nz4.cif  6hg0.cif  6lxi.cif  6pzd.cif          pdb70_a3m.ffindex   pdb70_cs219.ffindex
1nmc.cif  3b7e.cif  3sal.cif  3tia.cif  4h53.cif  4qn4.cif  6crd.cif  6hgb.cif  6lxk.cif  pdb70_a3m.ffdata  pdb70_cs219.ffdata

milot-mirdita commented 1 year ago

These files come straight from the PDB, with all the variance that includes. I'll have to look into it.
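To see which of the downloaded files carry insertion codes before passing them on as templates, something like the following could be used. It is a rough sketch, not part of ColabFold: it assumes the standard mmCIF `_atom_site` loop with a `_atom_site.pdbx_PDB_ins_code` column and whitespace-separated rows, and the function name `find_insertion_codes` is made up for this example.

```python
from typing import Iterable, List, Tuple

# Tiny made-up mmCIF fragment, used only to demonstrate the function.
SAMPLE_CIF = """\
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.auth_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.auth_asym_id
ATOM 1 N 168 ? A
ATOM 2 CA 169 A A
ATOM 3 C 169 A A
#
"""


def find_insertion_codes(cif_lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return unique (chain, residue number, insertion code) triples."""
    columns: List[str] = []
    in_atom_site = False
    found: List[Tuple[str, str, str]] = []
    for raw in cif_lines:
        line = raw.strip()
        if line.startswith("_atom_site."):
            columns.append(line.split()[0])  # collect the column names
            in_atom_site = True
            continue
        if not in_atom_site:
            continue
        if not line or line == "#" or line.startswith(("loop_", "_")):
            break  # the atom_site table has ended
        fields = line.split()
        if len(fields) != len(columns):
            continue  # skip rows this simple splitter cannot handle
        row = dict(zip(columns, fields))
        ins = row.get("_atom_site.pdbx_PDB_ins_code", "?")
        if ins not in ("?", "."):  # "?" and "." mean "no value" in mmCIF
            entry = (
                row.get("_atom_site.auth_asym_id", "?"),
                row.get("_atom_site.auth_seq_id", "?"),
                ins,
            )
            if entry not in found:
                found.append(entry)
    return found


print(find_insertion_codes(SAMPLE_CIF.splitlines()))
```

For a real file, pass `open("1ivf.cif")` (or any other template file) instead of the sample lines. Files for which the list is non-empty are the ones likely to trigger the ValueError above.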

samuelmurail commented 1 month ago

Hello,

I have a similar issue, but when using batch.run(). With the template option pdb100, I get this error:

2024-06-05 18:51:24,436 Could not get MSA/templates for Test_1dc59_1: [Errno 2] No such file or directory: 'hhsearch'
Traceback (most recent call last):
  File "/shared/projects/alphafold/murail/conda/env/colabfold_tmp_2/lib/python3.10/site-packages/colabfold/batch.py", line 1453, in run
    = get_msa_and_templates(jobname, query_sequence, a3m_lines, result_dir, msa_mode, use_templates,
  File "/shared/projects/alphafold/murail/conda/env/colabfold_tmp_2/lib/python3.10/site-packages/colabfold/batch.py", line 781, in get_msa_and_templates
    template_feature = mk_template(
  File "/shared/projects/alphafold/murail/conda/env/colabfold_tmp_2/lib/python3.10/site-packages/colabfold/batch.py", line 132, in mk_template
    hhsearch_result = hhsearch_pdb70_runner.query(a3m_lines)
  File "/shared/projects/alphafold/murail/conda/env/colabfold_tmp_2/lib/python3.10/site-packages/alphafold/data/tools/hhsearch.py", line 86, in query
    process = subprocess.Popen(
  File "/shared/projects/alphafold/murail/conda/env/colabfold_tmp_2/lib/python3.10/subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/shared/projects/alphafold/murail/conda/env/colabfold_tmp_2/lib/python3.10/subprocess.py", line 1863, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'hhsearch'

And here is my env.yml:

name: colabfold-1.5.5
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - colabfold=1.5.5
  - kalign2=2.04
  - hhsuite=3.3.0
  - openmm=7.7.0 
  - pdbfixer
  - jax[cuda11_pip]==0.4.23
  - ipykernel
  - ipywidgets
  - seaborn>=0.11
  - pandas>=1.3.4
  - nglview>=3.0
  - gcc_linux-64
  - pip
  - pip:
    - colabfold_jupyter@git+https://gitlab.rpbs.univ-paris-diderot.fr/rpbs/colabfold_jupyter.git@main
    - tqdm>=4.0
    - pdb_numpy>=0.0.6
    - cmcrameri>=1.7
    - git+https://github.com/samuelmurail/af2_analysis.git@main

Any idea how to solve this error?
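One thing worth checking, since the traceback ends in `subprocess.Popen` failing to find `hhsearch`: whether the binary is visible on the `PATH` of the process that runs ColabFold. The sketch below is only a diagnostic under that assumption. The thread does not confirm the cause, and `locate_tools` is a hypothetical helper name. It looks the tool up on the current `PATH` and then again with the environment's own `bin` directory added.

```python
import os
import shutil
import sys
from typing import Dict, Iterable, Optional


def locate_tools(
    names: Iterable[str], extra_dirs: Iterable[str] = ()
) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not found."""
    search_path = os.pathsep.join([*extra_dirs, os.environ.get("PATH", "")])
    return {name: shutil.which(name, path=search_path) for name in names}


# bin directory of the Python environment running this script
env_bin = os.path.join(sys.prefix, "bin")

print("current PATH:   ", locate_tools(["hhsearch"]))
print("with env's bin: ", locate_tools(["hhsearch"], extra_dirs=[env_bin]))
```

If only the second lookup finds `hhsearch`, the environment is installed but not activated for this process (this can happen in a Jupyter kernel or a batch job). In that case, prepending `env_bin` to `os.environ["PATH"]` before calling `batch.run()` may be enough.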