sokrypton / ColabFold

Making Protein folding accessible to all!
MIT License

template cif contains insertion code #498

Open data2code opened 1 year ago

data2code commented 1 year ago

Expected Behavior

I would like to reuse a template folder for multiple ColabFold runs. So I run ColabFold first on the following input sequence:

>seq
EIVLTQSPGTQSLSPGERATLSCRASQSVGNNKLAWYQQRPGQAPRLLIYGASSRPSGVADRFSGSGSGTDFTLTISRLEPEDFAVYYCQQYGQSLSTFGQGTKVEVKRTV:NWFDITNWLWYIK:VQLVQSGAEVKRPGSSVTVSCKASGGSFSTYALSWVRQAPGRGLEWMGGVIPLLTITNYAPRFQGRITITADRSTSTAYLELNSLRPEDTAVYYCAREGTTGDGDLGKPIGAFAHWGQGTLVTVSS

It finds many templates:

2023-09-25 06:30:30,521 Sequence 0 found templates: ['6xe1_L', '7lk9_B', '7s5r_C', '7b0b_L', '7x29_G', '5i1k_L', '6ghg_B', '7kql_L', '7tbf_L', '6ol5_L', '7d0c_F', '6wir_B', '5xmh_L', '6o25_J', '6o29_B', '5gmq_C', '4ypg_L', '4xcy_I', '7u0d_P', '5w1k_N']
2023-09-25 06:30:35,871 Sequence 1 found templates: ['5cil_P', '5x08_P', '7ekk_P', '4wy7_P', '4xbe_P', '5cin_P', '6o3j_G', '6o42_G', '6o42_I', '7ekb_P', '2fx7_P', '4xaw_P', '6o3g_G', '6o3g_I', '6o3g_Q', '6o3g_S', '6o3j_I', '6o3l_D', '6o3l_E', '6snc_P']
2023-09-25 06:30:46,110 Sequence 2 found templates: ['5cil_H', '4llv_C', '4xce_C', '4xcn_A', '4ngh_H', '4xce_A', '4xce_H', '4xbp_A', '4xc3_H', '4xcy_H', '7bpk_H', '4xbp_C', '4xbp_E', '7f7e_C', '7bep_D', '5e08_H', '5gzn_C', '5gzn_H', '7czt_I', '6ehw_B']

I pull all the .cif files from seqenv/templates/.cif into one new folder called "mytemplates"

I then start a second ColabFold run with --custom-template-path pointing at mytemplates and expect it to work, but ColabFold fails.

Current Behavior

When using mytemplates as the --custom-template-path, ColabFold complains (I added the problematic template name to the error message):

mk_hhsearch_db
    raise ValueError(
ValueError: PDB **mytemplates/7u0d.cif** contains an insertion code at chain O and residue index 52. These are not supported.

Why is 7u0d.cif accepted on the first run but rejected when it is used as a custom template?
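To see in advance which files in a template folder would trigger this error, the folder can be scanned for insertion codes before it is passed to ColabFold. The following is a minimal sketch using only the Python standard library, not ColabFold's own check. It assumes the coordinates are stored in a regular `_atom_site` loop with whitespace-separated fields and no quoted values containing spaces; the function names `find_insertion_codes` and `scan_folder` and the sample data are hypothetical.

```python
from pathlib import Path

# Tiny made-up mmCIF fragment for illustration (not taken from 7u0d.cif).
SAMPLE_CIF = """data_demo
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.auth_asym_id
_atom_site.auth_seq_id
_atom_site.pdbx_PDB_ins_code
ATOM 1 CA GLY O 52 ?
ATOM 2 CA SER O 52 A
ATOM 3 CA THR O 53 ?
#
"""


def find_insertion_codes(cif_text):
    """Return sorted (chain, residue number, insertion code) triples
    found in the _atom_site loop of an mmCIF file."""
    columns = []
    hits = set()
    in_atom_site = False
    for line in cif_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("_atom_site."):
            # Column declaration such as "_atom_site.auth_seq_id".
            columns.append(stripped.split(".", 1)[1])
            in_atom_site = True
            continue
        if not in_atom_site:
            continue
        if not stripped or stripped.startswith(("#", "loop_", "_")):
            break  # the _atom_site loop has ended
        fields = stripped.split()
        if len(fields) != len(columns):
            continue  # row this simple splitter cannot handle
        row = dict(zip(columns, fields))
        icode = row.get("pdbx_PDB_ins_code", "?")
        if icode not in ("?", "."):  # "?" and "." mean "no value" in mmCIF
            chain = row.get("auth_asym_id", "?")
            hits.add((chain, int(row.get("auth_seq_id", "0")), icode))
    return sorted(hits)


def scan_folder(folder):
    """Map each .cif file name in `folder` to its insertion-code residues."""
    result = {}
    for path in sorted(Path(folder).glob("*.cif")):
        found = find_insertion_codes(path.read_text())
        if found:
            result[path.name] = found
    return result


print(find_insertion_codes(SAMPLE_CIF))
```

Calling `scan_folder("mytemplates")` would list only the files that contain at least one insertion code, which makes it possible to exclude them instead of editing batch.py.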

Since I need to predict multiple sequences that differ only by small mutations, I would like to reuse the templates without querying the MSA server each time.

Thanks!

Steps to Reproduce (for bugs)

Please make sure to reproduce the issue after a "Factory Reset" in Colab. If running locally, update your local colabfold_batch installation to the newest version. Please provide your input if you can share it.

ColabFold Output (for bugs)

Please make sure to also post the complete ColabFold output. You can use gist.github.com for large output.

Context

Providing context helps us come up with a solution and improve our documentation for the future.

Your Environment

Include as many relevant details about the environment you experienced the bug in.

data2code commented 1 year ago

I wonder whether this insertion code check is really necessary. If I comment it out, ColabFold seems to work. It would be great to get your expert opinion. Thanks.

In batch.py, it works if I simply comment out the five lines below:

            for chain in model:
                amino_acid_res = []
                for res in chain:
                    #if res.id[2] != " ":
                    #    raise ValueError(
                    #        f"PDB contains an insertion code at chain {chain.id} and residue "
                    #        f"index {res.id[1]}. These are not supported."
                    #    )
                    amino_acid_res.append(
                        residue_constants.restype_3to1.get(res.resname, "X")
                    )