G6PDH does not return the correct sequence or EC number

djinnome commented 2 months ago

I pasted this PDB in the pdb file: 6e08.pdb.gz

And this was the email message:

Check twice before you click! This email originated from outside PNNL.

Here are the CLEAN-Contact prediction results for your submission:

                    UniProt ID: g6pdh
                    Sequence: QSDTHIFIIMGASGDLAKKKIYPTIWWLFRDGLLPENTFIVGYARSRLTVADIRKQSEPFFKATPEEKLKLEDFFARNSYVAGQYDDAASYQRLNSHMNALHLGSQANRLFYLALPPTVYEAVTKNIHESCMSQIGWNRIIVEKPFGRDLQSSDRLSNHISSLFREDQIYRIDHYLGKEMVQNLMVLRFANRIFGPIWNRDNIACVILTFKEPFGTEGRGGYFDEFGIIRDVMQNHLLQMLCLVAMEKPASTNSDDVRDEKVKVLKCISEVQANNVVLGQYVGNPDGEGEATKGYLDDPTVPRGSTTATFAAVVLYVENERWDGVPFILRCGKALNERKAEVRLQFHDVAGDIFHQQCKRNELVIRVQPNEAVYTKMMTKKPGMFFNPEESELDLTYGNRYKNVKLPDAYERLILDVFCGSQMHFVRSDELREAWRIFTPLLHQIELEKPKPIPYIYGSRGPTEADELMKRVGFQYEGTYKWVNPH
                    P-value selection: EC:4.2.3.15
                    Max-separation selection: EC:4.2.3.15

                    Do not reply to this email as the mailbox is not monitored.
                    For result or usage issues, please contact the first or corresponding authors of the paper.

However, the correct EC number is: 1.1.1.49 and this is the correct sequence: rcsb_pdb_6E08.fasta.gz

This is the PDB of the AlphaFold prediction: AF-P0AC53-F1-model_v4.pdb.gz

yuxin212 commented 2 months ago

I believe this is related to the PDB file from the Protein Data Bank. I tried several methods to extract the sequence from this file and from other Protein Data Bank files, including pdb2fasta from the Yang Zhang lab, and they all returned the "incorrect" sequence.

However, when I used the pdb of the AlphaFold prediction, I always got the correct sequence.

I think the likely cause is that PDB files from the Protein Data Bank contain some compounds (ligands). If we could remove those compounds from the PDB files, we could probably get the correct sequence, but I don't know how. Alternatively, we could ask users to remove these compounds themselves; otherwise they will get incorrect results.
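For reference, a minimal sketch of one way to strip such compounds, assuming they are stored as HETATM records in the usual fixed-column PDB format. It uses only the Python standard library, and `strip_hetero_atoms` is a hypothetical helper, not part of CLEAN-Contact.

```python
def strip_hetero_atoms(pdb_text):
    """Return the PDB text without HETATM records.

    ANISOU records that directly follow a dropped HETATM record are
    dropped too, so no orphaned anisotropic lines are left behind.
    """
    kept = []
    dropped_previous = False
    for line in pdb_text.splitlines():
        record = line[:6]
        if record == "HETATM":
            dropped_previous = True
            continue
        if record == "ANISOU" and dropped_previous:
            continue
        dropped_previous = False
        kept.append(line)
    return "\n".join(kept)
```

One caveat: modified residues such as selenomethionine (MSE) are also written as HETATM records, so this simple filter would remove them from the chain as well.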

djinnome commented 2 months ago

That makes sense. I think it would be best to provide instructions explaining that CLEAN-Contact only works for ligand-free PDB files, and perhaps to show a helpful error message such as "Ligand detected in structure. Please provide a ligand-free PDB file."
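A rough sketch of what such a check could look like, assuming ligands appear as non-water HETATM records. The function names and the set of water residue names are illustrative assumptions, not existing CLEAN-Contact code.

```python
# Residue names treated as water and ignored (an assumed, non-exhaustive set).
WATER_NAMES = {"HOH", "WAT", "DOD"}


def find_ligands(pdb_text):
    """Return the sorted residue names found in non-water HETATM records."""
    ligands = set()
    for line in pdb_text.splitlines():
        if line[:6] == "HETATM":
            resname = line[17:20].strip()  # residue name, columns 18-20
            if resname and resname not in WATER_NAMES:
                ligands.add(resname)
    return sorted(ligands)


def require_ligand_free(pdb_text):
    """Raise ValueError with a helpful message if any ligand is present."""
    ligands = find_ligands(pdb_text)
    if ligands:
        raise ValueError(
            "Ligand detected in structure (" + ", ".join(ligands) + "). "
            "Please provide a ligand-free PDB file."
        )
```

Calling `require_ligand_free(open(path).read())` before extracting the sequence would turn a silently wrong prediction into an explicit error.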

yuxin212 commented 2 months ago

That's a good idea! But I need some time to find a way to detect ligands in a PDB file.

yuxin212 commented 2 months ago

I believe I figured out the reason. If you look at a PDB file from the Protein Data Bank, for example 6E08.pdb, and navigate to the first line starting with ATOM (line 631 in 6E08.pdb), you will find that the "first" amino acid in the file is GLN. Although the file does provide the actual sequence starting at line 514, both Biopython and biotite, the packages we use to extract contact maps and sequences from PDB files, only read the lines starting with ATOM. As a result, we get not only an incorrect sequence from the PDB file but also an incorrect contact map.
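A minimal sketch of how this mismatch could be detected, assuming the sequence near line 514 is stored in standard SEQRES records (the record type is not named above). It parses the fixed-column text with the Python standard library rather than Biopython or biotite, and all function names are hypothetical.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_sequences(pdb_text):
    """Sequence declared in SEQRES records, keyed by chain ID."""
    seqs = {}
    for line in pdb_text.splitlines():
        if line[:6] == "SEQRES":
            chain = line[11:12]
            for name in line[19:70].split():
                seqs[chain] = seqs.get(chain, "") + THREE_TO_ONE.get(name, "X")
    return seqs


def atom_sequences(pdb_text):
    """Sequence of residues that have ATOM coordinates (first model only)."""
    seqs = {}
    seen = set()
    for line in pdb_text.splitlines():
        if line[:6] == "ENDMDL":
            break
        if line[:6] == "ATOM  ":
            chain = line[21:22]
            # Residue number plus insertion code identifies one residue.
            key = (chain, line[22:27].strip())
            if key not in seen:
                seen.add(key)
                name = line[17:20].strip()
                seqs[chain] = seqs.get(chain, "") + THREE_TO_ONE.get(name, "X")
    return seqs


def unresolved_counts(pdb_text):
    """Per chain, how many declared residues have no ATOM coordinates."""
    declared = seqres_sequences(pdb_text)
    observed = atom_sequences(pdb_text)
    return {
        chain: len(seq) - len(observed.get(chain, ""))
        for chain, seq in declared.items()
    }
```

A nonzero count for a chain means the ATOM-derived sequence is shorter than the declared one. This is a plain length comparison, not an alignment, so it reports that residues are missing but not where.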

Maybe a simple solution is to ask users to use only structures predicted with AlphaFold, ESMFold, or RoseTTAFold, and to avoid PDB files from the Protein Data Bank.

pnnl-predictive-phenomics / clean-contact, issue #4