Open djinnome opened 2 months ago
I believe the problem is related to PDB files from the Protein Data Bank. I tried several methods to extract the sequence from this file and from other Protein Data Bank entries, including pdb2fasta from the Yang Zhang lab, and they all returned the "incorrect" sequences.
However, when I used the PDB file of the AlphaFold prediction, I always got the correct sequence.
I think a possible cause is that PDB files from the Protein Data Bank contain some compounds. If we can remove those compounds from the PDB files, we can probably get the correct sequence, but I don't know how. Otherwise, we should ask users to remove these compounds from their PDB files themselves, or they will get incorrect results.
That makes sense. I think it would be best to provide instructions explaining that CLEAN-Contact only works for ligand-free PDB files, and perhaps to show a helpful error message that says something like "Ligand detected in structure. Please provide a ligand-free PDB file."
That's a good idea! But I need some time to find a way to detect ligands in a PDB file.
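One way the ligand check discussed above could look is sketched here. It is a minimal illustration and not CLEAN-Contact's actual code: the function names `find_hetero_residues` and `check_ligand_free` are hypothetical, and it assumes the input follows the fixed-column PDB format, where non-polymer atoms are stored in HETATM records. Waters are skipped so that they do not trigger the error.

```python
# Hypothetical helpers, standard library only; not part of CLEAN-Contact.
WATER_NAMES = {"HOH", "WAT", "DOD"}


def find_hetero_residues(pdb_text):
    """Return {(chain, residue number, residue name)} for non-water HETATM records."""
    found = set()
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # only inspect the first model of a multi-model file
        if not line.startswith("HETATM"):
            continue
        line = line.ljust(80)  # guard against lines with trimmed trailing spaces
        resname = line[17:20].strip()  # columns 18-20
        if resname in WATER_NAMES:
            continue
        chain = line[21]               # column 22
        resseq = line[22:26].strip()   # columns 23-26
        found.add((chain, resseq, resname))
    return found


def check_ligand_free(pdb_text):
    """Raise ValueError with the message suggested above if a ligand is found."""
    hetero = find_hetero_residues(pdb_text)
    if hetero:
        names = ", ".join(sorted({resname for _, _, resname in hetero}))
        raise ValueError(
            f"Ligand detected in structure ({names}). "
            "Please provide a ligand-free PDB file."
        )
```

Note that modified residues such as selenomethionine (MSE) are also stored as HETATM records, so a real check would probably need an allow-list rather than rejecting every hetero residue.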
I believe I figured out the reason...
If you look at a PDB file from the Protein Data Bank, for example 6E08.pdb, and navigate to the first line starting with ATOM, which is line 631 for 6E08.pdb, you will find that the 'first' amino acid in this file is GLN. Even though the file does provide the actual sequence starting at line 514, both Biopython and biotite, the packages we use to extract contact maps and sequences from PDB files, only recognize lines starting with ATOM. In this case, we get not only an incorrect sequence from the PDB file, but also an incorrect contact map.
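The mismatch can be made visible by reading both sources of sequence information from the same file. The sketch below is only an illustration under stated assumptions: it assumes the "actual sequence" in the header is stored in SEQRES records, as in the standard PDB format, and the function names `seqres_sequences` and `atom_sequences` are hypothetical. It uses plain fixed-column parsing instead of Biopython or biotite so that it has no dependencies.

```python
# Hypothetical helpers, standard library only; not part of CLEAN-Contact.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_sequences(pdb_text):
    """Per-chain sequence as declared in SEQRES records (assumed header source)."""
    chains = {}
    for line in pdb_text.splitlines():
        if line.startswith("SEQRES"):
            chain = line[11]            # column 12
            names = line[19:70].split()  # residue names, columns 20-70
            chains.setdefault(chain, []).extend(names)
    return {c: "".join(THREE_TO_ONE.get(n, "X") for n in names)
            for c, names in chains.items()}


def atom_sequences(pdb_text):
    """Per-chain sequence of residues that actually have ATOM records."""
    chains, seen = {}, set()
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # only inspect the first model
        if not line.startswith("ATOM"):
            continue
        line = line.ljust(80)
        key = (line[21], line[22:27])  # chain, residue number + insertion code
        if key in seen:
            continue  # already counted this residue
        seen.add(key)
        one_letter = THREE_TO_ONE.get(line[17:20].strip(), "X")
        chains.setdefault(line[21], []).append(one_letter)
    return {c: "".join(letters) for c, letters in chains.items()}
```

If the two dictionaries differ for a chain, the coordinates cover only part of the declared sequence, and a warning could be shown before the sequence and contact map are extracted.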
Maybe a simple solution is to ask users to use only structures predicted with AlphaFold, ESMFold, or RoseTTAFold, and to avoid PDB files from the Protein Data Bank.
I pasted this PDB in the pdb file: 6e08.pdb.gz
And this was the email message:
However, the correct EC number is: 1.1.1.49 and this is the correct sequence: rcsb_pdb_6E08.fasta.gz
This is the PDB file of the AlphaFold prediction: AF-P0AC53-F1-model_v4.pdb.gz