openforcefield / protein-ligand-benchmark

Protein-Ligand Benchmark Dataset for Free Energy Calculations
MIT License

Thrombin PDB unusual residue indexes #84

Closed JSLJ23 closed 1 year ago

JSLJ23 commented 1 year ago

I was just wondering why the protein.pdb in /data/thrombin/01_protein/crd/ has unusual residue numbering: residue 148 is followed by 148A, 148B, 148C, 148D & 148E instead of 149, 150, 151, 152 & 153 respectively. The same pattern shows up at residues 184, 186, 204 & 221.

IAlibay commented 1 year ago

> has a bunch of weird residue namings for

@JSLJ23 these are PDB insertion codes. Unfortunately they are not a very well documented feature, but you can see references to them in places like https://www.wwpdb.org/documentation/file-format-content/format33/sect3.html and https://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ATOM.

These icodes are retained in these structures because they exist in the original solved PDB structure (2ZFF in the case of Thrombin).
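For anyone reading along, here is a minimal sketch of how the insertion code can be read out of an ATOM record. In the fixed-column PDB format the chain ID is in column 22, the residue number (resSeq) in columns 23-26 and the insertion code (iCode) in column 27. The function name `parse_residue_id` and the example records are made up for illustration; the residue names and coordinates are not taken from 2ZFF.

```python
def parse_residue_id(line):
    """Return (chain, resSeq, iCode) from a PDB ATOM/HETATM record.

    Columns are 1-based in the PDB specification, so Python slices
    are shifted by one: column 22 -> line[21], columns 23-26 ->
    line[22:26], column 27 -> line[26].
    """
    chain = line[21]
    res_seq = int(line[22:26])
    i_code = line[26].strip()  # empty string when no insertion code
    return chain, res_seq, i_code


# Hypothetical records: residue names and coordinates are invented.
records = [
    "ATOM      1  CA  GLY H 148      10.000  10.000  10.000  1.00 20.00",
    "ATOM      2  CA  GLY H 148A     11.000  10.000  10.000  1.00 20.00",
    "ATOM      3  CA  GLY H 148B     12.000  10.000  10.000  1.00 20.00",
    "ATOM      4  CA  GLY H 149      13.000  10.000  10.000  1.00 20.00",
]

residue_ids = [parse_residue_id(line) for line in records]
for chain, res_seq, i_code in residue_ids:
    print(f"chain {chain} residue {res_seq}{i_code}")
```

The point is that 148, 148A and 148B are three distinct residues: a residue is identified by the combination of chain, number and insertion code, not by the number alone.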

IAlibay commented 1 year ago

Not all tools can fully handle icodes (because their parsers are not fully format compliant). However, since the PDB file itself is format compliant, the decision was taken to keep the icodes so that the original sequence information is retained.
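If a downstream tool cannot cope with icodes, one possible workaround is to keep the file as it is and build a lookup from the original residue identifiers to plain sequential indices. The sketch below is only an illustration under that assumption, not something this repository provides; `sequential_index_map` is a hypothetical helper and the identifiers in the example are invented.

```python
def sequential_index_map(residue_ids):
    """Map (chain, resSeq, iCode) tuples to 1-based indices per chain.

    Residues are numbered in the order they first appear, so the
    file order (not an alphabetical sort of icodes) decides the index.
    """
    mapping = {}
    counters = {}
    for rid in residue_ids:
        if rid in mapping:
            continue  # several atoms share one residue identifier
        chain = rid[0]
        counters[chain] = counters.get(chain, 0) + 1
        mapping[rid] = counters[chain]
    return mapping


# Hypothetical identifiers, as (chain, resSeq, iCode); repeated
# entries mimic several atoms belonging to the same residue.
example_ids = [
    ("H", 148, ""),
    ("H", 148, ""),
    ("H", 148, "A"),
    ("H", 148, "B"),
    ("H", 149, ""),
    ("L", 1, ""),
]

index_map = sequential_index_map(example_ids)
for rid, idx in index_map.items():
    print(rid, "->", idx)
```

Keeping the mapping separate from the file means results can always be translated back to the numbering used in the original structure.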

I'm closing this issue as completed but please do re-open if you think it remains an issue @JSLJ23.

JSLJ23 commented 1 year ago

Ok, just a quick question @IAlibay: do these residues with insertion codes come from some genetic modification or mutation, with the PDB sequence numbered relative to the canonical UniProt sequence? I'm trying to get some idea of why there would be insertions in a sequence.


RCSB's Protein Feature View for 2ZFF shows that the 148 region has unmodelled residues, but I don't see any insertions in the PDB sequence when it is aligned to the UniProt sequence P00734. Am I missing something here?