marrink-lab / vermouth-martinize

Describe and apply transformation on molecular structures and topologies
Apache License 2.0
90 stars 40 forks source link

Mis-handling of some insertion codes #348

Open jbarnoud opened 3 years ago

jbarnoud commented 3 years ago

Edit: The issue is about insertion codes, not alternate location.

Some PDB specify alternate conformation by specifying one conformation without the alternate indicator and the following ones with the indicator. An example of this is 1amh:

ATOM   1230  N   GLY A 184       2.557  -3.590  23.433  1.00 15.72           N
ATOM   1231  CA  GLY A 184       3.626  -3.224  24.374  1.00 15.63           C
ATOM   1232  C   GLY A 184       3.345  -3.716  25.795  1.00 14.94           C
ATOM   1233  O   GLY A 184       2.536  -4.633  26.006  1.00 17.82           O
ATOM   1234  N   PHE A 184A      4.054  -3.065  26.704  1.00 15.27           N
ATOM   1235  CA  PHE A 184A      4.000  -3.345  28.144  1.00 17.29           C
ATOM   1236  C   PHE A 184A      3.605  -2.082  28.915  1.00 19.37           C
ATOM   1237  O   PHE A 184A      3.770  -0.956  28.424  1.00 18.07           O
ATOM   1238  CB  PHE A 184A      5.376  -3.804  28.631  1.00 17.67           C
ATOM   1239  CG  PHE A 184A      5.977  -4.906  27.758  1.00 15.44           C
ATOM   1240  CD1 PHE A 184A      6.772  -4.573  26.654  1.00 14.38           C
ATOM   1241  CD2 PHE A 184A      5.731  -6.247  28.066  1.00 11.27           C
ATOM   1242  CE1 PHE A 184A      7.320  -5.585  25.857  1.00 13.12           C
ATOM   1243  CE2 PHE A 184A      6.279  -7.260  27.269  1.00 10.40           C
ATOM   1244  CZ  PHE A 184A      7.073  -6.929  26.164  1.00 13.43           C

Because we use the atoms that do not have an alternate indicator OR that have it set to A, the example above takes both the conformations which results in DSSP failing:

raise DSSPError(message.format(err=process.stderr, file=tmpfile_name))
vermouth.dssp.dssp.DSSPError: DSSP encountered an error. The message was DSSP could not be created due to an error:
inconsistent residue types in atom records for residue 164 (PHE != GLY)

There are different ways of solving the issue:

pckroon commented 3 years ago

when reading an atom with the alternate indicator set to A, check that we did not already read a residue without an indicator, if so drop the atom after reading the PDB, check the residue graph for redundant residues

This raises the question: What is a residue? Chain, resname, resid? Chain, resname, resid, alternate? Some other combination of attributes?

From the PDB file format spec (3.30, emphasis mine):

AltLoc is the place holder to indicate alternate conformation. The alternate conformation can be in the entire polymer chain, or several residues or partial residue (several atoms within one residue). If an atom is provided in more than one position, then a non-blank alternate location indicator must be used for each of the atomic positions. Within a residue, all atoms that are associated with each other in a given conformation are assigned the same alternate position indicator. There are two ways of representing alternate conformation- either at atom level or at residue level (see examples). For atoms that are in alternate sites indicated by the alternate site indicator, sorting of atoms in the ATOM/HETATM list uses the following general rules:

  • In the simple case that involves a few atoms or a few residues with alternate sites, the coordinates occur one after the other in the entry.
  • In the case of a large heterogen groups which are disordered, the atoms for each conformer are listed together.

So the way I interpret it the protein databank doesn't adhere to its own specification??

Alphabet letters are commonly used for insertion code. The insertion code is used when two residues have the same numbering. The combination of residue numbering and insertion code defines the unique residue.

Does this mean that resid 33 insertion_code B chain A is the same residue as resid 33 insertion_code B chain B??

jbarnoud commented 3 years ago

So the way I interpret it the protein databank doesn't adhere to its own specification??

Some parts may not be as enforced as others. Or it was not always enforced, and now there is too much ambiguity to automatically generate a new version without breaking things.

This raises the question: What is a residue? Chain, resname, resid? Chain, resname, resid, alternate? Some other combination of attributes?

chain + resid + alternate + insertion code, to that we can add the resname for broken cases where you have more than one consecutive residue with the same resid. That broken case can be very broken because the consecutive residues may very well have the same name.

juminlee commented 1 year ago

@pckroon, I would like to bring this up again. It seems like that there is a typo in the pdb.py.

Doesn't following line: https://github.com/marrink-lab/vermouth-martinize/blob/b450b6793e85e4f31542d9232b79a12c229a43ec/vermouth/pdb/pdb.py#L542

supposed to be: insertion_code = get_not_none(node, 'insertion_code', '') ?

pckroon commented 1 year ago

@juminlee Well spotted, you're 100% right! Do you have time to open a PR with a fix?

juminlee commented 1 year ago

Sure. I will make a PR.