How to handle HETATM records in PDB files

From debugging this BioSimSpace issue it is clear that our handling of HETATM records from PDB files is problematic and needs improving. Unfortunately, the formatting of these records varies considerably between PDB files, making it hard to develop a single strategy for dealing with them. The following is copied from the above issue thread:

For example, in this file there are HETATM records with the same chain identifier before and after the TER. Some examples of the different formatting:

HETATM in chain B before and after TER, followed by HETATMs from chain A.

...
ATOM   2097 HD11 ILE B  36      -7.894   6.751 -22.957  1.00 52.74           H
ATOM   2098 HD12 ILE B  36      -8.945   7.001 -24.122  1.00 52.74           H
ATOM   2099 HD13 ILE B  36      -8.598   5.518 -23.670  1.00 52.74           H
HETATM 2100  N   NH2 B  37      -7.355   7.417 -29.288  1.00 58.31           N
TER    2101      NH2 B  37
HETATM 2102 ZN    ZN B 101       0.000   0.000  -9.201  0.33 15.72          ZN
HETATM 2103  O  AHOH A 201     -30.782  29.811 -17.433  0.50 20.93           O
HETATM 2104  O  BHOH A 201     -30.377  31.224 -16.358  0.50 18.33           O
HETATM 2105  O   HOH A 202     -10.750  28.703 -23.497  1.00 39.82           O
...

ATOM and HETATM interspersed within the same chain.

...
HETATM 2006  HEABXCP B  31      -6.322  12.783 -15.760  0.37 16.94           H
HETATM 2007  HA AXCP B  31      -6.311  10.105 -16.572  0.63 23.43           H
HETATM 2008  HA BXCP B  31      -5.758  10.612 -16.628  0.37 19.64           H
ATOM   2009  N  AHIS B  32      -5.542  10.707 -18.873  0.63 18.98           N
ANISOU 2009  N  AHIS B  32     1967   2279   2965    480   -109    265       N
ATOM   2010  N  BHIS B  32      -5.238  10.930 -18.956  0.37 18.62           N
ANISOU 2010  N  BHIS B  32     1887   2264   2926    494    -76    294       N
ATOM   2011  CA AHIS B  32      -4.988  11.199 -20.158  0.63 21.40           C
...

In my opinion the important thing isn't necessarily the PDB files themselves, but rather what LEaP etc. require in order to function. (In most cases someone will simply be loading a PDB as a starting point for parametrisation.) As such, I've been looking at how pdb4amber processes a bunch of files that include various types of HETATM formatting. In some cases the records are converted to ATOM records, in others they are left in place, and sometimes they are even moved. ParmEd labels everything in a non-standard residue (using template name matching) as a HETATM, but I'm not sure how it deals with records that are misplaced.
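To make the "label by residue name" idea concrete, here is a minimal sketch. It is not ParmEd's actual code: the `STANDARD_RESIDUES` set and both function names are hypothetical, and a real template library would contain many more names (protonation variants, nucleic acids, etc.). It only rewrites the record name in columns 1-6 and never moves a line.

```python
# Hypothetical stand-in for a residue template library (20 common amino acids only).
STANDARD_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def record_name_for(resname):
    """Return the 6-character PDB record name for a residue name."""
    if resname.strip().upper() in STANDARD_RESIDUES:
        return "ATOM  "
    return "HETATM"


def relabel(line):
    """Relabel one ATOM/HETATM line from its residue name (columns 18-20)."""
    if line.startswith(("ATOM  ", "HETATM")):
        return record_name_for(line[17:20]) + line[6:]
    # TER, ANISOU, REMARK, etc. are passed through untouched.
    return line
```

With this set, a water or `NH2` residue is labelled HETATM and an `ILE` residue is labelled ATOM, regardless of how the input labelled them.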

Our main problem is that we fully convert the information from the PDB into an internal molecular data structure. Residues in the PDB are reparented to their chains, which are reparented to molecules. When writing back, we reverse this process. If some HETATM records need to be placed before the end of a chain (where the TER record is placed) and some after, this is very tricky to achieve without knowing exactly which ones should go where, and why.

I'll try to determine some rules-of-thumb for the position of various HETATM records, then test how robust these are. Perhaps it's possible to move all records to the end of the file without issue, i.e. after the final TER. This would certainly be the easiest solution.
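A quick way to test the "move everything to the end" idea is a small script like the one below. This is only a sketch for running the experiment, not a recommended fix: the function name is hypothetical, it assumes a single-model file, and it blindly moves every HETATM record (plus any ANISOU line that directly follows one) to just before the trailing END records.

```python
def move_hetatm_to_end(lines):
    """Move all HETATM records after the last remaining record, before END."""
    kept, moved = [], []
    last_was_het = False
    for line in lines:
        record = line[:6]
        if record == "HETATM":
            moved.append(line)
            last_was_het = True
        elif record == "ANISOU" and last_was_het:
            # Keep anisotropic data together with the HETATM it belongs to.
            moved.append(line)
        else:
            kept.append(line)
            last_was_het = False

    # Find where any trailing END/ENDMDL records start.
    end_idx = len(kept)
    while end_idx > 0 and kept[end_idx - 1].startswith("END"):
        end_idx -= 1
    return kept[:end_idx] + moved + kept[end_idx:]
```

The output can then be passed to tLEaP and compared against the result for the unmodified file.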

A common location of HETATM records is ACE/NME caps in a chain. Here, the records must be located in the correct place within the chain, i.e. as the first or last residue, not after the TER. The HETATM naming appears to be irrelevant in this case, i.e. tLEaP will work if the records are renamed to ATOM, which is what we used to do.

I am increasingly beginning to think that it will be hard to update our parser to reliably create/reconstruct a PDB file from a Sire System, unless we have in place a bunch of robust rules for detecting things, e.g. based on residue name, element, etc. In most cases a PDB will be a direct starting point for our users, so perhaps we want to consider bypassing Sire and directly passing this through to the parameterisation engine, i.e. we only create a system with the output of parameterisation, hence reading/parameterising in one go.

This is challenging. Is it always the case that the atom number increases sequentially in the problematic files? If so, could we do a simple re-ordering of the PDB file after writing to sort the lines into atom number order?
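As a rough sketch of the re-ordering idea, the following sorts the numbered records of an already-written file by the serial number in columns 7-11. The function names are hypothetical. It assumes every ATOM, HETATM, ANISOU and TER line carries a plain decimal serial (a bare `TER` with no serial would raise a `ValueError`), and that the serials are unique apart from an ANISOU sharing the serial of its atom.

```python
SERIAL_RECORDS = ("ATOM  ", "HETATM", "ANISOU", "TER   ")


def has_serial(line):
    """True for record types that carry a serial number in columns 7-11."""
    return line[:6].ljust(6) in SERIAL_RECORDS


def serial_key(line):
    """Sort by serial; an ANISOU line goes directly after its own atom."""
    rank = 1 if line.startswith("ANISOU") else 0
    return (int(line[6:11]), rank)


def sort_by_serial(lines):
    """Re-order numbered records by serial, keeping header and footer lines."""
    numbered = [line for line in lines if has_serial(line)]
    if not numbered:
        return list(lines)
    first = next(i for i, line in enumerate(lines) if has_serial(line))
    header = list(lines[:first])
    # Unnumbered lines after the first coordinate record (e.g. END) go last.
    footer = [line for line in lines[first:] if not has_serial(line)]
    return header + sorted(numbered, key=serial_key) + footer
```

This only helps if the serials stored on the atoms still reflect the original file order, which is exactly the question raised above.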

Or could we do something when reading a file that adds a Boolean flag or similar to say when an atom has to be followed by a TER? Or maybe even have a property in a molecule that is just a list of AtomIdx values that must be followed by a TER (and then use this to place the TER records, skipping their placement via Molecule/Chain?).
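The "list of indices that must be followed by a TER" variant could look something like this on the writing side. It is a sketch only: plain integers stand in for Sire's AtomIdx values, `write_with_ter` is a hypothetical name, and a bare `TER` is emitted for brevity rather than a fully formatted record with serial, residue name and chain.

```python
def write_with_ter(atom_lines, ter_after):
    """Interleave TER records after the atom indices listed in ter_after.

    atom_lines: already-formatted ATOM/HETATM lines, in output order.
    ter_after:  set of 0-based atom indices that must be followed by a TER.
    """
    out = []
    for idx, line in enumerate(atom_lines):
        out.append(line)
        if idx in ter_after:
            out.append("TER")
    return out
```

Because the TER placement is driven purely by the index set, the chain traversal no longer decides where a chain ends. The set would, however, need updating whenever atoms are added, removed or re-ordered.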

A challenge with both these approaches is that they rely on a "correctly" formatted PDB with TER records as input. They would break as soon as we recombined or edited molecules, or if new molecules were created from scratch (or imported from other file formats). Preserving or creating this extra TER information could be challenging and error-prone.

I do like the idea of giving users the ability to read and parameterise a PDB file into a Sire system in one go. There are a lot of extra properties that are useful for other bits of the code (e.g. atomic charges, bonding/connectivity information, total charge on the molecule) that are missing from, or ambiguous in, a PDB file. PDBs also tend to have missing atoms (hydrogens, residues from chains, etc.), which can also lead to special-case or ambiguity-fixing code. The only challenge would be making sure we don't write yet another PDB parser (we already have 2 in Sire...!). It is definitely worth some thought.

It's a bit of a mess to be honest, which is unsurprising given the abuse of the PDB format. Some example input files appear to be generated by chopping out bits of existing PDBs, so things like the numbering, chain identifiers, etc., aren't important.

I am already doing something close to what you suggest, i.e. flagging if an atom is a real TER atom when loaded from the PDB, i.e. using a boolean value. When present, I'll try to re-use this information when writing the records, rather than simply traversing the chains as I am doing now (which works for a general molecule, which might have been loaded from any format). I have some hacks to re-insert TER and HETATM records in the right place given the latter approach, but it's clearly not working in all cases.
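For illustration, the read-side flagging might be sketched as below. This is not the actual Sire parser: the function name and the plain list-of-booleans representation are assumptions, and ANISOU and other non-coordinate lines are simply dropped to keep the example short.

```python
def read_ter_flags(lines):
    """Return (atom_lines, flags); flags[i] is True if atom i precedes a TER."""
    atom_lines, flags = [], []
    for line in lines:
        record = line[:6]
        if record in ("ATOM  ", "HETATM"):
            atom_lines.append(line)
            flags.append(False)
        elif record.strip() == "TER" and flags:
            # Mark the most recently read atom as a "real" TER atom.
            flags[-1] = True
    return atom_lines, flags
```

For the first excerpt at the top of this thread, the `NH2` nitrogen (serial 2100) would be flagged and the zinc that follows the TER would not.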

From passing a bunch of example PDB files to tLEaP it is apparent that the HETATM name is irrelevant, i.e. you can change this to ATOM with no change to the output. (This is what we used to do, i.e. treat everything as ATOM, since it makes no difference.) What is important, though, is the position of HETATM records. Those that appear after the TER must remain there for things to work. Ideally we'd have some rules to understand which HETATM records should appear where, but having looked at the files this seems non-trivial. For now I think I'll just add some logic that tries to put the TER records in the correct place using the boolean flag that I am already storing. I'm not sure if we need to put all post-TER records, i.e. those associated with all chains, at the end of the file, or whether you can safely place them after the TER and then start a new chain immediately after, i.e. without another TER.

michellab / Sire

How to handle HETATM records in PDB files #409