Open lohedges opened 2 years ago
A common location of HETATM
records is ACE/NME
caps in a chain. In this case, the records must be located in the correct place within the chain, i.e. first or last residue, not after the TER
. In this case, the HETATM
naming appears to be irrelevant, i.e. tLEaP
will work if the records are renamed to ATOM
, which is what we used to do.
I am increasingly beginning to think that it will be hard to update out parser to reliably create/reconstruct a PDB file from a Sire System, unless we have in place a bunch or robust rules for detecting things, e.g. base on residue name, element, etc. In most cases a PDB will be a direct starting point for our users, so perhaps we want to consider bypassing Sire and directly passing this through to the parameterisation engine, i.e. we only create a system with the output of parameterisation, hence reading/parameterising in one go.
This is challenging. Is it always the case that the atom number increases sequentially in the problematic files? If so, could we do a simple re-ordering of the PDB file after writing to sort the lines into atom number order?
Or could we do something when reading a file that adds a Boolean flag or similar to say when an atom has to be followed by a TER
? Or maybe even have a property in a molecule that is just a list of AtomIdx
values that must be followed by a TER
(and then use this to place the TER
records, skipping their placement via Molecule/Chain?).
A challenge with both these approaches is that they rely on a "correctly" formatted PDB with TER
records as input. They would break as soon as we recombined or edited molecules, or if new molecules were created from scratch (or imported from other file formats). Preserving or creating this extra TER
information could be challenging and error-prone.
I do like the idea of giving users the ability to read and parameterise a PDB file into a Sire system in one go. There are a lot of extra properties that are useful for other bits of the code (e.g. atomic charges, bonding/connectivity information, total charge on the molecule) that are missing or ambiguous from a PDB file. PDBs also tend to have missing atoms (hydrogens, residues from chains etc) which can also lead to special-case or ambiguity-fixing code. The only challenge would be making sure we don't write yet another PDB parser (we already have 2 in Sire...!). It is definitely worth some thought.
It's a bit of a mess to be honest, which is unsurprising given the abuse of the PDB format. Some example input files appear to be generated by chopping out bits of existing PDBs, so things like the numbering, chain identifiers, etc., aren't important.
I am already doing something close to what you suggest, i.e. flagging if an atom is a real TER
atom when loaded from the PDB, i.e. using a boolean value. When present, I'll try to re-use this information when writing the records, rather than simply traversing the chains as I am doing now (which works for a general molecule, which might have been loaded from any format). I have some hacks to re-insert TER
and HETATM
records in the right place given the latter approach, but it's clearly not working in all cases.
From passing a bunch of example PDB files to tLEaP
it is apparent that the HETATM
name is irrelevant, i.e. you can change this to ATOM
with no change to the output. (This is what we used to do, i.e. treat everything as ATOM
, since it makes no difference.) What is important, though, is the position of HETATM
records. Those that appear after the TER
must remain there for things to work. Ideally we'd have some rules to understand which HETATM
records should appear where, but having looked at the files this seems non-trivial. For now I think I'll just add some logic that tries to put the TER
records in the correct place using the boolean flag that I am already storing. I'm not sure if we need to put all post-ter records, i.e. those associated with all chains, at the end of the file, or whether you can safely place them after the TER
then start a new chain immediately after, i.e. without another TER
.
From debugging this BioSimSpace issue it is clear that our handling of HETATM records from PDB files is problematic and needs improving. Unfortunately the formatting of these records seems to be quite variable between PDBs, making it hard to develop a single strategy for dealing with them. For example (copied from the above issue thread):
For example, in this there a
HETAM
s with the same chain identifier before and after theTER
. Some examples of the different formatting:HETATM
in chainB
before and afterTER
, followed byHETATM
s from chainA
.ATOM
andHETATM
interspersed within the same chain.In my option the important thing isn't necessarily the PDB files themselves, rather what
LEaP
etc. require in order to function. (In most cases someone will be simply loading a PDB as a starting point for parametrisation.) As such, seeing howpdb4amber
processes a bunch of files including various types ofHETAM
formatting. In some cases these are converted toATOM
records, in others they are left in place, and sometimes they are even moved. ParmEd uses the approach of labelling everything in a non-standard residue (using template name matching) as aHETATM
, but I'm not sure how it deals with those that are misplaced.Our main problem is that we fully convert the information from the PDB into an internal molecular data structure. Residues in the PDB are reparented to their chains, which are reparented to molecules. When writing back, we reverse this process. If some
HETATM
records need to be placed before the end of a chain (where theTER
record is placed) and some after, this is very tricky to achieve without knowing exactly which ones should go where, and why.I'll try to determine some rules-of-thumb for the position of various
HETATM
records, then test how robust these are. Perhaps it's possible to move all records to the end of the file without issue, i.e. after the finalTER
. This would certainly be the easiest solution.