output mmCIF doesn't distinguish ATOMs from HETATMs and _atom_site.label_asym_id is overwritten by _atom_site.auth_asym_id

MauriceKarrenbrock commented 4 years ago

Hello,

pdbfixer version : latest version on github openmm version : omnia-devel git_revision 36b6caa67e5f485023ec12d40e12f1700e19f618

If a mmCIF with both ATOMs and HETATMs is parsed and fixed all the HETATMs become ATOMs and the _atom_site.label_asym_id is overwritten by _atom_site.auth_asym_id loosing any information about when a molecule ends and another one begins in the same chain (like if TER lines were missing in a PDB file).

import pdbfixer
import simtk.openmm.app

input_filename = '2gz7.cif'
output_filename = '2gz7_repaired.cif'

with open(input_filename, 'r') as f:
    fixer = pdbfixer.PDBFixer(pdbxfile = f)

fixer.findMissingResidues()
fixer.findNonstandardResidues()
fixer.replaceNonstandardResidues()
fixer.findMissingAtoms()
fixer.addMissingAtoms()

with open(output_filename, 'w') as f:
    simtk.openmm.app.pdbxfile.PDBxFile.writeFile(fixer.topology, fixer.positions, f, keepIds = True)

both the mmCIF files are atached as TXT files 2gz7_repaired.txt 2gz7.txt

Thank you very much in advance and have a nice day!

peastman commented 4 years ago

The fix for the HETATM issue is in https://github.com/openmm/openmm/pull/2572. It doesn't try to remember what was in the input file. Instead it selects the label based on the definition of heterogens.

peastman commented 4 years ago

The chain ID part is in https://github.com/openmm/openmm/pull/2575.

MauriceKarrenbrock commented 4 years ago

Hello, thank you very much for your availability ! The HETATM part is perfect!

But I am not sure about the chain part. In fact now the _atom_site.auth_asym_id has been overwritten by the _atom_site.label_asym_id when actually both information need to be kept because they are both important. In a protein we might have more chains and every chain might have a polypeptide chain, some metallic ions, some organic ligands and some solvent molecules. And it is usually necessary to be able to both distinguish between different chains and between different molecules inside chains. Before https://github.com/openmm/openmm/pull/2575 you could distinguish between chains but not between different molecules inside a chain, and after it you can distinguish between molecules but without knowing from which chain they came from.

For example in the 5aol mmCIF file you find 2 chains A and B. In chain A there is

the polypeptide chain (_atom_site.label_asym_id = A _atom_site.auth_asym_id = A)
a UFV organic ligand (_atom_site.label_asym_id = C _atom_site.auth_asym_id = A)
a ZN ion (_atom_site.label_asym_id = D _atom_site.auth_asym_id = A)
and HOH (_atom_site.label_asym_id = F _atom_site.auth_asym_id = A).

In chain B there is

the polypeptide chain (_atom_site.label_asym_id = B _atom_site.auth_asym_id = B)
a ZN ion ((_atom_site.label_asym_id = E _atom_site.auth_asym_id = B)
and HOH (_atom_site.label_asym_id = G _atom_site.auth_asym_id = B)
(no UFV organic ligand).

But after pdbfixer all the _atom_site.auth_asym_id are equal to _atom_site.label_asym_id making it impossible to know from which chain this molecule comes from.

So if I want to make a molecular dynamics on chain A to calculate the binding affinity of UFV with the polypeptide chain I need to be able to select all the things with _atom_site.auth_asym_id = A and to create a correct topology file I need to know where a molecule ends and a new one begins (by checking _atom_site.label_asym_id), this is very important for example in gromacs that relies heavily on what is written in the structure file.

But with https://github.com/openmm/openmm/pull/2575 I get an output file that has a complete information on where a molecule ends, but I don't know anymore things like which ZN ion is binding with which polypeptide chain, with which chain UFV is binding ecc.

So both information should be kept and written in the output file because they are both vital for the rest of the process to get a MD simulation done.

5aol.txt 5aol_repaired.txt

peastman commented 4 years ago

That isn't actually what the fields mean. Here's how the spec defines _atom_site.auth_asym_id:

An alternative identifier for _atom_site.label_asym_id that may be provided by an author in order to match the identification used in the publication that describes the structure.

You may have found a particular file that happens to use the fields in the way you described, but it's totally non-standardized. Authors are free to use them in any way they want.

MauriceKarrenbrock commented 4 years ago

I see what you mean.

And yes the _atom_site.label_asym_id definition that can be found on the spec is exactly what is happening now.

It is a pointer to _struct_asym.id

The value of _struct_asym.id must uniquely identify a record in the STRUCTASYM list. Note that this item need not be a number; it can be any unique identifier. loop _struct_asym.id _struct_asym.entity_id _struct_asym.details A 1 'one monomer of the dimeric enzyme' B 1 'one monomer of the dimeric enzyme' C 2 'one partially occupied position for the inhibitor' D 2 'one partially occupied position for the inhibitor'

But the problem is that in any case we still need something that gives us an information about the "pdb chain" value, telling us which ligand (metallic ion, solvent etc) interacts with which polypeptide chain. And something that takes the place of the "pdb TER line" whose definition is:

The TER records occur in the coordinate section of the entry, and indicate the last residue presented for each polypeptide and/or nucleic acid chain for which there are determined coordinates. For proteins, the residue defined on the TER record is the carboxy-terminal residue; for nucleic acids it is the 3'-terminal residue.

So I guess that, probably in an unofficial way, the _atom_site.auth_asym_id has become the way authors do tell the "reader" about the "pdb chain" value. This is of course 100% empirical but I have been using tons of mmCIF files from the wwpdb in the last months and I never found one that didn't follow this pattern. (The only strange thing is that sometimes the _atom_site.auth_asym_id doesn't start from A but from some other random letter).

And actually if you check the pdb to mmcif table you will see that _atom_site.auth_asym_id is considered as the Strand_ID in pdb files, and the Strand_ID should be the "pdb chain" value (if I'm not wrong). While there is no correspondence for _atom_site.label_asym_id. So I guess that using _atom_site.auth_asym_id as "pdb chain" value could be considered as an "almost standard" way of solving the problem.

In any case I am afraid that there will never be a clean solution to this thing as the standard is never completely followed and isn't always easy to interpret but to do MD simulations it is necessary to have both "pdb chain" like and "pdb TER line" like information.

openmm / pdbfixer

output mmCIF doesn't distinguish ATOMs from HETATMs and _atom_site.label_asym_id is overwritten by _atom_site.auth_asym_id #195