Missing terminal oxygen after addMissingAtoms()

MauriceKarrenbrock commented 4 years ago

versions: pdbfixer 1.6 openmm 7.4.0 py36_cuda101_rc_1 omnia

Hello, I noticed that after using addMissingAtoms() the terminal oxygen of the last residue of some chains keeps missing. For the work I'm doing I am only checking chain A (of the first model if it is a NMR structure) so I don't know what happens to the other chains of the given mmCIFs.

some example protein mmCIF files from the wwpdb are: 2gz7 2gz8 5aut 4zzy 4zei 1a8i ecc... As said I did only check chain A

import pdbfixer
import simtk.openmm.app

list = ['2gz7',
'2gz8']

for id in list:
    with open(f"{id}.cif", 'r') as f:
        fixer = pdbfixer.PDBFixer(pdbxfile = f)
        fixer.findMissingResidues()
        fixer.findNonstandardResidues()
        fixer.replaceNonstandardResidues()
        fixer.findMissingAtoms()
        fixer.addMissingAtoms()

    with open(f"{id}_repaired.cif", 'w') as f:
        simtk.openmm.app.pdbxfile.PDBxFile.writeFile(fixer.topology, fixer.positions, f, keepIds= True)

Thank you very much in advance!

p.s. after how long would the software updates here on github be available on conda? (I am asking in case the problem was inside openmm and not pdbfixer, because I have never been able to compile openmm from source and make the python wrapper work)

peastman commented 4 years ago

I can reproduce the problem. Using the PDB version of 2gz7 it adds the terminal oxygen, but using the PDBx/mmCIF version it doesn't. Let me investigate and see what's happening.

We don't have a date yet for the next OpenMM release, but we've mostly finalized the feature list and we're just trying to finish up the remaining features. So hopefully it won't be too long. You also can install the latest nightly build with

conda install -c omnia-dev openmm

peastman commented 4 years ago

I'm not sure what to make of this. Unfortunately, there are a lot of ambiguities in the definition of the format, and a lot of inconsistencies between what the documentation says to do and what you find in practice. (Just like PDB. I'd hoped things would be better with the new format, but they don't seem to be.)

There are two sets of duplicate properties for atoms, one set starting with the prefix label and one starting with the prefix auth. It's often unclear what the difference between them is supposed to be. For identifying chains, the two properties you could potentially use are _atom_site.label_asym_id and _atom_site.auth_asym_id. According to http://mmcif.wwpdb.org/docs/pdb_to_pdbx_correspondences.html#ATOMP, _atom_site.auth_asym_id is the recommended equivalent for chain ID in a PDB file, so that's what OpenMM looks at.

But in the case of 3gz7.cif, that property is A for every atom, whereas _atom_site.label_asym_id has different values to mark the different chains. So in this particular file, that's clearly the one that was intended to be used to mark chains. But in other files it may well be the other. Which is really a mess!

Perhaps we should look at both properties and start a new chain if either one changes? Or see which one has more distinct values and use that one?

MauriceKarrenbrock commented 4 years ago

I see what you mean, and yes it is quite a mess. This is what I have understood yet: _atom_site.auth_asym_id is the pdb chain that can contain both ATOM and HETATM (like a piece of protein, a ligand and some water) exactly like in the old pdb. While _atom_site.label_asym_id changes any time it meets something different like a ligand or some water even if they are still part of the same chain. 2gz7 is actually made of only chain A as you can see on the description on https://www.rcsb.org/structure/2gz7 and that's why _atom_site.auth_asym_id never changes, but in the chain there is a ligand and some waters and _atom_site.label_asym_id changes A for the protein, B for the ligand, and C for the waters.

But I noticed that the mmCIFs that give problems are made of only one chain, so I guess that the problem is that when reading a mmCIF openmm is not able to understand when a chain ends if there is not another chain after it. So I think that you should keep checking the _atom_site.auth_asym_id as you already did. The only thing to do is to tell openmm that the last residue of a structure shall be considered the end of the current chain even if there is no other chain after it.

peastman commented 4 years ago

The only thing to do is to tell openmm that the last residue of a structure shall be considered the end of the current chain even if there is no other chain after it.

If it were the last residue of the structure it would be considered the end of the chain. But in the file, the last protein residue is immediately followed by the D3F heterogen, and then by a lot of water. OpenMM thinks all of that is a single chain, because _atom_site.auth_asym_id has the same value for all of them. So as far as it's concerned, the terminal residue is the last water molecule.

MauriceKarrenbrock commented 4 years ago

I see, but if I am not wrong this happens in pdb files too. And as you said on the pdb the oxygen is added, so I guess that some inspiration could be found in the function that reads the pdb files.

In any case a, maybe not very robust, patch could be this one. A chain could end in three ways A) there are no hetero atoms so _atom_site.auth_asym_id changes (for example from A to B) or the mmcif file ends and the last atom has ATOM as first value

B) there are hetero atoms so the _atom_site.auth_asym_id doesn't change but both the first value changes from ATOM to HETATM and _atom_site. label_asym_id changes

C) sometimes the metallic atoms are still called ATOM (even if they should be called HETATM) but we hope that _atom_site.label_asym_id does change

So I guess that the function could check all three values and, in the case we want to consider only option A and B the program could consider for each value of _atom_site.auth_asym_id the first ATOM as the beginning of the chain and the last ATOM as it's end ignoring all the HETATM. If we want to consider case C too the program could consider the first ATOM as the beginning and then consider the final atom the first one that makes _atom_site.label_asym_id change (of course hoping that at least that will change)

Could it make sense?

MauriceKarrenbrock commented 4 years ago

I checked how the pdb is parsed and it is done in a similar way to what I imaged. It both checks the chain id and if there are some TER lines, any time it finds a TER line it creates a new chain object but it allways keep reading the chain id from the pdb in order to get a chain with ATOMs and a different chain for HETATMs for any chain id, and any time the chain id changes it will create a new chain object with a new chain id (read on the pdb). And I guess that the function that adds the hydrogen and the oxygen at the first and last residue of a chain doesn't do it if that residue is a HETATM.

In this way you keep the right chain id when you write the structure back on a file but have all the needed atoms

this can be seen in https://github.com/openmm/openmm/blob/master/wrappers/python/simtk/openmm/app/internal/pdbstructure.py lines 339 - 350

    def _add_atom(self, atom):
        """
        """
        if len(self.chains) == 0:
            self._add_chain(Chain(atom.chain_id))
        # Create a new chain if the chain id has changed
        if self._current_chain.chain_id != atom.chain_id:
            self._add_chain(Chain(atom.chain_id))
        # Create a new chain after TER record, even if ID is the same
        elif self._current_chain.has_ter_record:
            self._add_chain(Chain(atom.chain_id))
        self._current_chain._add_atom(atom)

On mmCIFs there are no TER lines but we have _atom_site.label_asym_id that can do the same thing.

I tried to understand how the mmCIF reader works but probably because of my inexperience I didn't understand much of it, so I don't know how it manages the chain beginning and ending

MauriceKarrenbrock commented 4 years ago

I just noticed one thing that could be part of the problem: when you fix a mmCIF with pdbfixer and there are some HETATMs on the output mmCIF they all become ATOMs. So maybe a part of the problem is that the parser doesn't distinguish between atoms and hetero atoms

for example with 5aol this

HETATM 3230 F  F3  . UFV C 2 .   ? 93.580  92.520  -40.900 0.60 14.60 ? 1293 UFV A F3  1 
HETATM 3231 ZN ZN  . ZN  D 3 .   ? 96.551  70.205  -20.453 1.00 10.87 ? 1300 ZN  A ZN  1

becomes this

ATOM   1562  F3  UFV A1293      93.580  92.520 -40.900  0.00  0.00           F  
ATOM   1563 ZN    ZN A1300      96.551  70.205 -20.453  0.00  0.00          ZN

peastman commented 4 years ago

We try to be pretty flexible in how we parse PDB files, since those also tend to be nonstandard. It adds a chain break for either a TER record or a change in the chain ID. For a standard file those should happen together, but in practice you often get only one or the other.

Since PDBx/mmCIF doesn't have TER records, we have to rely on chain IDs. Which would be fine if only everyone agreed on which field to put them in! I'm thinking the most robust solution will probably be if we scan the values in both columns in advance and try to figure out which one really has chain IDs.

Inserting a chain break when it switches between ATOM and HETATM would be incorrect. If a protein contains a nonstandard amino acid that's considered a heterogen, but it's still part of the same chain, and most likely you'll want to replace it with a standard one.

MauriceKarrenbrock commented 4 years ago

We try to be pretty flexible in how we parse PDB files, since those also tend to be nonstandard. It adds a chain break for either a TER record or a change in the chain ID. For a standard file those should happen together, but in practice you often get only one or the other.

I don't know what the official pdb standard says but TER is often used to distinguish different molecules (that are not covalently bounded) inside the same chain (protein of chain A from organic ligand of chain A from metal from solvent). In fact, for example, Gromacs MD program raises an error if there is not a TER to separate the protein from any hetero atom inside the same chain. While chain id distinguishes between different polypeptidic chains including its hetero atoms.

Since PDBx/mmCIF doesn't have TER records, we have to rely on chain IDs. Which would be fine if only everyone agreed on which field to put them in! I'm thinking the most robust solution will probably be if we scan the values in both columns in advance and try to figure out which one really has chain IDs.

Usually the mmCIFs I met have the _atom_site.auth_asym_id = pdb chain id and the _atom_site.label_asym_id usually tells you when you go from a molecule to another like the TER often does in pdb. Then of course we all know that rules are often broken so if you would like to check which one is the chain and which one the TER you could look at which one changes more often, the chain id should change less often than the "TER" id (that distinguishes between molecules).

I think that if you change chain any time _atom_site.label_asym_id changes but read the chain name from _atom_site.auth_asym_id it should work because it would be very similar to the pdb function (that does work properly). Of course this would only work if _atom_site.label_asym_id changes more often than the other one, otherwise switch names. I think it should be given a shot.

In any case I know it's not easy to deal with this kind of files.

Ps should I open another issue for the fact that HETAMs like the organic ligand and metal ions are transformed in ATOM ?

peastman commented 4 years ago

The fix for this is in https://github.com/openmm/openmm/pull/2563. Following your suggestion, I made it add a break between chains when either of the two identifiers changes.

MauriceKarrenbrock commented 4 years ago

Thank you very much!

But I downloaded the nightly build with conda install -c omnia-dev openmm the 21th of February at 1.20 am (California time) and nothing changed from before.

import pdbfixer
import simtk.openmm.app

input_filename = '2gz7.cif'
output_filename = '2gz7_repaired.cif'

with open(input_filename, 'r') as f:
    fixer = pdbfixer.PDBFixer(pdbxfile = f)

fixer.findMissingResidues()
fixer.findNonstandardResidues()
fixer.replaceNonstandardResidues()
fixer.findMissingAtoms()
fixer.addMissingAtoms()

with open(output_filename, 'w') as f:
    simtk.openmm.app.pdbxfile.PDBxFile.writeFile(fixer.topology, fixer.positions, f, keepIds = True)

This are the 2 mmCIF files uploaded as txt: 2gz7_repaired.txt 2gz7.txt

As you can see the terminal oxygen in the last GLN residue is still missing, the _atom_site.label_asym_id has been overwritten by _atom_site.auth_asym_id (loosing any information about when a molecule ends and another one begins in the same chain), and all the HETATMs has become ATOMs.

It seems so strange to me that nothing changed, maybe hasn't the pull request been included in the last nightly build?

peastman commented 4 years ago

Try this:

from simtk.openmm import *
print(version.git_revision)

What does it print? That will tell us exactly which version you have installed.

MauriceKarrenbrock commented 4 years ago

e244991411e16a6f946adfac984de8fcbc2d84f7

peastman commented 4 years ago

That revision is from Oct. 31 of last year. Are you using Python 2.7? We dropped support for it several months ago. Current dev builds are only created for 3.6 and later.

MauriceKarrenbrock commented 4 years ago

Python 3.6.9 :: Anaconda, Inc.
[GCC 7.3.0]

conda 4.8.2

Ubuntu 18.04.4 LTS
Kernel: Linux 5.3.0-40-generic

Maybe there is some kind of date and time problem because I am not in the USA?

peastman commented 4 years ago

That seems unlikely. The list of files at https://anaconda.org/omnia-dev/openmm/files shows a Linux build for Python 3.6 was uploaded about 16.5 hours ago. If you use the command

conda install -c omnia-dev openmm

you should get it. You could also try which python and which conda to make sure it's really using the ones you think it is.

MauriceKarrenbrock commented 4 years ago

Hello, by removing my old conda environment and creating a new one from scratch it downloaded me the right version 36b6caa67e5f485023ec12d40e12f1700e19f618 and the terminal oxygen does appear. Thank you very very much and sorry if I bothered you!

PS

the _atom_site.label_asym_id has been overwritten by _atom_site.auth_asym_id (loosing any information about when a molecule ends and another one begins in the same chain), and all the HETATMs has become ATOMs

This part is still true but I will open a new issue for it, as it is a different problem

openmm / pdbfixer

Missing terminal oxygen after addMissingAtoms() #194