Proposals for handling residue replacement

jchodera commented 10 years ago

This thread breaks out the discussion from #31 into a separate thread about how we should handle residue replacement.

I would like to propose we work toward having pdbfixer use three possible modes for whether and how it replaces residues and fills in missing atoms:

Mode 1: Conservative operation / no substitutions. Any missing atoms should be added, but no residues should be replaced with different residues. The chemical component dictionary should be consulted for how to fill in missing residues based on residue names.
Mode 2: Aggressive substitute with standard amino acids. Any residues that are not non-standard amino or nucleic acids should be replaced with standard ones using an internal substitution table. This is how things work now if we ask for automatic replacements. This is very helpful in making every PDB file computable for large-scale surveys.
Mode 3: ffxml-directed substitution. An ffxml file is used to specify which residue types are available, and substitutions use a graph-match algorithm to find the most similar residue in the ffxml file (using connectivity and element names) to substitute. This is very helpful if we want to represent the system as realistically as possible within the forcefield we're using, which may include things like phosphoserine/tyrosine residues for kinases, etc.

All of these modes are slight variations of the current pdbfixer behavior, so I think it is best to build these into pdbfixer, rather than making separate tools for these.

jchodera commented 10 years ago

Any comments on this proposal?

peastman commented 10 years ago

Unless I'm missing something, modes 1 and 2 are already supported. They correspond to clicking "Select None" or "Select All", respectively, on the page for replacing residues.

In the case of mode 1, adding missing atoms to nonstandard residues is impossible. They're not standard residues. We don't know what atoms they're supposed to contain.

So mode 3 is the main new thing. It sounds very interesting, though I don't know how well the graph matching would work. Especially if the nonstandard residue also happens to be missing some atoms. And I'm not really clear what use cases this would be valuable for. If you want to process a whole lot of files in an automated way, and you don't care about any one of them being as realistic as possible, mode 2 is probably fine. On the other hand, if you really care about making each model realistic, you probably want to select the substitutions by hand rather than trusting it to decide automatically.

jchodera commented 10 years ago

In the case of mode 1, adding missing atoms to nonstandard residues is impossible. They're not standard residues. We don't know what atoms they're supposed to contain.

From my proposal:

"The chemical component dictionary should be consulted for how to fill in missing residues based on residue names."

jchodera commented 10 years ago

On the other hand, if you really care about making each model realistic, you probably want to select the substitutions by hand rather than trusting it to decide automatically.

"by hand" is not something that makes any sense in the context of tools meant for automated, large-scale pipelining. If you wanted to do everything "by hand", you would be re-refining atoms from electron density, not running pdbfixer.

To give you an example from our workflow: We want to model ~500 human kinases from crystal structures. Some of these kinases may have post-translational modifications, such as phosphoserines, phosphothreonines, and phosphotyrosines. Suppose we have an ffxml forcefield specification that contains these residues. We would want pdbfixer to work its magic, but not replace phosphorylated residues with their nonphosphorylated forms. We could build our own substitution table, but it would be optimal if we didn't have to, and instead just passed an ffxml file with both residue forms.

We could require ffxml residue definitions contain the name of the corresponding residue from the PDB Chemical Component Dictionary. If we did this, it would make it easier for missing atoms of the appropriate residues to be filled in.

peastman commented 10 years ago

We could build our own substitution table, but it would be optimal if we didn't have to, and instead just passed an ffxml file with both residue forms.

I don't think you'd be satisfied with the results, if you tried to do this by graph matching. What you just described is a much simpler problem: just a list of particular residues you don't want it to replace. That's very different from having some unrelated, nonstandard residue, possibly with some missing heavy atoms, and then it gets replaced by a phosphoserine because the graph matching algorithm just happened to decide that was the closest match to the atoms that happened to be present.

Let me offer an alternate proposal: you can register new residue types, providing all the information it needs about them. Once you do that, it will treat them just like standard residues. It won't try to replace them with something else, and it will know how to add missing atoms to them.

jchodera commented 10 years ago

Let me offer an alternate proposal: you can register new residue types, providing all the information it needs about them. Once you do that, it will treat them just like standard residues. It won't try to replace them with something else, and it will know how to add missing atoms to them.

Couldn't we have it process an ffxml file to register these new residue types?

peastman commented 10 years ago

The information it needs is different from what's in a force field. It needs a PDB file (or equivalent information) giving a reference structure for each one. Take a look in the "templates" directory.

jchodera commented 10 years ago

Once again, the Chemical Components Dictionary provides this information if the residue appears in the PDB. If it's a small molecule, parameterization engines like gaff2xml also need coordinates to assign parameters (specifically, charges), so coordinates would presumably be available.

If coordinates are not available, it is possible to generate them using the forcefield terms (bonds, angles, torsions), though this is a bit of a dicier proposition. It shouldn't be a huge problem to do something sensible from bonds, angles, and torsions using OpenMM, however.

peastman commented 9 years ago

I'm looking into how we might be able to do this. In particular, if it's processing a file and comes across a residue it doesn't have a template for, it would automatically download one and use that.

Ligand Expo (http://ligand-expo.rcsb.org/ld-download.html) has structures for any residue you're ever likely to come across. More precisely, they catalog every occurrence of each one in the PDB and let you download them independently. So we could use that to get coordinates and bond information. A potential problem is that a given occurrence won't necessarily have all the atoms. But they do generally seem to give the chemical formula, so we could just start at the top of the list of files and keep loading them until we find one with all atoms.

Another problem is that, while the files will give us all the bonds within the residue, they don't give the bonds between residues. But we can probably figure those out. If something occurs in a chain with amino acids, and if it includes atoms called "N" and "C", we can just assume those should be bonded to the adjacent residues. Likewise for DNA and RNA.

There's one problem I haven't figured out a solution to though: most (possibly all?) of these files include only heavy atoms. How can we identify where the hydrogens should go?

jchodera commented 9 years ago

Ligand Expo is just a browser to view the contents of the Chemical Components Dictionary I mentioned previously in this thread. I think it's a great idea to provide a mode of operation for pdbfixer where missing atoms are built in using coordinates/atoms from this CCD.

The big problem, as you note, is that we don't how how polymeric residues are supposed to be connected to each other. I think that's a big shortcoming, and something we might push the PDB to rectify. The best short-term workaround is for us to provide a "known" list of polymers that behave the same way, as you suggest, like DNA, RNA, and amino acids (natural and post-translational modified or unnatural).

For hydrogens, we can use tools like the OpenEye Toolkit to protonate the CCD database once and for all (they have a specific academic license for putting the results in the public domain), but this would protonate all sites. We'd still need to determine how to remove the protons that end up on atoms that are bound in polymers.

As I suggested before, I think we really want to support three overall modes of operation:

Mode 1: Conservative operation / no substitutions. Any missing atoms should be added, but no residues should be replaced with different residues. The chemical component dictionary should be consulted for how to fill in missing residues based on residue names.
Mode 2: Aggressive substitute with standard amino acids. Any residues that are not non-standard amino or nucleic acids should be replaced with standard ones using an internal substitution table. This is how things work now if we ask for automatic replacements. This is very helpful in making every PDB file computable for large-scale surveys.
Mode 3: ffxml-directed substitution. An ffxml file is used to specify which residue types are available, and substitutions use a graph-match algorithm to find the most similar residue in the ffxml file (using connectivity and element names) to substitute. This is very helpful if we want to represent the system as realistically as possible within the forcefield we're using, which may include things like phosphoserine/tyrosine residues for kinases, etc.

cxhernandez commented 7 years ago

There's one problem I haven't figured out a solution to though: most (possibly all?) of these files include only heavy atoms. How can we identify where the hydrogens should go?

Would something like Molprobity work?

jchodera commented 7 years ago

I think the LigandExpo chemical components dictionary actually does include hydrogens, so this may not be a problem. The "reduce" program uses this information.

cxhernandez commented 7 years ago

Ah, well then, +1 from me.

jchodera commented 7 years ago

Which functionality is most critical for you right now? The PDB functionality, or the forcefield template functionality?

cxhernandez commented 7 years ago

The forcefield templating would probably be most desirable. There's already mutation functionality, but adding PTMs would be awesome.

jchodera commented 7 years ago

Which forcefield(s) are you thinking of at the moment, and any PDB examples that would be good tests for a first pass at this feature?

cxhernandez commented 7 years ago

99SBILDN, since a lot of the simulations I've already run have been with that FF, and most likely FB15 for comparison.

So far, I've been working with the acetylated peptide in this PDB entry— so I have both the WT and acetylated peptide that could be used for testing.

openmm / pdbfixer

Proposals for handling residue replacement #44