Add tools for single molecule property calculation/storage based on matching reference simulation data

bmanubay commented 8 years ago

Hey guys,

Looking specifically at the single-molecule-property-generation directory, I've added my script for reading the amber topology files and creating the empty pandas dataframe around the unique bond length, angles and torsions I pull from those topologies. That all works fine. I'm currently working on editing Michael's calculation script to correctly interface with the dataframe.

All of the code for calculations of properties and their uncertainties have been added, I'm just working on getting the iteration in the get_property_from_trajectory function in trajectory_parser.py working. I wanted to use the molecule names (they're strings) as the row indices for iteration to avoid having to use some meaningless numbered index, but pandas doesn't do positional indexing with strings. Currently working on that.

bmanubay commented 8 years ago

@davidlmobley Hey David, do you happen to have a code snippet for creating the OEMol structure from a SMILES string or IUPAC? I want to try to get the single molecule property calculation procedure updated to phase out the prmtops like we discussed in the meeting earlier today.

davidlmobley commented 8 years ago

Sure, @bmanubay . I was going to get you a full worked example of starting a simulation from a SMIRFF ffxml file and an OEMol tomorrow, but I can get you that code snippet right now:

from openeye import oechem
smiles = 'cccc'
mol = oechem.OEMol()
status = oechem.OEParseSmiles(mol, smiles)
if not status:
   print "Error"

For IUPAC, it would be:

from openeye import oechem
from openeye import oeiupac
mol = oechem.OEMol()
name ="phenol"
oeiupac.OEParseIUPACName(mol, name)
#Bonus: Create SMILES from name
smiles = OECreateIsoSmiString(name)

Note that those are what you asked for, but you don't actually get coordinates in your molecule if you do that. :) So you'd presumably want to do that after you have an OEMol. Easiest is just to use openmoltools.openeye.generate_conformers with maxconfs = 1, which is a wrapper for OpenEye's OMEGA conformer generation functionality (https://github.com/choderalab/openmoltools/blob/master/openmoltools/openeye.py#L187)

For an example of how this would be used in an actual workflow, note that loading a forcefield from FFXML works like this (using smarty):

from smarty import *
from smarty.forcefield import *
forcefield = ForceField(get_data_filename('forcefield/Frosst_AlkEtOH_parmAtFrosst.ffxml'))

and then creating an OpenMM System from an OEMol and forcefield is as simple as this:

topology = generateTopologyFromOEMol(mol)
system = forcefield.createSystem(topology, [mol])

So, you'd just use the code above to generate an OEMol from whatever starting point you want, generate conformers for it, pull the Frosst_AlkEtOH_parmAtFrosst.ffxml from smarty and load it, then create a topology using the tool from Smarty and then create a system from your topology and OEMol. Done.

You may also need to copy in starting coordinates from your OEMol, i.e. as in https://github.com/open-forcefield-group/smarty/blob/master/smarty/forcefield_utils.py#L78.

That gets you to the same point (having an OpenMM system with parameters and coordinates) that you would have been if you had created an OpenMM system from an AMBER prmtop and crd file, and will be the primary way we are doing things going forward. Basically, this makes the AMBER formats obsolete for our purposes, as we now have a new SMIRFF force field format which is superior.

Let me know whether you still want the full worked example tomorrow.

(If you want more examples relating to this, you might want to see my code comparing energies for the AlkEthOH set obtained from AMBER prmtop and crd files (the old way of doing things) with those obtained starting fro OEMol's and the SMIRFF FFXML file, here: https://github.com/open-forcefield-group/smarty/blob/master/examples/SMIRFF_comparison/compare_molecule_energies.py)

Another side note: At the moment we've typically not been creating OEMols from names, though; for AlkEthOH we're just reading them from pre-existing mol2 files using something like:

mol = oechem.OEGraphMol()
ifs = oechem.oemolistream(mol_filename)
flavor = oechem.OEIFlavor_Generic_Default | oechem.OEIFlavor_MOL2_Default | oechem.OEIFlavor_MOL2_Forcefield
ifs.SetFlavor( oechem.OEFormat_MOL2, flavor)
oechem.OEReadMolecule(ifs, mol )
oechem.OETriposAtomNames(mol)

(The "flavor" stuff is there because the OpenEye readers assume TRIPOS atom types and these use parm@frosst atom types so we have to tell the reader to NOT try to process it as TRIPOS atom types but instead to later assign tripos types via OETriposAtomNames)

I can't guarantee you will get the same atom numbering if you create one of the AlkEthOH molecules from its name/SMILES as you would get from the existing file in AlkEthOH, though you probably will since the tools used are the same.

(Probably also relevant to @mrshirts )

mrshirts commented 8 years ago

Thanks for the examples. @bmanubay can fill in more details, but we really only need to make sure the atom ordering the same as the output the OpenMM simulation provides -- the .nc file provides the coordinates. If the process for AlkEthOH is just from .mol2, then we'll probably want to start from that one.

Let me know whether you still want the full worked example tomorrow.

We definitely want that one. Bryce can say if he needs it tomorrow or later.

davidlmobley commented 8 years ago

@mrshirts - he wrote me on Slack that he thought this was enough. @bmanubay , preferences?

mrshirts commented 8 years ago

I'd say the full worked example will be good soon (in case there are other unanticipated things that pop up), but this should be plenty for now.

davidlmobley commented 8 years ago

@mrshirts @bmanubay - your example is here: https://github.com/open-forcefield-group/smarty/pull/71

davidlmobley commented 8 years ago

(Also it is now in the main smarty under examples/SMIRFF_simulation)

mrshirts commented 8 years ago

Thanks!

davidlmobley commented 8 years ago

@bmanubay - I'm doing some review of this, though I know you're still working on it. One thing I think would be helpful is to remove the "API changes to the README.md" stuff from your pull request; our current API is https://github.com/open-forcefield-group/open-forcefield-tools/blob/master/README.md and if there are (minor) changes to be made to that, it needs to come in as a separate PR with relevant discussion. These changes you have here are, I assume, here accidentally as they are the ones John rejected in #15 .

But in any case, aside from that -- I can see you have a structure here that is basically working for pulling bond, angle, torsional information from NC files and putting it into a dataframe, though I don't totally understand the data structure. It looks like maybe you're reading an AMBER parameter file to identify all bonds, angles, and torsions, then using that to drive creation of a dataframe, and then from there you process the trajectory to pull data on all those bonds, angles, and torsions? And it looks like so far you only have, for torsions, the first term in the Fourier expansion?

Because it's crucial that we advance to using these types of simulations to make updates to parameters in the immediate future, likely before John has the property calculation framework in place (since that's taking so long), it's probably important that we push ahead with this type of thing as an interim framework to just get us sampling over parameters before the full API is implemented.

So, if I were to try to turn this into something I could use, here's how I think I would do it:

Don't start with AMBER parameter/coordinate files, since these are going away. Start with an OpenMM Topology, which has atoms and connectivity. Use this information to populate your bonds, angles, torsions. (Note that you can GET an OpenMM topology from AMBER parameter/coordinate files via using ParmEd to read them and create an OpenMM system, but you can also get one from an OEMol, i.e. https://github.com/open-forcefield-group/smarty/blob/master/smarty/forcefield.py#L82)
Allow your tool to optionally select only specific bonds, angles, torsions for measurement via SMIRKS patterns; these are how we're assigning parameters now (see https://github.com/open-forcefield-group/smarty/blob/master/smarty/forcefield.py#L221 for code which finds atoms/bonds/angles/torsions to apply parameters to) so I might ask it to measure all bonds corresponding to a specific SMIRKS, for example.
Update documentation strings/usage examples a lot so it's clear exactly what goes in where in what format and how it comes out

Thanks. Please ask questions tagging me if you need guidance/help.

bmanubay commented 8 years ago

Got it, thanks @davidlmobley ! That getSMIRKSMatches function I think is exactly what I need for ensuring the continuity of the tuples that make up the atom indexing for a particular bond/angle/torsion/whatever! Thanks!

I've actually been trying to look through the smarty documentation to find an efficient way to change parameters (from a python script) in order to get working on the uncertainty estimates for bond moves (https://github.com/open-forcefield-group/open-forcefield-tools/issues/19#event-730664993). Might that be more of a @jchodera question?

davidlmobley commented 8 years ago

I've actually been trying to look through the smarty documentation to find an efficient way to change parameters (from a python script) in order to get working on the uncertainty estimates for bond moves (#19 (comment)). Might that be more of a @jchodera question?

This does not exist yet. Right now one would have to work directly on the XML.

We will, I think, need a Python interface for working with these as working with the XML itself seems to me to be an awfully silly way of doing it. This would also solve some other problems. So, you may want to comment on what we need: https://github.com/open-forcefield-group/smarty/issues/68

(I'm also open to contributions to the interface itself.)

davidlmobley commented 8 years ago

For the time being though, @bmanubay , for testing purposes, you should just manipulate the XML files.

bmanubay commented 8 years ago

Great! Thanks for the information David! I guess it's not so elegant, but I can certainly make that work.

davidlmobley commented 8 years ago

@bmanubay - please do comment on the issue of what interface we would like for manipulating these, as we do need to make something better than operating on the XML, so now (before we start making something) is the best time to do so.

bmanubay commented 8 years ago

Will do!

bmanubay commented 8 years ago

Hey @davidlmobley, quick question/issue with using one of the classes in the forcefield.py file in smarty.

If I were to do the following, I still can't seem to access the functions in the _Topology class (where the getSMIRKSMatches and _identifyMolecules functions are):

import glob
from smarty import *

mol_files = glob.glob('./Mol2_files/AlkEthOH_*.mol2')
molnames = []

for i in mol_files:
    molname = i.replace(' ', '')[:-5]
    molname = molname.replace(' ' ,'')[13:]
    molnames.append(molname)

OEMols=[]
for i in mol_files:
    mol = oechem.OEGraphMol()
    ifs = oechem.oemolistream(i)
    flavor = oechem.OEIFlavor_Generic_Default | oechem.OEIFlavor_MOL2_Default | oechem.OEIFlavor_MOL2_Forcefield
    ifs.SetFlavor(oechem.OEFormat_MOL2, flavor)
    oechem.OEReadMolecule(ifs, mol)
    oechem.OETriposAtomNames(mol)
    OEMols.append(mol)

tops = []
for i in OEMols:
    top = generateTopologyFromOEMol(i)
    tops.append(top)

molecules = []
for i in tops:
    a = _Topology(i)
    b = a._identifyMolecules()
    molecules.append(b)

The error stack coming out is:

Traceback (most recent call last):
  File "read_top.py", line 31, in <module>
    a = _Topology(i)
NameError: name '_Topology' is not defined

I want the atom mapping from the topologies I generated and I think this is the function for doing so (really, I think it should do so automatically from the __init__() function, if I'm reading correctly) . Those last few lines would be the right way to access a function inside a class, right? I checked just using the python help function and _Topology wasn't one of the listed classes, so I'm wondering if there's something about the structure of smarty module that I'm misunderstanding? Is there some specific way I need to import the class?

This is probably a pretty simple issue I imagine, but my understanding of Python constructs isn't as good as it should be it seems.

davidlmobley commented 8 years ago

@bmanubay - I'm not actually sure what precisely you're trying to do here (for example, identifyMolecules returns nothing, so using it as you're attempting to won't get you anywhere; and it looks like you're trying to just end up with a set of oemols, but you started off with a set of oemols, so I'm confused), but you can resolve your specific error by adding from smarty.forcefield import _Topology; private member functions/classes/etc don't get imported by import * so you have to ask for them specifically.

Please note that I just added a "forcefield_labeler" module which provides ForceField_labeler class structure mirroring ForceField, but which has labeler.labelMolecules which provides access to which atoms and which parameters are used in a set of specified molecules (oemols). It's possible that that functionality may be some of what you're after? There is now an example in smarty/examples/label_molecule on using this to access what atom indices get what force terms.

In your case, if you're just trying to figure out what bonds/angles/torsions/etc. are in a specified set of molecules and covered by FF parameters, you should be able to just apply the labeler directly to your list of molecules (given a forcefield xml file) and get back a list of these terms. See the example noted above.

bmanubay commented 8 years ago

Alright understood! To answer your question I just want the atom index tuples that correspond to the unique bonds, angles and torsions. That's it.

If I understand correctly, connectivity info is contained in the OEMol I just don't know how to access it. I'll look through your example though; thank you!!

davidlmobley commented 8 years ago

@bmanubay :

Alright understood! To answer your question I just want the atom index tuples that correspond to the unique bonds, angles and torsions. That's it.

OK, if you're lazy (that is, if you want to solve this without writing new code), then my labeler will do that for you, so you may want to look there.

If I understand correctly, connectivity info is contained in the OEMol I just don't know how to access it. I'll look through your example though; thank you!!

This is actually a "better" way to solve the problem, since the labeler above requires you to provide a forcefield, and is giving you info about what parameters would be applied to those bonds, angles, and torsions.

Note that traversing the bonds in an OEMol is done via:

for bond in molecule.GetBonds():
    start = bond.GetBgn() #Start atom
    end = bond.GetEnd() #End atom
    #Names are start.GetName(), end.GetName()
    #Probably numbers are start.GetNumber() or start.GetID() or something along those lines, see docs.

angles presumably you get by looping over the neighbors of the start or end atom, and similarly with torsions. The OEChem documentation should describe how to do this; I've not done it myself. It may be a useful exercise.

bmanubay commented 8 years ago

Noted! Thanks again! I'll take a closer look through the the OEChem docs to get a better idea of the functionality.

davidlmobley commented 8 years ago

In my opinion the OE tools are well worth your time. There's a learning curve for sure, but it makes working with molecules VASTLY easier.

davidlmobley commented 7 years ago

@bmanubay - if you want to get this merged, can you do some cleanup of the pull request itself to explain what is being done here, summarize changes, add documentation, etc.? Alternatively, bring in the updated property calculation document via a separate PR.

(In general, we should be trying to do pull requests which resolve one specific issue which is either (a) already discussed in an issue in the issue tracker, or (b) clearly laid out in the initial comment on the pull request. This one is neither, and it's not clear from what's here on GitHub what all exactly it's bringing in.)

(A concrete example of the type of thing one might want a pull request to look like: https://github.com/open-forcefield-group/smarty/pull/134 -- it references an existing issue, lays out a checklist of what will be implemented to resolve the issue, checks off the items as implemented, etc.)

bmanubay commented 7 years ago

@davidlmobley

Sure I'll get this cleaned up! Sorry for some of the less than adequate commit messages. A lot of the file updates are just new simulation trajectories that I've been using in developing the reweighting procedure. I'll get some better documentation added on and put it up.

davidlmobley commented 7 years ago

@bmanubay @mrshirts - I thought we said we were going to move towards getting this merged and then bringing in changes in a well-documented, one-piece-at-a-time manner? Right now this is a single PR with 181 commits, most of which are not explained and have commit summaries that are incomprehensible to anyone else (descriptions like "increase amp", "updates", "some empty bins with 100 bins", etc.), and it's not explained anywhere what the pull request actually does in total. I have some idea from our recent discussions, but that's not visible to anyone else here on GitHub.

We really need to move towards a model where: 1) New things which are going to be done are described/discussed in issues 2) Pull requests clearly explain what they do, and tie it to (and reference!) any issue or issues they resolve 3) Commits have a clear though brief summary of what was actually done

The idea should be that someone else on here can review pull requests that come in, using solely the info available on GitHub to assist them to decide whether the pull request is appropriate, coded well, and deals with the issues it says it dealt with. (The idea is to use code review, by others, of pull requests to help make sure everything deposited is functional, well documented, extensible, reusable, etc.)

We're also trying to use "[WIP]" pull requests as a mechanism to have people comment on progress towards resolving a specific issue or issues; having them stay open for months without it being clear what is being implemented is not a good way to do this.

(I also get a notification every time a new commit is pushed to this branch because of this PR, so I get lots of notifications but it's unclear if I should be reviewing anything.)

Sorry if this seemed like a rant. It's not intended as a rant; I'm just trying to help you understand how I was expecting this to work, versus how it is actually working. :)

bmanubay commented 7 years ago

@davidlmobley I'll take care of it David! On my todos for the week. Sorry this turned into a mess!

mrshirts commented 7 years ago

I think one of the solutions for Bryce to be 1) committing frequently to a "testing" branch in his own repository, but 2) then having a branch that only contains the successful tests, and 3) only committing that branch into the master when features that other people will use are finalized.

The "testing" branch might have a bunch of very short commits, but when anything was committed back into master, they would be condensed all together with one notification.

davidlmobley commented 7 years ago

@mrshirts - I'm not as worried about the issue of the number of commits as much as having them come in an organized manner where it's explained what they are doing. For example, a PR with 100 commits in it is still easy to review if they come in in sensible blocks where it is explained what is being done and why. (This is easy to achieve, even with very frequent commits locally, if the commits are only pushed when logical blocks are finished; this tends to be what I do when I'm developing since I commit a lot in case I screw something up and need to backtrack, but I tend not to push until I've finished a "unit of work" by some milestone (usually checking off an item in a checklist I have up in my "[WIP]" PR.)

But yes, having a testing branch may be a good idea too, since that way, he can commit all he wants and then only move things over to his branch he's going to do pull requests from when they actually work.

bmanubay commented 7 years ago

These are both good ideas. I'm going to do some cleaning up right now and then tag one of you guys to merge the PR (since I don't think I can do it myself). Thanks!

davidlmobley commented 7 years ago

Sounds good, @bmanubay .

bmanubay commented 7 years ago

@davidlmobley All cleaned up and ready to go!

Summary of major changes/additions: 1) manipulateparmaeters.py : The script that I've been using to carry out parameter reweighting analysis. Carries out reweighting with MBAR from a prescribed state and compares reweighted observables (like average bond length or its variance) to values from full MD simulations. Results from this are less critical now that we're considering surrogate modeling as the major efficiency booster in the posterior sampling process. Still useful for getting accurate local approximations of observables in order to construct better surrogates for MCMC sampling. Still a work in progress, but we have some rough idea of how far in parameter space you can jump and still have accurate observable estimates (at least for bonds and angles).

2) least_squares_fit_example.py: Script I'm using to develop fits for torsion probability histograms. The fitting form is currently a fourier series approximation with sin^2 terms. Parameters for the fit are fourier coefficients A_i, periodicities P_i and phase angles phi_i. Still a work in progress, but fourier series fits seem like a promising avenue for describing torsion distributions.

3) construct_posterior.py: Where I'm developing the posterior sampling process for bonds, angles and torsions parameterization. Still in early infancy and working on fitting data to surrogate models in the meantime. This mainly represents my efforts in writing an MCMC algorithm and convincing myself of the process outline. Integration with MBAR and MCMC with a surrogate model to come soon and will appear on my fork.

davidlmobley commented 7 years ago

Thanks, @bmanubay . Since reviewing the change log in this PR is not particularly practical -- have you provided README.md files explaining your directory structure (e.g., subdirectories having README.md files explaining what's in them, and the overall README.md explaining what the subdirectories are) AND have you put the summaries you noted above into a README.md file somewhere?

(I can dig for this, obviously, but asking is a starting point...)

bmanubay commented 7 years ago

@davidlmobley Let me go ahead and update them with a little more detail

bmanubay commented 7 years ago

Alright, all updated!

davidlmobley commented 7 years ago

OK, thanks, @bmanubay . Let's just go ahead and merge and I'll complain at you later if things are unclear/need improving. Thanks!

openforcefield / open-forcefield-tools

Add tools for single molecule property calculation/storage based on matching reference simulation data #18