Decide API for equilibrium expectation datasets

davidlmobley commented 8 years ago

We need to settle how to handle non-ThermoML datasets within the PropertyCalculator framework, most immediately for calculation of equilibrium expectation values of other properties.

As noted here ( https://github.com/open-forcefield-group/open-forcefield-tools/pull/3#issuecomment-227601790 ) the current draft API (https://github.com/jchodera/open-forcefield-tools/blob/master/README.md) being brought in via #3 envisions PhysicalPropertyMeasurement objects compiled into a PhysicalPropertyDataset. One specific example is given for ThermoMLDataset as a derivative of a PhysicalPropertyDataset, and that's certainly one of the key datasets we need in the short term.

One of the other things we plan in the short term is to run equilibrium MD simulations of the AlkEthOH set in the gas phase using parm@frosst parameters and compute equilibrium expectations - energies (total as well as possibly divided into components from individual bonds, angles, and torsions) and bond, angle, and torsional distributions. We would then use this (or subsets thereof) as a dataset for parameterization to see if we can recover the parameter set. So, we need to decide how this would be framed in terms of a dataset - either a PhysicalPropertyDataset or something else.

It seems like the description of the physical system we're interested in could fit within the existing framework of a PhysicalPropertyMeasurement, i.e.:

A Mixture specifying the substance that the measurement was performed on: Here, this would be a single component mixture
A ThermodynamicState specifying the thermodynamic conditions under which the measurement was performed: Here this would specify gas phase and the particular temperature.
A PhysicalProperty is the physical property that was measured: Here we would have multiple such measurements and the only significant change would arise here. Examples below for my case.
A MeasurementMethod specifying the kind of measurement that was performed: Here this would be EquilibriumMDSimulation as the source of our equilibrium expectations.

As noted elsewhere, each PhysicalPropertyMeasurement has several properties:

.substance - the Mixture for which the measurement was made
.thermodynamic_state - the ThermodynamicState
.value - the unit-bearing measurement value (it is not explicitly stated in the API docs, but I assume this would be, for example, a PhysicalProperty such as ExcessMolarEnthalpy, @jchodera?)
.uncertainty - the standard uncertainty of the measurement
.reference - the literature reference (if present) for the measurement
.DOI - the literature reference DOI (if available) for the measurement
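
To make the attribute list concrete, here is a minimal sketch of what such a measurement record could look like. It is not the draft API itself: the Quantity stand-in, the plain string used for the substance, the dict used for the thermodynamic state, and all numbers are placeholders I am assuming purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Quantity:
    """Hypothetical stand-in for a unit-bearing value."""
    magnitude: float
    unit: str


@dataclass
class PhysicalPropertyMeasurement:
    """Illustrative container mirroring the attributes listed above."""
    substance: str              # stand-in for a Mixture
    thermodynamic_state: dict   # stand-in for a ThermodynamicState
    value: Quantity             # unit-bearing measurement value
    uncertainty: Quantity       # standard uncertainty of the measurement
    reference: Optional[str] = None  # literature reference, if present
    DOI: Optional[str] = None        # literature reference DOI, if available


# Placeholder numbers, not real data.
measurement = PhysicalPropertyMeasurement(
    substance="phenol",
    thermodynamic_state={"phase": "gas", "temperature_kelvin": 300.0},
    value=Quantity(1.0, "kJ/mol"),
    uncertainty=Quantity(0.1, "kJ/mol"),
)
print(measurement.value.magnitude, measurement.value.unit)
```

For simulation-derived reference data, reference and DOI would simply stay unset.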

If we were to extend the API to allow equilibrium expectations from the envisioned gas phase MD simulations, our crucial need would be for suitable PhysicalProperty types. Probably we would need something like:

GasPhaseSingleMoleculeAverageEnergy
GasPhaseSingleMoleculeBondDistance
GasPhaseSingleMoleculeBondAngle
GasPhaseSingleMoleculeTorsionAngle

These would look in the Mixture for a definition of which molecule is to be examined (for these properties the Mixture must be single-component, since the data will be for that single component in the gas phase). I think that for GasPhaseSingleMoleculeAverageEnergy, no additional info is needed beyond what's already provided by PhysicalPropertyMeasurement. However, for the bonded properties we would need some additional info:

.smarts: SMARTS pattern uniquely identifying (within the specified molecule) which bond, angle, or torsion is being examined
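
As a rough sketch of how the extra .smarts attribute might be carried and sanity-checked, the snippet below pairs a pattern with the kind of bonded term it selects. The class name BondedTermSelector and the bracket-counting check are assumptions for illustration only. The check is deliberately crude: it only counts bracketed atoms, does not handle recursive SMARTS, and does not verify uniqueness within the molecule, which would need a cheminformatics toolkit.

```python
import re
from dataclasses import dataclass

# Number of atoms that define each kind of bonded term.
ATOMS_PER_TERM = {"bond": 2, "angle": 3, "torsion": 4}


def count_bracket_atoms(smarts: str) -> int:
    """Count bracketed atom expressions such as [#6] (crude; no recursive SMARTS)."""
    return len(re.findall(r"\[[^\]]+\]", smarts))


@dataclass
class BondedTermSelector:
    """Hypothetical holder for the extra info a bonded property would need."""
    kind: str    # "bond", "angle", or "torsion"
    smarts: str  # pattern identifying the term within the molecule

    def __post_init__(self):
        expected = ATOMS_PER_TERM[self.kind]
        found = count_bracket_atoms(self.smarts)
        if found != expected:
            raise ValueError(
                f"{self.kind} needs {expected} atoms, pattern has {found}"
            )


selector = BondedTermSelector(kind="bond", smarts="[#6]~[#8]")
print(selector)
```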

Any other needed properties?

A usage example would be:

dataset = GasPhaseSingleMoleculeAverageEnergy('/Users/dmobley/phenol_gasphase.cdf4')
dataset += GasPhaseSingleMoleculeBondDistance('/Users/dmobley/phenol_gasphase.cdf4', '[#6]~[#8]')  # Analyze C-O bond length
# Compute physical properties for these measurements
estimator = PropertyEstimator(nworkers=1)
computed_properties = estimator.computeProperties(dataset, parameters)
# Write out statistics about errors in computed properties
for (computed, measured) in zip(computed_properties, dataset):
    property_unit = measured.value.unit
    # Label each row with the property type (the class name is used here as an assumed label)
    print('%24s : parm@frosst value %8.3f (%.3f) | calculated new value %8.3f (%.3f) %s' % (
        type(measured).__name__,
        measured.value / property_unit, measured.uncertainty / property_unit,
        computed.value / property_unit, computed.uncertainty / property_unit,
        str(property_unit)))

I'm ambivalent at this point whether GasPhaseSingleMoleculeAverageEnergy and relatives would actually analyze stored trajectory data (as I have it currently written above) or whether these would read results of analysis done separately via another tool. Input, @jchodera or others?
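
If these classes did analyze stored trajectory data directly, the core of the analysis could be as small as the sketch below: compute a per-frame bond distance and report its mean with a standard error. The frame layout (a list of per-atom coordinate tuples) and the function names are assumptions for illustration. Note that the naive standard error treats frames as uncorrelated, which MD frames generally are not, so a real analysis would need to account for correlation time.

```python
import math
import statistics


def bond_length_series(frames, i, j):
    """Distance between atoms i and j in every frame."""
    return [math.dist(frame[i], frame[j]) for frame in frames]


def expectation_with_uncertainty(samples):
    """Sample mean and naive standard error (assumes uncorrelated samples)."""
    mean = statistics.fmean(samples)
    stderr = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, stderr


# Toy frames with two atoms each; placeholder coordinates, not real data.
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0)],
]
distances = bond_length_series(frames, 0, 1)
mean, stderr = expectation_with_uncertainty(distances)
print(f"mean = {mean:.3f}, stderr = {stderr:.3f}")
```

The alternative (reading results of a separate analysis tool) would replace bond_length_series with a file reader and leave the rest unchanged.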

In all likelihood I've missed some key aspects, but does that look feasible?

davidlmobley commented 8 years ago

@mrshirts - this discussion should probably go into a separate issue. I think what we're talking about right now is how to take reference isolated-molecule data that we have already generated, or might generate, outside the property estimator framework and get it into our framework for use in fitting, which is a separate question. Maybe you want to start a new issue for this, and you and I can sort it out (John doesn't need to be involved)?

To clarify, there are two types of isolated-molecule data here: 1) data we generate as reference data for fitting to, which might or might not come through the property estimator framework; 2) data we will generate with our property estimator framework to use as part of fitting to the reference data in item 1.

The API being discussed here concerns item 2, and I think you're asking about item 1.

My group currently has some "one-time" reference data along the lines of item 1, where we went ahead and ran isolated molecule gas phase simulations of AlkEthOH molecules with parm@frosst parameters beginning from AMBER .prmtop and .crd files. We could extract expectations from this and easily put it into this format:

mean_potential = MeanPotentialEnergy(molecule, thermodynamic_state, value=124.4*unit.kilojoules_per_mole, uncertainty=14.5*unit.kilojoules_per_mole)
bond_average = BondMoment(molecule, thermodynamic_state, value=1.52*unit.angstroms, uncertainty=0.02*unit.angstroms, moment=1, smirks='[#6:1]-[#6:2]')

I know you've been working on processing that data by atom number; I could easily generate the SMIRKS patterns which would allow us to link the atom numbers (in this one case) to SMIRKS strings. Alternatively, we could just re-run the gas phase calculations using the property estimator framework when it is online, using parm@frosst parameters as input.

To get back to this issue:

There will be no amber prmtops anymore

Aside from the reference data from Item 1 above, we'll be parameterizing molecules henceforth using the SMIRFF XML format and going directly into OpenMM, so we're no longer going to be using AMBER file formats.

OK, I'm not quite seeing the workflow anymore for generating the data for single molecules from the simulations. The .nc (or .dcd) generated by OpenMM are processed, and how are we storing each of the individual properties computed? What sort of dictionary will we use to map between this data (that is recorded by atom number) and the SMIRKS representation and when will it be generated? It's not totally clear to me the sequence. Is SMIRKS used from the beginning to analyze the .nc files by querying the atom order from the internal OpenMM representation?

Hopefully the above helped on this point. Basically, right now we have this "one-off" gas phase data that we generated from AMBER prmtop and crd files, and I can easily generate a dictionary that maps SMIRKS patterns to atom numbers or vice versa, just for the purposes of pulling the data we want from those simulations.
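
A minimal sketch of the kind of two-way dictionary meant here, assuming a hand-built table for one molecule. The atom indices and the second pattern are made up for illustration, and in practice the table would be generated by matching the SMIRKS against the molecule with a cheminformatics toolkit.

```python
def canonical(atoms):
    """Order-independent key: a bond (1, 0) is the same term as (0, 1)."""
    atoms = tuple(atoms)
    return min(atoms, atoms[::-1])


def invert(smirks_to_atoms):
    """Build the atoms -> SMIRKS lookup from the SMIRKS -> atoms table."""
    atoms_to_smirks = {}
    for smirks, matches in smirks_to_atoms.items():
        for atoms in matches:
            key = canonical(atoms)
            if key in atoms_to_smirks:
                raise ValueError(f"atoms {key} matched by more than one pattern")
            atoms_to_smirks[key] = smirks
    return atoms_to_smirks


# Hypothetical table for a single molecule; indices are illustrative only.
smirks_to_atoms = {
    "[#6:1]-[#6:2]": [(0, 1)],
    "[#6:1]-[#8:2]": [(1, 2)],
}
atoms_to_smirks = invert(smirks_to_atoms)
print(atoms_to_smirks[canonical((1, 0))])
```

Canonicalizing the atom tuple keeps lookups working regardless of the direction in which the analysis code recorded a bond, angle, or torsion.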

I suppose one way to do it (which I think is implied above) is to pull together and process the .nc data (by atom number), then store it in a data type that is an OEChem molecule of that type created from the .prmtop, then assign a property value to each bond (assuming the atom numbers are preserved from the .prmtop). Presumably any creation of the molecule would preserve bonds, since OpenMM is preserving the bond numbers when it creates the .nc file.

This may work, but it may also be more trouble than it's worth.

So, my suggestion is to create a separate issue where you explain exactly what you've achieved so far in terms of extracting the data, and then I can sort out the issue of the associated SMIRKS patterns there.

mrshirts commented 7 years ago

John, what specific things do you need more information on to write the API? My current feeling is that the IsolateMolecule stuff can be ignored for now -- we want to be focused on liquid phase. I think that if we need to go back and do some additional isolated molecule work, we can probably write some relatively limited case tools. I think we've decided that the applications for isolated molecules are somewhat limited and going in different directions than the liquid ones, so making a unified API isn't important.
