Open davidlmobley opened 8 years ago
@mrshirts - this discussion should probably go into a separate issue. I think what we're talking about right now is how to take reference isolated molecule data that we have already generated (or might generate) outside the property estimator framework and get it into our framework for use in fitting, which is a separate question. Maybe you want to start a new issue for this, and you and I can sort it out (John doesn't need to be involved)?
To clarify: there are two types of isolated molecule data here:
1) Data we generate as reference data for fitting to, which might or might not come through the property estimator framework
2) Data we will generate with our property estimator framework to use as part of fitting to the reference data in item 1.
The API being discussed here concerns item 2, and I think you're asking about item 1.
My group currently has some "one-time" reference data along the lines of item 1, where we went ahead and ran isolated molecule gas phase simulations of AlkEthOH molecules with parm@frosst parameters beginning from AMBER .prmtop and .crd files. We could extract expectations from this and easily put it into this format:
mean_potential = MeanPotentialEnergy(molecule, thermodynamic_state, value=124.4*unit.kilojoules_per_mole, uncertainty=14.5*unit.kilojoules_per_mole)
bond_average = BondMoment(molecule, thermodynamic_state, value=1.52*unit.angstroms, uncertainty=0.02*unit.angstroms, moment=1, smirks='[#6:1]-[#6:2]')
I know you've been working on processing that data by atom number; I could easily generate the SMIRKS patterns which would allow us to link the atom numbers (in this one case) to SMIRKS strings. Alternatively, we could just re-run the gas phase calculations using the property estimator framework when it is online, using parm@frosst parameters as input.
To get back to this issue:
> There will be no AMBER prmtops anymore
Aside from the reference data from Item 1 above, we'll be parameterizing molecules henceforth using the SMIRFF XML format and going directly into OpenMM, so we're no longer going to be using AMBER file formats.
> OK, I'm not quite seeing the workflow anymore for generating the data for single molecules from the simulations. The .nc (or .dcd) files generated by OpenMM are processed, and how are we storing each of the individual properties computed? What sort of dictionary will we use to map between this data (which is recorded by atom number) and the SMIRKS representation, and when will it be generated? The sequence isn't totally clear to me. Is SMIRKS used from the beginning to analyze the .nc files by querying the atom order from the internal OpenMM representation?
Hopefully the above helped on this point. Basically, right now we have this "one off" gas phase data that we generated from AMBER prmtop and crd files, and I can easily generate a dictionary which will map SMIRKS patterns to atom numbers or vice versa, just for the purposes of pulling the data we want from those simulations.
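A minimal sketch of what such a dictionary could look like, in plain Python. The SMIRKS patterns and atom indices here are invented for illustration and are not taken from the AlkEthOH simulations; `canonical` and `invert` are hypothetical helper names.

```python
# Hypothetical mapping for one molecule: SMIRKS pattern -> list of atom-index
# tuples (0-based, in prmtop order) matched by that pattern.
# Patterns and indices are illustrative only, not real AlkEthOH data.
smirks_to_atoms = {
    "[#6:1]-[#6:2]": [(0, 1), (1, 2)],
    "[#6:1]-[#8:2]": [(2, 3)],
    "[#6:1]-[#6:2]-[#8:3]": [(1, 2, 3)],
}


def canonical(atoms):
    """Treat (i, j, k) and (k, j, i) as the same bond, angle, or torsion."""
    atoms = tuple(atoms)
    return min(atoms, atoms[::-1])


def invert(mapping):
    """Build the reverse lookup: canonical atom-index tuple -> SMIRKS pattern.

    If two patterns match the same atoms, the later one silently wins;
    a real tool would need to decide how to handle that case.
    """
    atoms_to_smirks = {}
    for smirks, matches in mapping.items():
        for atoms in matches:
            atoms_to_smirks[canonical(atoms)] = smirks
    return atoms_to_smirks


atoms_to_smirks = invert(smirks_to_atoms)

# Look up which pattern a bond recorded as atoms (3, 2) belongs to.
print(atoms_to_smirks[canonical((3, 2))])
```

Canonicalizing the tuple discards which atom carries which SMIRKS label, which is acceptable for symmetric quantities like bond lengths but would need more care otherwise.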
> I suppose one way to do it (which I think is implied above) is to pull together and process the .nc data (by atom number), then store it in a data type that is an OEChem molecule of that type created from the .prmtop, then assign a property value to each bond (assuming the atom numbers are preserved from the .prmtop). Presumably any creation of the molecule would preserve bonds, since OpenMM is preserving the bond numbers when it creates the .nc file.
This may work, but it may also be more trouble than it's worth.
So, my suggestion is to create a separate issue where you explain exactly what you've achieved so far in terms of extracting the data, and then I can sort out the issue of the associated SMIRKS patterns there.
John, what specific things do you need more information on to write the API? My current feeling is that the IsolateMolecule stuff can be ignored for now -- we want to be focused on liquid phase. I think that if we need to go back and do some additional isolated molecule work, we can probably write some relatively limited case tools. I think we've decided that the applications for isolated molecules are somewhat limited and going in different directions than the liquid ones, so making a unified API isn't important.
We need to settle how to handle non-ThermoML datasets within the `PropertyCalculator` framework, most immediately for calculation of equilibrium expectation values of other properties.

As noted here ( https://github.com/open-forcefield-group/open-forcefield-tools/pull/3#issuecomment-227601790 ) the current draft API (https://github.com/jchodera/open-forcefield-tools/blob/master/README.md) being brought in via #3 envisions `PhysicalPropertyMeasurement` objects compiled into a `PhysicalPropertyDataset`. One specific example is given for `ThermoMLDataset` as a derivative of a `PhysicalPropertyDataset`, and that's certainly one of the key datasets we need in the short term.

One of the other things we plan in the short term is to run equilibrium MD simulations of the AlkEthOH set in the gas phase using parm@frosst parameters and compute equilibrium expectations - energies (total as well as possibly divided into components from individual bonds, angles, and torsions) and bond, angle, and torsional distributions. We would then use this (or subsets thereof) as a dataset for parameterization to see if we can recover the parameter set. So, we need to decide how this would be framed in terms of a dataset - either a `PhysicalPropertyDataset` or something else.

It seems like the description of the physical system we're interested in could fit within the existing framework of a `PhysicalPropertyMeasurement`, i.e.:

- `Mixture` specifying the substance that the measurement was performed on: Here, this would be a single component mixture
- `ThermodynamicState` specifying the thermodynamic conditions under which the measurement was performed: Here this would specify gas phase and the particular temperature.
- `PhysicalProperty` is the physical property that was measured: Here we would have multiple such measurements and the only significant change would arise here. Examples below for my case.
- `MeasurementMethod` specifying the kind of measurement that was performed: Here this would be `EquilibriumMDSimulation` as the source of our equilibrium expectations.

As noted elsewhere, each `PhysicalPropertyMeasurement` has several properties:

- `.substance` - the `Mixture` for which the measurement was made
- `.thermodynamic_state` - the `ThermodynamicState`
- `.value` - the unit-bearing measurement value (it is not explicitly stated in the API docs, but I assume this would be, for example, a `PhysicalProperty` such as `ExcessMolarEnthalpy`, @jchodera ?)
- `.uncertainty` - the standard uncertainty of the measurement
- `.reference` - the literature reference (if present) for the measurement
- `.DOI` - the literature reference DOI (if available) for the measurement

If we were to extend the API to allow equilibrium expectations from the envisioned gas phase MD simulations, our crucial need would be for suitable `PhysicalProperty` types. Probably we would need something like:

- `GasPhaseSingleMoleculeAverageEnergy`
- `GasPhaseSingleMoleculeBondDistance`
- `GasPhaseSingleMoleculeBondAngle`
- `GasPhaseSingleMoleculeTorsionAngle`

These would look for a definition of which molecule to be examined in the `Mixture` (which must, for these, be a single-component since the data will be for that single component in the gas phase). I think that for `GasPhaseSingleMoleculeAverageEnergy`, no additional info is needed beyond what's already provided by `PhysicalPropertyMeasurement`. However, for the bonded properties we would need some additional info:

- `.smarts`: SMARTS pattern uniquely identifying (within the specified molecule) which bond, angle, or torsion is being examined

Any other needed properties?
A usage example would be:
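A minimal sketch of what such usage might look like, built from stand-in dataclasses rather than the draft API itself. Every class signature below, the `measurement_from_samples` helper, the SMILES string, and the sample values are assumptions made for illustration; only the class and attribute names come from the discussion above. An in-memory list of per-frame bond lengths stands in for a stored trajectory.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import mean, stdev
from typing import Optional, Sequence


@dataclass(frozen=True)
class Mixture:
    components: tuple  # hypothetical: one SMILES string per component


@dataclass(frozen=True)
class ThermodynamicState:
    temperature_kelvin: float
    phase: str = "gas"


@dataclass(frozen=True)
class GasPhaseSingleMoleculeBondDistance:
    smarts: str  # pattern identifying which bond is being examined
    unit: str = "angstrom"


@dataclass(frozen=True)
class PhysicalPropertyMeasurement:
    substance: Mixture
    thermodynamic_state: ThermodynamicState
    physical_property: object
    value: float
    uncertainty: float
    method: str = "EquilibriumMDSimulation"
    reference: Optional[str] = None
    DOI: Optional[str] = None


def measurement_from_samples(substance, state, prop, samples: Sequence[float]):
    """Average per-frame samples into a single measurement (hypothetical helper)."""
    if len(substance.components) != 1:
        raise ValueError("gas-phase single-molecule properties need a single-component Mixture")
    average = mean(samples)
    # Naive standard error of the mean; it assumes uncorrelated samples,
    # which consecutive MD frames generally are not.
    standard_error = stdev(samples) / sqrt(len(samples))
    return PhysicalPropertyMeasurement(substance, state, prop, average, standard_error)


# Illustrative inputs only: a made-up molecule and made-up bond lengths.
ethanol = Mixture(components=("CCO",))
state = ThermodynamicState(temperature_kelvin=298.15)
cc_bond = GasPhaseSingleMoleculeBondDistance(smarts="[#6:1]-[#6:2]")
per_frame_lengths = [1.50, 1.52, 1.54, 1.52]

m = measurement_from_samples(ethanol, state, cc_bond, per_frame_lengths)
print(m.physical_property.smarts, m.value, m.uncertainty)
```

In this sketch the property object only carries the `.smarts` selector, and the averaging lives in a separate helper, so the same measurement record could equally be filled from values computed by another tool.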
I'm ambivalent at this point whether `GasPhaseSingleMoleculeAverageEnergy` and relatives would actually analyze stored trajectory data (as I have it currently written above) or whether these would read results of analysis done separately via another tool. Input, @jchodera or others?

In all likelihood I've missed some key aspects, but does that look feasible?