Decide API for equilibrium expectation datasets

davidlmobley commented 8 years ago

We need to settle how to handle non-ThermoML datasets within the PropertyCalculator framework, most immediately for calculation of equilibrium expectation values of other properties.

As noted here ( https://github.com/open-forcefield-group/open-forcefield-tools/pull/3#issuecomment-227601790 ) the current draft API (https://github.com/jchodera/open-forcefield-tools/blob/master/README.md) being brought in via #3 envisions PhysicalPropertyMeasurement objects compiled into a PhysicalPropertyDataset. One specific example is given for ThermoMLDataset as a derivative of a PhysicalPropertyDataset, and that's certainly one of the key datasets we need in the short term.

One of the other things we plan in the short term is to run equilibrium MD simulations of the AlkEthOH set in the gas phase using parm@frosst parameters and compute equilibrium expectations - energies (total as well as possibly divided into components from individual bonds, angles, and torsions) and bond, angle, and torsional distributions. We would then use this (or subsets thereof) as a dataset for parameterization to see if we can recover the parameter set. So, we need to decide how this would be framed in terms of a dataset - either a PhysicalPropertyDataset or something else.

It seems like the description of the physical system we're interested in could fit within the existing framework of a PhysicalPropertyMeasurement, i.e.:

A Mixture specifying the substance that the measurement was performed on: Here, this would be a single component mixture
A ThermodynamicState specifying the thermodynamic conditions under which the measurement was performed: Here this would specify gas phase and the particular temperature.
A PhysicalProperty is the physical property that was measured: Here we would have multiple such measurements and the only significant change would arise here. Examples below for my case.
A MeasurementMethod specifying the kind of measurement that was performed: Here this would be EquilibriumMDSimulation as the source of our equilibrium expectations.

As noted elsewhere, each PhysicalPropertyMeasurement has several properties:

.substance - the Mixture for which the measurement was made
.thermodynamic_state - the ThermodynamicState
.value - the unit-bearing measurement value (it is not explicitly stated in the API docs, but I assume this would be, for example, a PhysicalProperty such asExcessMolarEnthalpy`, @jchodera ?)
.uncertainty - the standard uncertainty of the measurement
.reference - the literature reference (if present) for the measurement
.DOI - the literature reference DOI (if available) for the measurement

If we were to extend the API to allow equilibrium expectations from the envisioned gas phase MD simulations, our crucial need would be for suitable PhysicalProperty types. Probably we would need something like:

GasPhaseSingleMoleculeAverageEnergy
GasPhaseSingleMoleculeBondDistance
GasPhaseSingleMoleculeBondAngle
GasPhaseSingleMoleculeTorsionAngle

These would look for a definition of which molecule to be examined in the Mixture (which must, for these, be a single-component since the data will be for that single component in the gas phase). I think that for GasPhaseSingleMoleculeAverageEnergy, no additional info is needed beyond what's already provided by PhysicalPropertyMeasurement. However, for the bonded properties we would need some additional info:

.smarts: SMARTS pattern uniquely identifying (within the specified molecule) which bond, angle, or torsion is being examined

Any other needed properties?

A usage example would be:

dataset = GasPhaseSingleMoleculeAverageEnergy( '/Users/dmobley/phenol_gasphase.cdf4')
dataset += GasPhaseSingleMoleculeBondDistance('/Users/dmobley/phenol_gasphase.cdf4', '[#5]~[#8]') #Analyze C-O bond length
# Compute physical properties for these measurements
estimator = PropertyEstimator(nworkers=1)
computed_properties = estimator.computeProperties(dataset, parameters)
# Write out statistics about errors in computed properties
for (computed, measured) in (computed_properties, dataset):
   property_unit = measured.value.unit
   print('%24s : parm@frosst value %8.3f (%.3f) | calculated new value %8.3f (%.3f) %s' % (measured.value / property_unit, measured.uncertainty / property_unit, computed.value / property_unit, computed.uncertainty / property_unit, str(property_unit))

I'm ambivalent at this point whether GasPhaseSingleMoleculeAverageEnergy and relatives would actually analyze stored trajectory data (as I have it currently written above) or whether these would read results of analysis done separately via another tool. Input, @jchodera or others?

In all likelihood I've missed some key aspects, but does that look feasible?

davidlmobley commented 8 years ago

@mrshirts , do you want to weigh in on this one at all?

mrshirts commented 8 years ago

A PhysicalProperty is the physical property that was measured: Here we would have multiple such measurements

Couldn't this hold for datasets as well? There could be multiple measurements at the same conditions.

davidlmobley commented 8 years ago

Couldn't this hold for datasets as well? There could be multiple measurements at the same conditions.

There is already the infrastructure in place in the API for i.e. thermoML to specify multiple measurements, which I think is what you're asking about.

mrshirts commented 8 years ago

Can't we just have: ThermodynamicState = GasPhase PhysicalProperty = SingleMoleculeAverageEnergy MeasurementMethod = EquilibriumMD

Do we need to have more complicated PhysicalProperty objects like: GasPhaseSingleMoleculeAverageEnergy

davidlmobley commented 8 years ago

@mrshirts - that may work. I was having trouble convincing myself that "gas phase" really was something that belonged in the ThermodynamicState; of course it does at some level, but for condensed phase properties we'll want to specify a meaningful pressure and temperature there. I guess we could make P=0 trigger "special case" gas phase behavior or something. Is that what you're thinking? Certainly, compared to most properties we'll be interested in, gas phase is a special case, so a key question is how to best recognize that...

mrshirts commented 8 years ago

I think that "phase" can be an additional information beyond just T and P. For example, in crystal polymorphs, you can have different polymorphs that are metastable at the same T and P. So it's not weird to make the phase a different quantity. Or for polymers, you can have different phases (gyroid, lamellar, etc).

Ideal gas could be a separate phase -- that's a imaginary phase that is approached by real systems when T->inf and P->0, but it might make sense to call it something separate. QM calculations are pretty good in ideal gas for heat capacities. . .

davidlmobley commented 8 years ago

OK, @mrshirts . Can you then propose (ideally via a pull request to modify the README.md) what needs to be in the API draft in the README.md here: https://github.com/open-forcefield-group/open-forcefield-tools to handle ThermodynamicState info of this sort? i.e., how specifically would we store this type of info in ThermodynamicState? John is planning to implement tomorrow and this weekend, so if we need it in there we have to codify how exactly.

Right now his example of ThermodynamicState is "298 Kelvin, 1 atmosphere", so we're saying ThermodynamicState contains a temperature and pressure. If we want to generalize it to the cases we mentioned (to specify gas phase, or potentially to specify a particular polymorph down the road) what additional information does it need to have and how exactly should it be stored?

I think what you're proposing right now is that ThermodynamicState just have an additional property aside from .temperature and .pressure which would be .phase which would be a ____ (?? text string?) describing the phase of the system. Though, to accommodate the gas phase MD simulation data we'd be using we'd also need to specify that if I say I want a .phase='gas' simulation of methane, I don't mean methane GAS, I mean a single molecule of methane in the gas phase. How does that fit in to your idea of ThermodynamicState? How would we say we want only one molecule?

The reason I proposed the SingleMoleculeGasPhase... stuff was that I thought it involves two special cases at once:

we want to simulate only a single molecule
we want to simulate in the gas phase

and neither of those fits neatly into our definition of what we "normally" want to simulate here (typically a liquid mixture), so the combination of the two is sort of a "very special" case which I thought meant it needed to be carefully broken out for special treatment. Treating them as a special type of PhysicalPropertyMeasurement allows us to avoid having to introduce a lot of new concepts into the definition of a ThermodynamicState and just pick these up because we run into a special case type of measurement.

To put it another way - I'm all for making ThermodynamicState more general when it's going to help us cover more ground and use cases down the line, but I'm not that enthusiastic about loading up more information there only to help us handle weird special cases which are not representative of general use cases. So I'm trying to sort out which territory we're in here - will putting more info in ThermodynamicState for THIS problem help us later, or is it just loading up info we aren't going to be using?

davidlmobley commented 8 years ago

Also, on this part, @mrshirts :

Ideal gas could be a separate phase -- that's a imaginary phase that is approached by real systems when T->inf and P->0, but it might make sense to call it something separate. QM calculations are pretty good in ideal gas for heat capacities. . .

Just to clarify again - we're not here talking about gas phase, we're talking about a simulation of a single molecule of methane, or cyclohexane, or whatever, in the gas phase. I agree that if I were talking about gas phase simulations, it might make sense to include this info in the ThermodynamicState (and indeed, we should expand ThermdynamicState to include those when/if we get to where we want to simulate such systems) but right now I'm talking about simulating single molecules in the gas phase, which is a weird corner case in my view.

jchodera commented 8 years ago

Let's not change ThermodynamicState to include the concept of gas phase. One has really no bearing on the other.

We want to replace Mixture (a container for specifying a liquid mixture system) with something that has a concept of a single molecule in gas phase. Can you come up with something like that?

jchodera commented 8 years ago

Maybe IsolatedMolecule, since even IdealGas implies some finite (but noninteracting) concentration ?

davidlmobley commented 8 years ago

@jchodera

We want to replace Mixture (a container for specifying a liquid mixture system) with something that has a concept of a single molecule in gas phase. Can you come up with something like that?

Ah, that would work well. So a MeasuredPhysicalProperty would take a Composition or Substance which could be a Mixture specifying a liquid (pure or mixed) or other types to be determined later, such as a Gas (later) or an IsolatedMolecule (now). Seems perfect.

davidlmobley commented 8 years ago

(note I don't think we want to use IdealGas anywhere, since that's a theoretical construction. I updated my comment above to reflect this.)

davidlmobley commented 8 years ago

So, then, we'd be able to do as I suggested above but generalize so that we'd have something like:

AverageEnergy
BondDistance
BondAngle
TorsionAngle

as properties we would want to calculate for single molecules (and probably later, for molecules in mixtures, etc., so now we kill two birds with one stone).

These would look for a definition of which molecule to be examined in the Composition or Substance. For the bonded properties we would need some additional info, even for the IsolatedMolecule case, specifically:

.smarts: SMARTS pattern uniquely identifying (within the specified molecule) which bond, angle, or torsion is being examined

More generally for a Mixture we would probably want the property calculator just to return a measurement for EACH molecule matching the specified smirks, separately (i.e. if I specify a C-C bond distance and I'm looking at a mixture of cyclohexane and ethanol, I want to be able to look at the two distributions separately).

@jchodera - if this sounds reasonable to you I can put together a PR to update the README.md accordingly. This is a much better solution than what I'd come up with.

mrshirts commented 8 years ago

Let's not change ThermodynamicState to include the concept of gas phase. One has really no bearing on the other.

Well, I'd argue the thermodynamics state includes the phase (see polymorphs comment above) Whether the ThermodynamicState does may be a different question.

since even IdealGas implies some finite (but noninteracting) concentration ?

The per-molecule residual properties would be identically the same (i.e. properties minus the ideal gas term), so if T and P are specified, the properties could be mapped one-to-one with no error. So 1 molecule of an ideal gas could be identified with an isolated molecule; though it's not. It is a theoretical construction, but a pretty well-defined one.

Would be relatively easy to fix up later if needed, and operationally, wouldn't really be difference either way. So I don't really care that much on that in the end :)

A gas can be a mixture just as easily well as a liquid, so maybe something like just Liquid instead of . You can tell if something is pure if it's has a single component.

So we could have Liquid, Gas, IsolatedMolecule, and potentially Solid.

Though again, solids (both metals and polymers) do include polymorphs that have different properties, and it would probably be useful to think about how that might fit in somewhere (even if that is down the road).

jchodera commented 8 years ago

Well, I'd argue the thermodynamics state includes the phase (see polymorphs comment above) Whether the ThermodynamicState does may be a different question.

The probability that a given phase or mixture of phases exists at equilibrium is certainly controlled by the thermodynamic state, but @davidlmobley is looking for a way to encode the information "this measurement was made for an isolated molecule in vacuum". That should be the job of the container object specifying the system composition.

We had previously defined a Mixture (to which we presumably intend to attach utility functions that generated initial conditions for a mixture in the liquid state), so it seems appropriate to define a different container that means "this measurement was made for an isolated molecule in vacuum".

Alternatively, we can separate the container object from an object that prepares initial conditions if you want to use the Mixture container along with a LiquidInitialConditions and IsolatedMoleculeInitialConditions, but this can get weird because not all Mixture objects would be supported by all initial conditions.

The per-molecule residual properties would be identically the same (i.e. properties minus the ideal gas term), so if T and P are specified, the properties could be mapped one-to-one with no error. So 1 molecule of an ideal gas could be identified with an isolated molecule; though it's not. It is a theoretical construction, but a pretty well-defined one.

Good point, but I am not sure we will be dealing with ideal gas mixture measurements, and I am not sure we gain anything by building in the stat mech of ideal gases when our current use case doesn't deal with that.

Would be relatively easy to fix up later if needed, and operationally, wouldn't really be difference either way. So I don't really care that much on that in the end :)

OK

Though again, solids (both metals and polymers) do include polymorphs that have different properties, and it would probably be useful to think about how that might fit in somewhere (even if that is down the road).

Are you mostly worried about solid crystals or amorphous solids?

In the current scheme, we would either keep these as Mixture objects (though for crystals we would need precise stoichiometries, or maybe chemical potentials? if simulating the crystal growth process) so would not want to use mole fractions; or we could define a different AmorphousSolidMixture or Crystal or something. This does conflate phase with composition, but given how hard it is to a sample across phase boundaries, that may not be a big issue.

The more important consideration is this: Given that we wanted to make it as easy as possible to use ThermoML data as input, it is most convenient if there is a one-to-one correspondence between ThermoML information and Python objects. @davidlmobley is already trying to break this by using data sources not in ThermoML, and we are trying to accommodate that without destroying the whole design. Is there a better way to do this that will keep the code very simple conceptually and technically?

For example, could @davidlmobley generate synthetic ThermoML files where we simply extend the set of measurement types in a consistent way? That would have the least impact on the design.

mrshirts commented 8 years ago

One additional thing I thought of; above a critical point, there not really a dividing line between Gas and Liquid. This is generally handled by calling it a fluid -- mostly, there is only one experimentally accessible system at a given T and P that is a fluid (superheated or supercooled fluids are probably not going to be parameterized).

At this point, I'm just thinking through concepts to see what makes logical sense or not. This leads me back to the add an attribute .phase, which could be isolated molecule, liquid, gas, lamellar, etc. But there are drawbacks as well.

In the current scheme, we would either keep these as Mixture objects (though for crystals we would need precise stoichiometries, or maybe chemical potentials? if simulating the crystal growth process) so would not want to use mole fractions;

Stochiometries won't work, since there can be multiple phases with the same composition (especially for pure, but can happen for mixtures).

jchodera commented 8 years ago

Stochiometries won't work, since there can be multiple phases with the same composition (especially for pure, but can happen for mixtures).

I just meant for a crystal, we really do need to simulate the correct number of molecules in the unit cell; we can't be off by a molecule due to a rounding error or else we won't get correct crystal packing.

davidlmobley commented 8 years ago

@jchodera :

The more important consideration is this: Given that we wanted to make it as easy as possible to use ThermoML data as input, it is most convenient if there is a one-to-one correspondence between ThermoML information and Python objects. @davidlmobley is already trying to break this by using data sources not in ThermoML, and we are trying to accommodate that without destroying the whole design. Is there a better way to do this that will keep the code very simple conceptually and technically?

Presumably down the line we will be interested in lots of other data not in ThermoML, such as log D data, other measurements we might make directly ourselves, and Gilson will probably want to try including host-guest binding data (presumably we wouldn't have that as part of our main effort but I'm imagining he's going to want to try it in his group anyway to see what happens). So, I'm all for a clear correspondence where each type of ThermoML information corresponds to a specific Python object, but I think we have to be clear at this point that we're going to have Python objects which DON'T correspond to ThermoML information. Otherwise we're effectively putting ourselves in a world where we say, "well, you can parameterize using any type of data you want... but it has to be data in ThermoML", which is not going to fly well if we are trying to convince others (such as the pharma industry) that this is a framework they can pick up and use however they want, potentially even using their own internal data in new parameterization efforts.

So, to reiterate: each type of data in ThermoML should correspond to a Python object, but not all of our data types will correspond to types of data in ThermoML. This has to be a design consideration here or we won't have nearly enough flexibility.

For example, could @davidlmobley generate synthetic ThermoML files where we simply extend the set of measurement types in a consistent way? That would have the least impact on the design.

Maybe I'm missing something here, but naively this approach seems to me like an incredible hack -- are you saying that, to avoid having to accommodate non-ThermoML data in our framework, we should basically generate "fake" ThermoML datasets/data types for single molecule gas phase measurements? I'm missing what's beneficial about this. Already our property estimator is going to have to be able to recognize (at least when we need it to!) each type of data available in ThermoML when we want to calculate the relevant property -- presumably, this will be some kind of conditional leading to code for computing that property. I am missing why it's better to have the conditional recognize a ThermoML dataset type corresponding to single molecule gas phase data rather than to have the conditional recognize a non-ThermoML dataset type corresponding to single molecule gas phase data...

@davidlmobley is looking for a way to encode the information "this measurement was made for an isolated molecule in vacuum". That should be the job of the container object specifying the system composition.

I very much agree with this. In fact I THOUGHT this resolved the whole issue as I explained in this comment: https://github.com/open-forcefield-group/open-forcefield-tools/issues/10#issuecomment-229210542

Should I re-write the Issue that started this off using the idea from that last comment so you can see what I'm imagining this code would look like?

jchodera commented 8 years ago

Just to comment on the ThermoML aspect:

ThermoML is a really nice format for storing data because it IS standardized. There may be advantages in trying to standardize the community around this format.

If current datasets are not in ThermoML format, it is almost as easy to write a converter to cram the data into ThermoML format as it is to cram it into some internal Python container representation. So why not persuade people to put it in ThermoML format?

If there is data that is currently not supported by the ThermoML format, we get to help drive the extension of the format to accommodate this data.

If one of our main goals is to help define and encourage the adoption of community standards for data sharing that make it easier to collect machine-readable datasets and parameterize forcefields from them, I see a lot of advantages to using this as a carrot: These tools will work automatically on data, as long as it is in the ThermoML standard. It need not be in the Archive, but getting it into the ThemoML format is 90% of the way to sharing it.

davidlmobley commented 8 years ago

Light bulb comes on, @jchodera :

ThermoML is a really nice format for storing data because it IS standardized. There may be advantages in trying to standardize the community around this format.

I see. So you are NOT talking about wanting all the datasets to be in ThermoML or ThermoML datasets, but in ThermoML FORMAT.

If current datasets are not in ThermoML format, it is almost as easy to write a converter to cram the data into ThermoML format as it is to cram it into some internal Python container representation. So why not persuade people to put it in ThermoML format?

OK, so I need to look at the format spec and see if I can think of anything I see us wanting to do that wouldn't fit. Do you have a link to the spec handy? I'll google momentarily.

In any case, now I finally get why you are attached to ThermoMLDataset - it's not because you want ThermoML datasets, it's because you want ThermoML format datasets. Very different. :)

davidlmobley commented 8 years ago

Side note - I need to create a PR to modify the API to clarify that ThermoMLDataset is NOT necessarily a ThermoML dataset but a ThermoML format dataset. :)

davidlmobley commented 8 years ago

Format for ThermoML is here: http://trc.nist.gov/ThermoMLRecommendations.pdf . 72 pages.

davidlmobley commented 8 years ago

Gaah, this format document is complex enough I'm not going to be able to quickly reach any useful conclusion about whether things we want would all fit. Superficially it looks like this basically can be framed to hold any type of average property we want via the right type of extensions.

The one thing that I could see possibly being a problem is that Christopher is very interested in being able to (ultimately) work with distributions rather than just averages. To give a very simple example, imagine we have collected data from gas phase simulations of, say, some alkanes and we observe a particular torsional distribution has two peaks. If we are going to fit this torsion, we presumably need to end up with not just the average value of the torsion, but info on the distribution of torsion angles. So I think his view is that ultimately we'll need to be able to work with and fit to distribution data. But, maybe this still fits - i.e. equilibrium populations in particular torsional bins?

jchodera commented 8 years ago

In any case, now I finally get why you are attached to ThermoMLDataset - it's not because you want ThermoML datasets, it's because you want ThermoML format datasets. Very different. :)

Yes! For convenience, we will make it very easy to grab these datasets from the NIST ThermoML Archive, but you will also be able to specify a directory or URL with ThermoML format files in it instead.

Format for ThermoML is here: http://trc.nist.gov/ThermoMLRecommendations.pdf . 72 pages.

There have been a few revisions to the format---see this page for the latest pointers to the format updates. Kenneth Kroenlein is in the middle of a new set of revisions, which makes it all the more timely for @bmanubay and @mrshirts to work with them to correct deficiencies or enhance the format if needed.

jchodera commented 8 years ago

Gaah, this format document is complex enough I'm not going to be able to quickly reach any useful conclusion about whether things we want would all fit.

The best thing to do is to look through some sample XML files. You can use thermopyl to grab a local copy of some files:

conda install -c choderalab thermopyl
thermoml-update-mirror

davidlmobley commented 8 years ago

OK, so @jchodera - I am totally OK with everything being a ThermoML-format dataset as long as you think it can hold everything we need. Can you think of any obvious types of things it CAN'T hold?

davidlmobley commented 8 years ago

@jchodera - I've poked a little bit at XML files and am starting to get the picture. So as long as it's something we can frame as a named property which has a single value and associated uncertainty (i.e in ePropName for the name of the property) we should be OK. Is that right?

The issue above about populations/histogram type info is probably relevant, but maybe this can be fit into the ThermoML dataset framework as long as we're allowed to be liberal in naming things, such as properties of particular torsional bins if we got to that point.

(Aside from my silly example of why we might look at torsional populations, we'll also at some point have to do torsion drives in this framework presumably, so we have to think about how that might fit into this format...)

Offline for a couple of hours beginning shortly.

jchodera commented 8 years ago

An additional API question:

If we're being asked to compute multiple physical properties for multiple parmeter sets, how do we want to return the data? My original API example just illustrated retrieving a list of computed properties that correspond to measurements in a PhysicalPropertyDataset, but for a single input ParameterSet:

# Attach to my compute and storage resources
estimator = PropertyEstimator(...)
# Estimate some properties
computed_properties = estimator.computeProperties(dataset, parameters)
# Get statistics about simulation data that went into each property
for property in computed_properties:
   # Get statistics about simulation data that was reweighted to estimate this property
   value = property.value
   ...

In the case of multiple parameter sets, do we want to return a list of lists? Or do we want to add another container type, and return a list of ComputedPhysicalPropertyDataset objects?

davidlmobley commented 8 years ago

@jchodera : I actually thought of this when extending it to multiple parameter sets, but realized that your existing framework seems to handle this case reasonably well at least from my perspective. Particularly:

PropertyEstimator.computeProperties(...) returns a list of ComputedPhysicalProperty objects that provide access to several pieces of information:

property.value - the computed property value, with appropriate units

...

property.parameters - a reference to the parameter set used to compute this property

...

So, as long as we get back a list of properties, and each has a particular parameter set tied to it, there's no ambiguity, in that if we have any questions about which entry in the list goes with which parameter set, that information is directly tied to the entries themselves.

I don't have any problem with a list of lists if you prefer that, but even a list of the results would be unambiguous. And the docs could clarify whether this list is ordered by properties and then parameter set, or parameter set and then property.

It seems unnecessary to have another container.

Your call though.

davidlmobley commented 8 years ago

@jchodera : As you get to working more on the property calculation framework this weekend, I just wanted to reawaken the "equilibrium expectation" issue on your radar, as the very first property we'll be trying to parameterize to (really, as soon as we have the framework in place to do so!) is equilibrium expectations extracted from simulations with parm@frosst.

The one big concern I have at this point is that I want to make sure we know how we'll fit data on distributions into the ThermoML format (or at least know that we can without too much pain). Particularly, we'll need to be able to simulate with parm@frosst and extract and store (presumably to an extended ThermoML format) information on the distribution of torsions sampled, not just the average value for a particular torsion. Otherwise, if we simulate a molecule which has a torsion with, say, two equally populated peaks at +/-90 degrees, we'll be able to fit it equally well by (a) a torsion with a single peak at zero degrees, or (b) a torsion with two peaks at +/-N degrees, or (c) a torsion with three peaks at zero, +/-N, or (d)... (e) ... (f)...

The other observables we have in mind (bond, angle, energy, etc.) this should not be as much of an issue for, as we expect these often to basically be unimodal so the mean captures much of the key information (though the standard deviation might perhaps also be useful). But since torsions are periodic, the mean is not a particularly useful quantity and we'll need to be able to work with distribution info, probably for particular torsional bins we would define.

mrshirts commented 8 years ago

On page 592-600ish there's a discussion of representing ThermoML properties as equations using MathML I haven't fully investigated it yet, but it could be possible represent distributions as Fourier transforms to some appropriate periodicity.

On Fri, Jul 8, 2016 at 5:55 PM, David L. Mobley notifications@github.com wrote:

@jchodera https://github.com/jchodera : As you get to working more on the property calculation framework this weekend, I just wanted to reawaken the "equilibrium expectation" issue on your radar, as the very first property we'll be trying to parameterize to (really, as soon as we have the framework in place to do so!) is equilibrium expectations extracted from simulations with parm@frosst.

The one big concern I have at this point is that I want to make sure we know how we'll fit data on distributions into the ThermoML format (or at least know that we can without too much pain). Particularly, we'll need to be able to simulate with parm@frosst and extract and store (presumably to an extended ThermoML format) information on the distribution of torsions sampled, not just the average value for a particular torsion. Otherwise, if we simulate a molecule which has a torsion with, say, two equally populated peaks at +/-90 degrees, we'll be able to fit it equally well by (a) a torsion with a single peak at zero degrees, or (b) a torsion with two peaks at +/-N degrees, or (c) a torsion with three peaks at zero, +/-N, or (d)... (e) ... (f)...

The other observables we have in mind (bond, angle, energy, etc.) this should not be as much of an issue for, as we expect these often to basically be unimodal so the mean captures much of the key information (though the standard deviation might perhaps also be useful). But since torsions are periodic, the mean is not a particularly useful quantity and we'll need to be able to work with distribution info, probably for particular torsional bins we would define.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/open-forcefield-group/open-forcefield-tools/issues/10#issuecomment-231499118, or mute the thread https://github.com/notifications/unsubscribe/AEE31DGOde-v3rcdc_Jg2S1_XF9RXqv2ks5qTuNrgaJpZM4I8MWR .

jchodera commented 8 years ago

I'm with @mrshirts that we want to avoid histograms. I think these will only cause us headaches.

What about expectations of spherical harmonics or some other periodic function?

jchodera commented 8 years ago

Or circular moments?

davidlmobley commented 8 years ago

I like the idea of something like a cosine expansion; I was not familiar with the terminology "circular moments" but those would also do nicely. The key point is that we can't get by with only a simple mean of a torsion, but we can get by with a simple mean of something which is periodic.

I wonder if there are related issues that come up in any other observables we would be interested in? Torsions was the only thing where this was obvious to me. I suppose it could be an issue for bonds if we ever went to a potential more like a Morse potential or something which is decidedly anharmonic. But, again, this should fit well into a framework which would allow circular moments, fourier transforms or similar.

mrshirts commented 8 years ago

Morse potential or something which is decidedly anharmonic. But, again, this should fit well into a framework which would allow circular moments, fourier transforms or similar.

It seems you can represent observations as any object in MathML (again, we'll have to look into this a bit more closely)

If we are collecting a finite amount of data, the Fourier series representation with a decent number of terms (10-ish) should represent most functions with a pretty good amount of fidelity.

davidlmobley commented 8 years ago

@jchodera asked for a reminder of what IsolatedMolecule data we would want to deal with (i.e. for our first test of parameterizing , so here's a reminder:

Initially (simplest test), averages of:

energy
bond distance for specified bonds
angle for specified angle
torsion for specified torsion

Very soon:

Standard deviations of the above (since we can't fit a force constant without the width of a distribution)

Soon:

Fourier series representation for torsions instead of average/standard deviation (since averages and standard deviations only provide enough info for things which are unimodal and symmetric)

Later:

More complicated specifications for the above, like pull out all of the bond/angle/torsional information for BATs for all molecules /cases matching the specified SMIRKS in the specified system

jchodera commented 8 years ago

I've got to the point where I need this API to be nailed down, since you indicated you need this first.

Going back to @davidlmobley's proposed API:

dataset = GasPhaseSingleMoleculeAverageEnergy( '/Users/dmobley/phenol_gasphase.cdf4')
dataset += GasPhaseSingleMoleculeBondDistance('/Users/dmobley/phenol_gasphase.cdf4', '[#5]~[#8]') #Analyze C-O bond length
# Compute physical properties for these measurements
estimator = PropertyEstimator(nworkers=1)
computed_properties = estimator.computeProperties(dataset, parameters)
# Write out statistics about errors in computed properties
for (computed, measured) in (computed_properties, dataset):
   property_unit = measured.value.unit
   print('%24s : parm@frosst value %8.3f (%.3f) | calculated new value %8.3f (%.3f) %s' % (measured.value / property_unit, measured.uncertainty / property_unit, computed.value / property_unit, computed.uncertainty / property_unit, str(property_unit))

Some questions:

Do you really want to use GasPhase here, or are we calling it IsolatedMolecule now?
Do you really want to have a ton of measured properties, one for each valence term and moment order, which would create dozens and dozens of things like
- GasPhaseSingleMoleculeBondDistance
- GasPhaseSingleMoleculeBondDistanceStandardDeviation
- GasPhaseSingleMoleculeBondDistanceThirdMoment
- GasPhaseSingleMoleculeAverageAngle
- etc.

or would it make more sense to have a single function IsolatedMoleculeValenceMoment that takes a type='bond' argument and an order=2 for the order n of the moment E[x^n] that should be computed?

Do we want moments E[x^n] or central moments E[(x-E[x])^n]?
Do we really want to use SMARTS strings to match? What happens if multiple valence terms match?

I'll note that the changes required for this are really making everything much more complicated, since they are such a departure from the original planned API.

davidlmobley commented 8 years ago

@jchodera:

Yes, we're calling it IsolatedMolecule, not GasPhase. And, I think we actually said that it might be helpful to have IsolatedMolecule tied to the composition/substance/mixture rather than to the name of the property, since then one doesn't have to separately implement ValenceMoment for isolated molecules versus for molecules in solution when these are really the same thing.

And yes, I now agree with you that we don't want things like GasPhaseSingleMoleculeBondDistance. Basically we want just a framework that can handle averages of properties. So yes, ValenceMoment would work well (or if for some reason it doesn't make sense to have IsolatedMolecule tied to the composition, we could do IsolatedMoleculeValenceMoment.

Do we want moments E[x^n] or central moments E[(x-E[x])^n]?

Seems like central moments, actually. We'll want the mean, and then the standard deviation.

Perhaps it is not yet time to consider it, but for torsions we'll also want (eventually) a fourier series representation, as noted elsewhere.

Do we really want to use SMARTS strings to match? What happens if multiple valence terms match?

Maybe there would be other ways to achieve the same effect other than SMARTS/SMIRKS. But my thinking here is that we're going to be using this to do things like "pull out all the carbon-carbon bond distances and standard deviations for this molecule" (since we have one set of parameters for those) so we can fit to them. So we're going to want to get them all in one shot, and it would be nice to do it by pulling the same SMIRKS descriptors from the FFXML file - i.e. we loop over parameters from the FFXML file, and for each one we use the SMIRKS string to pull all of the relevant data from our simulations.

If multiple terms match, my thought was that we would (a) get the data aggregated across all matching terms, and (b) optionally get back the data organized by specific match, i.e. as a list of [(atomnumbers, average), (atomnumbers, average), ...]. I'm OK to adjustments to this though - the main point though is that in this case we're going to be looking at aggregated data across our entire set that provides info on C-C bond lengths (for example), and it seems like it would be most convenient just to be able to pull that by specifying the relevant SMIRKS. I'm ambivalent about whether the aggregation occurs before or after pulling the data though.

I'll note that the changes required for this are really making everything much more complicated, since they are such a departure from the original planned API.

I agree this is different than most of the data we're talking about using, but it IS the data it makes most sense for us to fit to first. If you can think of a better way to handle it I'm certainly open to suggestions.

mrshirts commented 8 years ago

Do you really want to have a ton of measured properties, one for each valence term and moment order . . . or would it make more sense to have a single function IsolatedMoleculeValenceMoment that takes a type='bond' argument and an order=2 for the order n of the moment E[x^n] that should be computed?

It depends on how we want these stored. If they are stored as ThermoML entries, then we can't have extra functions -- it's not necessarily the end of the world if we have a bunch of properties.

If we have ThermoML-parallel data structures, then of course we can have extra functions, and the data can be stored in them however we want. We'll need to be a little careful about how to define the 'bond' argument -- will it just be a pair of atom integers? I'm not exactly what other pattern we could use. We'll need to link it somehow to the molecular structure (in whichever format is appropriate -- it starts out in AMBER .prmtop formats, but we can convert it and use some other format.

Do we want moments E[x^n] or central moments E[(x-E[x])^n]?

We need the first moment, since we need the mean. That's not any of the central moments. We need the 2nd central moment (the variance, sigma^2), though sigma is fine. For torsions, we want a Fourier expansion -- moments won't work well with multimodal functions.

Do we really want to use SMARTS strings to match? What happens if multiple valence terms match?

I agree with David's comments here (if I understand them correctly). We want multiple valence terms to match so we can decide if they should be lumped together or not.

I'll note that the changes required for this are really making everything much more complicated, since they are such a departure from the original planned API.

I'd advocate for trying to keep the API simpler, and us handling the departures with separate interfacing functions. One example, just giving up and having a bunch of weird property names IsolatedMoleculeBondMeanMolecule734Atom1Atom10), since we won't do something like this again with the main API -- if there's a few hundred lines of special use code, I would see that as an OK tradeoff for not having to have too complicated an API.

jchodera commented 8 years ago

I fear this is turning into a really vaguely-specified feature request that we're trying to shoehorn into an API design for something else entirely. We'll try a bit more to iterate the design to something concrete, but this is going to significantly slow us down by not having something specific I can implement.

e. as a list of [(atomnumbers, average), (atomnumbers, average), ...].

The atom numbers will be entirely meaningless unless we make the only way of specifying the IsolatedMolecule system be to provide an OEMol object. That is, instead of the analog of

mixture = Mixture()
mixture.addComponent('methanol', mole_fraction=0.5)
mixture.addComponent('water')

which is

ethane = IsolatedMolecule(iupac='ethane')
methanol = IsolatedMolecule(smiles='CO')

we would require you provide

molecule = IsolatedMolecule(oemol)

Be sure to think through the ramifications of all that. Do you really want to have to construct these from OEMol objects that you store serialized alongside the computed valence property averages in order to ensure deterministic atom ordering?

davidlmobley commented 8 years ago

@jchodera - that's a good point relating to OEMols and atom numbering. I don't think it makes sense to have the only way that these be initialized be an OEMol, which leads me back to my preference for SMARTS/SMIRKS for the relevant properties.

Are there any other options you think are reasonable? I just don't see any, because we really want something that allows us to say "pull the data on that valence parameter" without having to worry about atom ordering, etc.

Allowing specifications of SMARTS/SMIRKS and returning properties by those will give us all the flexibility I think we'll need - i.e. if we're looking at CURRENT parameters, then we just query via our current SMARTS/SMIRKS, but if we want to distinguish (later) whether a new environment might be warranted, we can query based on SMIRKS which are derivatives of our current ones (i.e. if we want to see if the data would warrant two C-C bond types, we can query based on what we're proposing those to be).

davidlmobley commented 8 years ago

@jchodera - would it help if I write down an updated usage example here, and then also do a PR into the README.md with what I am proposing?

jchodera commented 8 years ago

Here's my current thinking for API:

For testing against synthetic data, we also make use of isolatedMolecule objects (also subclasses of Substance) that represent a single molecule:

phenol = IsolatedMolecule(iupac='phenol')
ethane = IsolatedMolecule(smiles='CC')

The various properties are all subclasses of MeasuredPhysicalProperty and generally follow the <ePropName/> ThermoML tag names. Some examples:

MassDensity - mass density
ExcessMolarEnthalpy - excess partial apparent molar enthalpy
HeatCapacity - molar heat capacity at constant pressure There are also some examples of measured physical properties we use exclusively for testing against synthetic data:
MeanPotentialEnergy - mean potential energy of a system
BondMoment - the nth moment (mean E[x] if n==1, or central moment E[(x-E[x])^n] if n >= 2) of a specified bond of an isolated molecule
AngleMoment - the nth moment (mean E[x] if n==1, or central moment E[(x-E[x])^n] if n >= 2) of a specified angle of an isolated molecule
TorsionMoment - the nth circular moment (E[e^{i*n*phi}]) of a specified torsion of an isolated molecule

molecule = IsolatedMolecule(smiles=smiles)
thermodynamic_state = ThermodynamicState(temperature=300*unit.kelvin)
mean_potential = MeanPotentialEnergy(molecule, thermodynamic_state, value=124.4*unit.kilojoules_per_mole, uncertainty=14.5*unit.kilojoules_per_mole)
bond_average = BondMoment(molecule, thermodynamic_state, value=1.12*unit.angstroms, uncertainty=0.02*unit.angstroms, moment=1, smirks='[#6:1]-[#6:2]')
bond_variance = BondMoment(molecule, thermodynamic_state, value=0.05*unit.angstroms**2, uncertainty=0.02*unit.angstroms**2, moment=2, smirks='[#6:1]-[#6:2]')
angle_average = AngleMoment(molecule, thermodynamic_state, value=20*unit.degrees, uncertainty=0.05*unit.degrees, moment=1, smirks='[#6:1]-[#6:2]-[#6:3]')
torsion_average = TorsionMoment(molecule, thermodynamic_state, value=(0.5,0.2), uncertainty=(0.1,0.1), moment=1, smirks='[#6:1]-[#6:2]-[#6:3]-[#6:4]')

QUESTIONS

Is it useful to average over any bonds that match the SMARTS strings? How would you even construct those SMARTS strings algorithmically for the molecules in the set, to avoid matching anything else?
Note that we need the first noncentral moment (average), so I had an exception where we don't compute the central moments for moment == 1
Do we really want the variance (second central moment) instead of the standard deviation?

davidlmobley commented 8 years ago

@jchodera :

That looks good to me.

How would you even construct those SMARTS strings algorithmically for the molecules in the set, to avoid matching anything else?

The main use case I see is this: in our initial testing we will take a forcefield in SMIRFF format (with initial, "bad" parameters but with fixed "atom types" (SMIRKS strings)) and fit parameters to reproduce parm@frosst. In this case, we already have the SMIRKS strings (they are in the SMIRFF file already, i.e. for AlkEthOH) and we are just trying to learn the parameters which should be associated with the SMIRKS strings. In this use case, they are going to be "handed to us" effectively. Though, more generally, once we have Smarty working for bonds, angles, and torsions, we will learn the SMIRKS patterns we are going to use for a specified set, then we would be feeding them into this.

Is it useful to average over any bonds that match the SMARTS strings?

I think we will ultimately be averaging across all matches, but we may want to get the un-averaged data back as well. Otherwise, I think there's a bit of a weighting problem, in that if molecule A has five matches for a particular SMIRKS and molecule B has only one, and I later want to combine, I don't know that molecule A's results should be weighted five times as much in the average.

Note that we need the first noncentral moment (average), so I had an exception where we don't compute the central moments for moment == 1

Perfect.

Do we really want the variance (second central moment) instead of the standard deviation?

Well, we should be able to get by with either, presumably. Seems like for API simplicity/consistency we can get by with the variance.

jchodera commented 8 years ago

OK, sounds like were making progress here. Can you flesh out a use case using the proposed API just a bit more, to make sure we're on the same page?

davidlmobley commented 8 years ago

@jchodera : Yes.

So for example, starting from what you have above, if I want to SPECIFY certain values:

molecule = IsolatedMolecule(iupac='ethanol')
thermodynamic_state = ThermodynamicState(temperature=300*unit.kelvin)

mean_potential = MeanPotentialEnergy(molecule, thermodynamic_state, value=124.4*unit.kilojoules_per_mole, uncertainty=14.5*unit.kilojoules_per_mole)
bond_average = BondMoment(molecule, thermodynamic_state, value=1.52*unit.angstroms, uncertainty=0.02*unit.angstroms, moment=1, smirks='[#6:1]-[#6:2]')
bond_variance = BondMoment(molecule, thermodynamic_state, value=0.05*unit.angstroms**2, uncertainty=0.02*unit.angstroms**2, moment=2, smirks='[#6:1]-[#6:2]')
angle_average = AngleMoment(molecule, thermodynamic_state, value=109.5*unit.degrees, uncertainty=0.05*unit.degrees, moment=1, smirks='[#6:1]-[#6:2]-[#6:3]')

We'd also have

torsion_moment_3 = TorsionMoment(molecule, thermodynamic_state, moment=3, value=30, uncertainty = 0.05, smirks='[#6X4:1]-[#6X4:2]-[#8X2:3]-[#1:4]')

for torsions, for example.

But we also want to tell the property estimator we want to calculate these values. We haven't totally made explicit what the dataset structure would be for this (I'm OK with it being ThermoMLDatasets if that's what's easiest for you/makes the most sense, but it could also just a simple collection of the above objects - i.e. `dataset = PhysicalPropertyDataset([bond_average, bond_variance, angle_average]).

Let's assume the latter for now. So then I'd be jumping off of the code above to do something like:

parameter_set = [ SMIRFFParameterSet('smarty-initial.xml') ]
dataset = PhysicalPropertyDataset( [bond_average, bond_variance, angle_average, torsion_moment_3]) 
# Compute physical properties for these measurements
estimator = PropertyEstimator(nworkers=10) # NOTE: multiple backends will be supported in the future
computed_properties = estimator.computeProperties(dataset, parameter_set)
# Write out statistics about errors in computed properties
for (computed, measured) in (computed_properties, dataset):
   property_unit = measured.value.unit
   print('%24s : parm@frosst value %8.3f (%.3f) | newly calculated %8.3f (%.3f) %s' % (measured.value / property_unit, measured.uncertainty / property_unit, computed.value / property_unit, computed.uncertainty / property_unit, str(property_unit))

Is that what you're after? Key point - it's just like we would be doing with ThermoML datasets except that the datasets themselves are rather different.

davidlmobley commented 8 years ago

Ultimately of course we'd be running through this same type of queries for all of the parameters in our dataset. i.e. we'd pull all the SMIRKS from the SMIRFF XML for AlkEthOH (https://github.com/open-forcefield-group/smarty/blob/master/smarty/data/forcefield/Frosst_AlkEtOH.ffxml) and then, with some other (non-parm@frosst) set of parameters:

Loop over all molecules in AlkEthOH
Get the average potential energy for each
For each molecule, loop over all smirks which match any vdw type/bond/angle/torsion
For each match, get the mean and variance (for bonds and angles) or a range of circular moments (for torsions)
Compare to reference values for these calculated with parm@frosst parameters

mrshirts commented 8 years ago

The atom numbers will be entirely meaningless unless we make the only way of specifying the IsolatedMolecule system be to provide an OEMol object

Well, we could just stick with the ordering given by the amber .prmtops. If SMIRKS matching was much easier/faster using OEMol objects, of course, that would be the way to go.

I think there are possibly some super-elegant ways to solve this problem, but I'm not sure they are the right way to go, as we want to focus super-elegant things on the main usage cases.

I agree with David about the workflow.

davidlmobley commented 8 years ago

There will be no amber prmtops anymore, and long term we'll want to have the tool chain be independent of any particular file format/molecular representation, so John is absolutely right that in the user-facing tool we have to ditch atom numbers in those. Smirks will be the way to go -functionally they will work well, and this is also rather in keeping with the whole FF philosophy so far where we use smirks for chemical environment selection.

On Mon, Jul 18, 2016, 6:35 PM Michael Shirts notifications@github.com wrote:

The atom numbers will be entirely meaningless unless we make the only way of specifying the IsolatedMolecule system be to provide an OEMol object

Well, we could just stick with the ordering given by the amber .prmtops. If SMIRKS matching was much easier/faster using OEMol objects, of course, that would be the way to go.

I think there are possibly some super-elegant ways to solve this problem, but I'm not sure they are the right way to go, as we want to focus super-elegant things on the main usage cases.

I agree with David about the workflow.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/open-forcefield-group/open-forcefield-tools/issues/10#issuecomment-233507184, or mute the thread https://github.com/notifications/unsubscribe-auth/AGzUYbBTC1T-2pozyY3GlHq1mqxcmq7lks5qXCn-gaJpZM4I8MWR .

mrshirts commented 8 years ago

There will be no amber prmtops anymore,

OK, I'm not quite seeing the workflow anymore for generating the data for single molecules from the simulations. The .nc (or .dcd) generated by OpenMM are processed, and how are we storing each of the individual properties computed? What sort of dictionary will we use to map between this data (that is recorded by atom number) and the SMIRKS representation and when will it be generated? It's not totally clear to me the sequence. Is SMIRKS used from the beginning to analyze the .nc files by querying the atom order from the internal OpenMM representation?

I suppose one way to do it (that I think its implied above) is to pull together and process the .nc data (by atom number), then store it in a data type that was an OEChem molecule of that type created from the .prmtop, then assigned property value to each bond (assuming the atom numbers are preserved from the .prmtop). Presumably any creation of the molecule would preserve bonds, since OpenMM is preserving the bond numbers when it creates the .nc file.

openforcefield / open-forcefield-tools

Decide API for equilibrium expectation datasets #10