openforcefield / open-forcefield-tools

Tools for open forcefield development
MIT License
8 stars 6 forks source link

Build Status

Property calculation toolkit

Property calculation toolkit from the Open Forcefield Consortium.

This toolkit provides an API for storing, manipulating, and computing measured physical properties from simulation data.

API docs

Physical property measurements

Physical property measurements are measured properties of a substance that provide some information about the physical parameters that define the interactions within the substance. A physical property is defined by a combination of:

An example of each:

Physical substances

We generally use the concept of a liquid or gas Mixture, which is a subclass of Substance.

A simple liquid has only one component:

liquid = Mixture()
liquid.addComponent('water')

A binary mixture has two components:

binary_mixture = Mixture()
binary_mixture.addComponent('water', mole_fraction=0.2)
binary_mixture.addComponent('methanol') # assumed to be rest of mixture if no mole_fraction specified

A ternary mixture has three components:

ternary_mixture = Mixture()
ternary_mixture.addComponent('ethanol', mole_fraction=0.2)
ternary_mixture.addComponent('methanol', mole_fraction=0.2)
ternary_mixture.addComponent('water')

The infinite dilution of one solute within a solvent or mixture is also specified as a Mixture, where the solute has zero mole fraction:

infinite_dilution = Mixture()
infinite_dilution.addComponent('phenol', impurity=True) # infinite dilution; one copy only of the impurity
infinite_dilution.addComponent('water')

For testing against synthetic data, we also make use of isolatedMolecule objects (also subclasses of Substance) that represent a single molecule:

phenol = IsolatedMolecule(iupac='phenol')
ethane = IsolatedMolecule(smiles='CC')

You can iterate over the components in a mixture:

for component in mixture.components:
    print (component.iupac_name, component.mole_fraction)

retrieve a component by name:

component = mixture.components['ethanol']

or get the number of components in a mixture:

ncomponents = mixture.ncomponents

or check if a component is an impurity:

if component.impurity == True:
    ...

Thermodynamic states

A ThermodynamicState specifies a combination of thermodynamic parameters (e.g. temperature, pressure) at which a measurement is performed.

from simtk import unit
thermodynamic_state = ThermodynamicState(pressure=500*unit.kilopascals, temperature=298.15*unit.kelvin)

We use the simtk.unit unit system from OpenMM for units (though we may later migrate to pint for portability).

Measurement methods

A MeasurementMethod subclass has information specific to the particular method used to measure a property (such as experimental uncertainty guidance).

Some examples:

Physical property measurements

A MeasuredPhysicalProperty is a combination of Substance, ThermodynamicState, and a unit-bearing measured property value and uncertainty:

# Define mixture
mixture = Mixture()
mixture.addComponent('water', mole_fraction=0.2)
mixture.addComponent('methanol')
# Define thermodynamic state
thermodynamic_state = ThermodynamicState(pressure=500*unit.kilopascals, temperature=298.15*unit.kelvin)
# Define measurement
measurement = ExcessMolarEnthalpy(substance, thermodynamic_state, value=83.3863244*unit.kilojoules_per_mole, uncertainty=0.1220794866*unit.kilojoules_per_mole)

The various properties are all subclasses of MeasuredPhysicalProperty and generally follow the <ePropName/> ThermoML tag names.

Some examples of MeasuredPhysicalProperty:

There are also some examples of measured physical properties we use exclusively for testing against synthetic data:

For example:

molecule = IsolatedMolecule(smiles=smiles)
thermodynamic_state = ThermodynamicState(temperature=300*unit.kelvin)
mean_potential = MeanPotentialEnergy(molecule, thermodynamic_state, value=124.4*unit.kilojoules_per_mole, uncertainty=14.5*unit.kilojoules_per_mole)
bond_average = BondMoment(molecule, thermodynamic_state, value=1.12*unit.angstroms, uncertainty=0.02*unit.angstroms, moment=1, smirks='[#6:1]-[#6:2]')
bond_variance = BondMoment(molecule, thermodynamic_state, value=0.05*unit.angstroms**2, uncertainty=0.02*unit.angstroms**2, moment=2, smirks='[#6:1]-[#6:2]')
angle_average = AngleMoment(molecule, thermodynamic_state, value=20*unit.degrees, uncertainty=0.05*unit.degrees, moment=1, smirks='[#6:1]-[#6:2]-[#6:3]')
torsion_moment = TorsionMoment(molecule, thermodynamic_state, value=(0.5, 0.3), uncertainty=(0.05, 0.03), moment=1, smirks='[#6:1]-[#6:2]-[#6:3]-[#6:4]')

QUESTIONS

A roadmap of physical properties to be implemented is available. Please raise an issue if your physical property of interest is not listed!

Each MeasuredPhysicalProperty has several properties:

The value, uncertainty, reference, and DOI do not necessarily need to be defined for a dataset in order for property calculations to be performed.

Physical property datasets

A PhysicalPropertyDataset is a collection of MeasuredPhysicalProperty objects that are related in some way.

dataset = PhysicalPropertyDataset([measurement1, measurement2])

The dataset is iterable:

dataset = PhysicalPropertyDataset([measurement1, measurement2])
for measurement in dataset:
    print measurement.value

and has accessors to retrieve DOIs and references associated with measurements in the dataset:

# Print the DOIs associated with this dataset
print(dataset.DOIs)
# Print the references associated with this dataset
print(dataset.references)

For convenience, you can retrieve the dataset as a pandas DataFrame:

dataset.to_pandas()

ThermoML datasets

A ThermoMLDataset object represents a physical property dataset stored in the IUPAC-standard ThermoML for specifying thermodynamic properties in XML format. ThermoMLDataset is a subclass of PhysicalPropertyDataset, and provides the same API interface (in addition to some ThermoML-specfic methods). Direct access to the NIST ThermoML Archive is supported for obtaining physical property measurements in this format directly from the NIST TRC repository.

For example, to retrieve the ThermoML dataset that accompanies this paper, we can simply use the DOI 10.1016/j.jct.2005.03.012 as a key for creating a PhysicalPropertyDataset subclassed object from the ThermoML Archive:

dataset = ThermoMLDataset(keys='10.1016/j.jct.2005.03.012')

You can also specify multiple ThermoML Archive keys to create a dataset from multiple ThermoML files:

thermoml_keys = ['10.1021/acs.jced.5b00365', '10.1021/acs.jced.5b00474']
dataset = ThermoMLDataset(keys=thermoml_keys)

It is also possible to specify ThermoML datasets housed at other locations, such as

dataset = ThermoMLDataset(url='http://openforcefieldgroup.org/thermoml-datasets')

or

dataset = ThermoMLDataset(url='file:///Users/choderaj/thermoml')

or

dataset = ThermoMLDataset(keys=['10.1021/acs.jced.5b00365', '10.1021/acs.jced.5b00474'], url='http://openforcefieldgroup.org/thermoml-datasets')

or from ThermoML and a different URL:

dataset = ThermoMLDataset(keys=thermoml_keys)
dataset.retrieve(keys=local_keys, url='http://openforcefieldgroup.org/thermoml-datasets')

You can see which DOIs contribute to the current ThermoMLDataset with the convenience functions:

print(dataset.DOIs)

NIST has compiled a JSON frame of corrections to uncertainties. These can be used to update or correct data uncertainties and discard outliers using applyNISTUncertainties():

# Modify uncertainties according to NIST evaluation
dataset.applyNISTUncertainties(nist_uncertainties, adjust_uncertainties=True, discard_outliers=True)

TODO:

Other datasets

In future, we will add interfaces to other online datasets, such as

Estimating properties

The PropertyEstimator class creates objects that handle property estimation of all of the properties in a dataset, given a set or sets of parameters. The implementation will isolate the user from whatever backend (local machine, HPC cluster, XSEDE resources, Amazon EC2) is being used to compute the properties, as well as whether new simulations are being launched and analyzed or existing simulation data is being reweighted. Different backends will take different optional arguments, but here is an example that will launch and use 10 worker processes on a cluster:

estimator = PropertyEstimator(nworkers=10) # NOTE: multiple backends will be supported in the future
computed_properties = estimator.computeProperties(dataset, parameter_sets)

Here, dataset is a PhysicalPropertyDataset or subclass, and parameter_sets is a list containing ForceField objects used to parameterize the physical systems in the dataset. This can be a single parameter set or multiple (usually closely related) parameter sets.

PropertyEstimator.computeProperties(...) returns a list of ComputedPhysicalProperty objects that provide access to several pieces of information:

TODO:

This API can be extended in the future to provide access to the simulation data used to estimate the property, such as

# Attach to my compute and storage resources
estimator = PropertyEstimator(...)
# Estimate some properties
computed_properties = estimator.computeProperties(dataset, parameters)
# Get statistics about simulation data that went into each property
for property in computed_properties:
   # Get statistics about simulation data that was reweighted to estimate this property
   for simulation in property.simulations:
      print('The simulation was %.3f ns long' % (simulation.length / unit.nanoseconds))
      print('The simulation was run at %.1f K and %.1f atm' % (simulation.thermodynamic_state.temperature / unit.kelvin, simulation.thermodynamic_state.pressure / unit.atmospheres))
      # Get the ParameterSet that was used for this simulation
      parameters = simulation.parameters
      # what else do you want...?

In future, we will want to use a parallel key/value database like cassandra to store simulations, along with a distributed task management system like celery with redis.

API Usage Examples

Using the high-level API

In this example, datasets are retrieved from the ThermoML and filtered to retain certain properties. The corresponding properties for a given parameter set filename are then computed for a SMIRFF parameter set and printed.

# Define the input datasets from ThermoML
thermoml_keys = ['10.1016/j.jct.2005.03.012', ...]
dataset = ThermoMLDataset(thermoml_keys)
# Filter the dataset to include only molar heat capacities measured between 280-350 K
dataset.filter(ePropName='Excess molar enthalpy (molar enthalpy of mixing), kJ/mol') # filter to retain only this property name
dataset.filter(VariableType='eTemperature', min=280*unit.kelvin, max=350*kelvin) # retain only measurements with `eTemperature` in specified range
# Load an initial parameter set
parameter_set = [ SMIRFFParameterSet('smarty-initial.xml') ]
# Compute physical properties for these measurements
estimator = PropertyEstimator(nworkers=10) # NOTE: multiple backends will be supported in the future
computed_properties = estimator.computeProperties(dataset, parameter_set)
# Write out statistics about errors in computed properties
for (computed, measured) in (computed_properties, dataset):
   property_unit = measured.value.unit
   print('%24s : experiment %8.3f (%.3f) | calculated %8.3f (%.3f) %s' % (measured.value / property_unit, measured.uncertainty / property_unit, computed.value / property_unit, computed.uncertainty / property_unit, str(property_unit))

Using the low-level API

TBD