openforcefield / nistdataselection

Records the tools and decisions used to select NIST data for curation.
MIT License

More Robust Way to Choose Data Points #4

Closed SimonBoothroyd closed 5 years ago

SimonBoothroyd commented 5 years ago

Description

This PR attempts to add a more robust method for choosing which state points (T, p) to include in the final data set.

The previous approach simply chose the data points at the lowest, middle and highest temperatures for each property and each substance, treating every property independently. This can select state points which are not shared between properties. As an example, suppose we had the following available data for a given substance:

Densities: (285K, 1atm), (290K, 1atm), (298K, 1atm), (305K, 1atm), (310K, 1atm)
Dielectrics: (285K, 1atm), (298K, 1atm), (300K, 1atm), (305K, 1atm), (310K, 1atm)

the previous approach would choose for the final data set:

Densities: (285K, 1atm), (298K, 1atm), (310K, 1atm)
Dielectrics: (285K, 1atm), (300K, 1atm), (310K, 1atm)

as these are the data points at the minimum, middle and maximum temperatures for each property respectively. However, this would require a total of four separate simulations (at 285K, 298K, 300K and 310K) to estimate all of the properties.
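To make the count concrete, here is a minimal sketch of the per-property selection described above. The function and variable names are hypothetical and are not taken from the nistdataselection code; since every point in the example is at 1 atm, a state point is represented by its temperature alone.

```python
# Hypothetical illustration only; these names do not come from nistdataselection.
# All example points are at 1 atm, so only the temperature (K) is stored.
available = {
    "density": [285.0, 290.0, 298.0, 305.0, 310.0],
    "dielectric": [285.0, 298.0, 300.0, 305.0, 310.0],
}


def select_min_mid_max(temperatures):
    """Return the lowest, middle and highest of the given temperatures."""
    ordered = sorted(temperatures)
    return [ordered[0], ordered[len(ordered) // 2], ordered[-1]]


# Each property is handled independently of the others.
chosen = {name: select_min_mid_max(temps) for name, temps in available.items()}

# Every distinct temperature needs its own simulation.
simulations = sorted({t for temps in chosen.values() for t in temps})
print(chosen)
print(len(simulations), "simulations at", simulations)
```

Because the middle point is picked per property, the density and dielectric selections disagree (298K versus 300K), which is where the fourth simulation comes from.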

The new approach looks for data points which still give good coverage over the temperature and pressure ranges of interest, but for which we have data for the largest number of properties, i.e. we'd preferentially choose data points for which we have both enthalpy and density measurements over those for which we only have density measurements. In principle this lets us estimate the property sets with as few simulations as possible. In the above example, the chosen set with the new approach would be:

Densities: (285K, 1atm), (298K, 1atm), (310K, 1atm)
Dielectrics: (285K, 1atm), (298K, 1atm), (310K, 1atm)

thus requiring one fewer simulation (three rather than four). The small saving in this example quickly adds up when considering many different properties and substances.
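One way the idea could be sketched is shown below. This is not the implementation in this PR: the function name, the choice of target temperatures (lowest, midpoint, highest) and the 5 K `tolerance` window are all assumptions made for illustration. Around each target, the sketch prefers the temperature with the most measured properties and breaks ties by distance to the target.

```python
from collections import defaultdict

# Data from the example above; all points are at 1 atm, so a state point
# is represented by its temperature (K) only.
available = {
    "density": [285.0, 290.0, 298.0, 305.0, 310.0],
    "dielectric": [285.0, 298.0, 300.0, 305.0, 310.0],
}


def select_shared_state_points(available, tolerance=5.0):
    """Hypothetical sketch: pick up to three temperatures near the lowest,
    middle and highest of the range, preferring those with the most properties."""
    # Map each temperature to the set of properties measured there.
    coverage = defaultdict(set)
    for name, temperatures in available.items():
        for temperature in temperatures:
            coverage[temperature].add(name)

    lowest, highest = min(coverage), max(coverage)
    targets = [lowest, (lowest + highest) / 2.0, highest]

    selected = []
    for target in targets:
        nearby = [t for t in coverage if abs(t - target) <= tolerance]
        # If nothing lies within the tolerance, fall back to the nearest point.
        candidates = nearby or [min(coverage, key=lambda t: abs(t - target))]
        # Most properties first, then closest to the target temperature.
        best = min(candidates, key=lambda t: (-len(coverage[t]), abs(t - target)))
        if best not in selected:
            selected.append(best)

    return {t: sorted(coverage[t]) for t in selected}


state_points = select_shared_state_points(available)
for temperature, properties in state_points.items():
    print(temperature, properties)
```

For the example data this selects 285K, 298K and 310K, each with both a density and a dielectric measurement, so three simulations suffice. A real selection would also need to handle pressure and more than two properties.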

Status