Datasets for open forcefield parameterization and development
ThermoML data were compiled and filtered using the ThermoPyL tool developed by the Chodera Lab at MSKCC (https://github.com/choderalab/thermopyl)
FILTER PROCEDURE:
Pull full ThermoML archive
Discard known erroneous data (j.fluid.2013.12.014 is the only known case so far)
Define properties of interest to pass filter
Allow only C, O and H atoms to pass
Generate SMILES strings from component names (CIRpy module, which queries the NIH Chemical Identifier Resolver)
Filter out SMILES strings containing "=" or "#" (removes compounds with explicit double or triple bonds)
Generate CAS registry numbers from component names (CIRpy)
Apply temperature and pressure filters (250 K - 400 K and 1 atm - 1000 atm)
Keep only liquid phase data points
Separate final large dataframe into subframes by property of interest
    a. Remove data with no associated uncertainties from subframes
Generate counts by component and journal article for all dataframes
Save each dataframe as a separate .csv text file
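The bond, temperature, pressure, phase and uncertainty steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the ThermoPyL code itself: the record keys (`smiles`, `temperature_K`, `pressure_atm`, `phase`, `property`, `uncertainty`), the function names, the phase label "Liquid", the inclusive range bounds and the sample values are all assumptions made up for the example.

```python
from collections import defaultdict

# Filter windows taken from the procedure above; treating the bounds as
# inclusive is an assumption.
T_MIN_K, T_MAX_K = 250.0, 400.0
P_MIN_ATM, P_MAX_ATM = 1.0, 1000.0


def is_single_bonded(smiles):
    """True if the SMILES string has no explicit double ('=') or triple ('#') bond."""
    return "=" not in smiles and "#" not in smiles


def passes_filters(rec):
    """Apply the bond, temperature, pressure and phase filters to one record."""
    return (
        is_single_bonded(rec["smiles"])
        and T_MIN_K <= rec["temperature_K"] <= T_MAX_K
        and P_MIN_ATM <= rec["pressure_atm"] <= P_MAX_ATM
        and rec["phase"] == "Liquid"  # hypothetical phase label
    )


def split_by_property(records):
    """Group passing records by property, dropping those with no uncertainty."""
    groups = defaultdict(list)
    for rec in records:
        if passes_filters(rec) and rec["uncertainty"] is not None:
            groups[rec["property"]].append(rec)
    return dict(groups)


# Made-up placeholder records (hypothetical layout, not real ThermoML data).
records = [
    {"smiles": "CCO", "temperature_K": 298.15, "pressure_atm": 1.0,
     "phase": "Liquid", "property": "density", "uncertainty": 0.1},
    {"smiles": "C=C", "temperature_K": 298.15, "pressure_atm": 1.0,
     "phase": "Liquid", "property": "density", "uncertainty": 0.1},
    {"smiles": "CCO", "temperature_K": 450.0, "pressure_atm": 1.0,
     "phase": "Liquid", "property": "density", "uncertainty": 0.1},
    {"smiles": "CCOCC", "temperature_K": 298.15, "pressure_atm": 1.0,
     "phase": "Liquid", "property": "heat_capacity", "uncertainty": None},
    {"smiles": "CO", "temperature_K": 300.0, "pressure_atm": 1.0,
     "phase": "Gas", "property": "density", "uncertainty": 0.1},
]

subframes = split_by_property(records)
for name, rows in subframes.items():
    print(name, [r["smiles"] for r in rows])
```

Only the first record survives here: the others fail the bond, temperature, uncertainty and phase checks respectively. Note that a plain character test for "=" and "#" does not catch aromatic rings written in lowercase SMILES notation (for example `c1ccccc1`), so those would need a separate check if they are to be excluded.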
Christopher I. Bayly developed a toy dataset of potential molecules of interest, which is deposited in the "AlkEthOH_distrib" subdirectory of the "Model Systems" directory. Construction of this set is described in the README.txt there, which should be converted to Markdown.
The set is an attempt to include molecules that together use all the parameters in the smirnoff99Frosst force field.