openforcefield / open-forcefield-tools

Tools for open forcefield development
MIT License

Discuss: PropertyCalculator and PropertyEstimator frameworks? #1

Closed davidlmobley closed 8 years ago

davidlmobley commented 8 years ago

So, my group could go ahead and start doing things like running density calculations of solutions using OpenMM, but it would be good to cast this into the framework of what we will need for PropertyCalculator and PropertyEstimator, and these two will need to share some sort of data structure (i.e. in that PropertyEstimator will presumably be using information from existing simulations which have already been run in order to make estimates).

What information will these be taking in and returning, and what sort of data structure are we envisioning? We discussed them taking in an OpenMM system, I believe, and computing a specified property (Or would PropertyEstimator take in a perturbation to an existing system?). However, PropertyEstimator will at least need access to stored trajectory information among other things - likely some sort of library of what systems have been run already. Perhaps we'd be providing a set of OpenMM systems and the corresponding property/uncertainty estimates for those systems?

@mrshirts ? @jchodera ?

davidlmobley commented 8 years ago

Oh, yes, right, I forgot. That's terrible (especially given that Bryce has already found some of the names are corrupted by humans!) but yes, I agree. Ugh.

Kroenlein is gone until June 1, but we will have to get in touch with him then.

jchodera commented 8 years ago

@bmanubay : Should we be defining properties by their PropertyGroup names (e.g. VolumetricProp, ExcessPartialApparentEnergyProp) or by their ePropName name (e.g. Mass density, kg/m3)?

If by ePropName, are these standardized enough that we won't have, say, Mass density, kg/m3 and Mass density, g/cm3?

jchodera commented 8 years ago

@davidlmobley: For substances, should we have something like

liquid = NeatLiquid(iupac='water')
binary_mixture = BinaryMixture(component1_iupac='water', mole_fraction1=0.2, component2_iupac='methanol')
ternary_mixture = TernaryMixture(component1_iupac='ethanol', mole_fraction1=0.2, component2_iupac='methanol', mole_fraction2=0.2, component3_iupac='water')
infinite_dilution = InfiniteDilutionMixture(solvent='water', solute='phenol')

or should we try to generalize this to a single kind of container

liquid = Mixture()
liquid.addComponent('water')

binary_mixture = Mixture()
binary_mixture.addComponent('water', mole_fraction=0.2)
binary_mixture.addComponent('methanol') # assumed to be rest of mixture if no mole_fraction specified

ternary_mixture = Mixture()
ternary_mixture.addComponent('ethanol', mole_fraction=0.2)
ternary_mixture.addComponent('methanol', mole_fraction=0.2)
ternary_mixture.addComponent('water')

infinite_dilution = Mixture()
infinite_dilution.addComponent('phenol', mole_fraction=0.0) # infinite dilution
infinite_dilution.addComponent('water')

davidlmobley commented 8 years ago

Sorry, rephrasing/clarifying/fixing: Let's generalize as a single kind of container, actually. The only one that would make any sense to me to treat separately is NeatLiquid, but it would be simple enough to make this fully general. I'll handle this in SolvationToolkit.
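
A minimal sketch of what such a single container could look like, following the `Mixture` / `addComponent` usage proposed above. Everything beyond those two names is an assumption for illustration: the `mole_fractions` method, the rule that at most one component may omit its mole fraction, and the validation messages are hypothetical and not taken from SolvationToolkit or any existing API.

```python
class Mixture:
    """Hypothetical single container covering neat liquids, mixtures, and infinite dilution."""

    def __init__(self):
        # Each entry is (iupac_name, mole_fraction); mole_fraction may be None,
        # meaning "whatever remains of the mixture".
        self.components = []

    def addComponent(self, iupac_name, mole_fraction=None):
        if any(name == iupac_name for name, _ in self.components):
            raise ValueError("component %r was already added" % iupac_name)
        if mole_fraction is None:
            if any(x is None for _, x in self.components):
                raise ValueError("only one component may omit its mole fraction")
        elif not 0.0 <= mole_fraction <= 1.0:
            raise ValueError("mole fraction must lie in [0, 1]")
        self.components.append((iupac_name, mole_fraction))

    def mole_fractions(self):
        """Return {name: mole fraction}, filling in the unspecified remainder."""
        specified = sum(x for _, x in self.components if x is not None)
        has_remainder = any(x is None for _, x in self.components)
        if specified > 1.0 + 1e-9:
            raise ValueError("specified mole fractions exceed 1")
        if not has_remainder and abs(specified - 1.0) > 1e-9:
            raise ValueError("mole fractions do not sum to 1")
        return {name: (1.0 - specified if x is None else x)
                for name, x in self.components}


# A neat liquid is a mixture with a single, unspecified component.
liquid = Mixture()
liquid.addComponent('water')

ternary_mixture = Mixture()
ternary_mixture.addComponent('ethanol', mole_fraction=0.2)
ternary_mixture.addComponent('methanol', mole_fraction=0.2)
ternary_mixture.addComponent('water')  # takes the remaining fraction

# Infinite dilution is expressed as a solute with mole fraction 0.
infinite_dilution = Mixture()
infinite_dilution.addComponent('phenol', mole_fraction=0.0)
infinite_dilution.addComponent('water')
```

With this design the neat liquid needs no separate class: it is the one-component case, and the remainder rule gives it a mole fraction of 1.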

davidlmobley commented 8 years ago

(comment updated)

bmanubay commented 8 years ago

@jchodera I'd rather go by something like ePropName.

Also, yes, all of the units are standard (SI to be specific). I can print out a list of all of the properties contained in the ThermoML database later on if you'd like.

jchodera commented 8 years ago

For now, the only important thing to establish is whether we will have a one-to-one correspondence between our classes and the ePropName entries in ThermoML. If there really is only one kind of mass density name, then we are OK. If there are multiple, then it would be important to know that now.

I think I have enough information to start building the API draft now. Thanks again for all the rapid feedback!

bmanubay commented 8 years ago

@jchodera John, one thing that I missed yesterday during the API discussion. The "data set" keys aren't in the exact format you have written if they're pulled from ThermoML.

You wrote: keys = ['10.1016/j.jct.2005.03.012', ...]

From ThermoML they will appear as: keys = ['j.jct.2005.03.012', ...]

For whatever reason the actual DOI number is unreported, but the article abbreviation is sufficient to find the data.
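
A small sketch of converting between the two key formats, assuming (from the single example above) that the ThermoML key is simply the DOI with its registrant prefix such as `10.1016/` removed. The function name `short_key` is hypothetical, and the rule may not hold for every journal in the archive.

```python
def short_key(doi):
    """Return the part of a DOI after its registrant prefix (e.g. '10.1016/').

    Strings that do not look like a full DOI are returned unchanged, so keys
    that are already in the short form pass through untouched.
    """
    prefix, separator, suffix = doi.partition('/')
    if not separator or not prefix.startswith('10.'):
        return doi
    return suffix


keys = [short_key(k) for k in ['10.1016/j.jct.2005.03.012', 'j.jct.2005.03.012']]
print(keys)
```

Both inputs map to `'j.jct.2005.03.012'`, so a list of full DOIs and a list of short keys can be compared on equal footing.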

mrshirts commented 8 years ago

Some quantities are easier to calculate with NPT simulations, and some are easier to calculate with NVT simulations. There are always ways to get around that (for example, (dP/dT)_V can be calculated as - (dV/dT)_P / (dV/dP)_T ), but that might be something to think about specifying in the calculation.
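
For reference, the conversion mentioned here follows from the cyclic (triple product) rule for P, V, and T. The symbols \alpha (thermal expansion coefficient) and \kappa_T (isothermal compressibility) are standard definitions introduced only for this note:

```latex
\left(\frac{\partial P}{\partial T}\right)_V
\left(\frac{\partial T}{\partial V}\right)_P
\left(\frac{\partial V}{\partial P}\right)_T = -1
\quad\Longrightarrow\quad
\left(\frac{\partial P}{\partial T}\right)_V
= -\frac{\left(\partial V / \partial T\right)_P}{\left(\partial V / \partial P\right)_T}
= \frac{\alpha}{\kappa_T},
\qquad
\alpha = \frac{1}{V}\left(\frac{\partial V}{\partial T}\right)_P,
\quad
\kappa_T = -\frac{1}{V}\left(\frac{\partial V}{\partial P}\right)_T .
```

Both derivatives on the right-hand side can be estimated from NPT simulations, which is why an NVT quantity can be recovered without running at constant volume.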

jchodera commented 8 years ago

For whatever reason the actual DOI number is unreported, but the article abbreviation is sufficient to find the data.

@bmanubay: Is there a way to programmatically retrieve the XML file from the ThermoML Archive via the shorter key?

mrshirts commented 8 years ago

One question here is how automated we want the pull from the ThermoML archive. Right now, Bryce has the filtering of THIS data set fairly automated, but there's no way to guarantee that we'll have correct filtering from every data set. So we might want it to pull from some list of xml files that has the same format as ThermoML, but that allows us the option of post-processing.

bmanubay commented 8 years ago

@jchodera An example of the way the file name keys appear originally (no interaction besides pulling the web data) is:

/some/local/directory/j.jct.2005.03.012.xml

What I've been doing is stripping away the file path and .xml extension in order to just leave the abbreviated article name. We'll say these names are stored in a single column of our data frame (we'll call it df) and the column header is filename. To perform the 'string stripping' operation you could do the following:

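import os
df["filename"] = df.filename.map(os.path.basename)  # drop the directory path (str.lstrip strips a set of characters, not a prefix)
df["filename"] = df.filename.map(lambda x: x.replace(' ', '')[:-4])  # drop stray spaces and the '.xml' extension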

If I had a list of these names, say:

keys = ['j.jct.2005.03.012', ...]

and the data frame pulled from ThermoML (with the filename column formatted as above), I could write:

ind = df.filename.isin(keys)
df = df[ind]

You should be left with all of the data from the articles specified in keys.
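
A standard-library sketch of the same key extraction and filtering, without pandas. The helper name `key_from_path` and the second file path are hypothetical, chosen only to show why `os.path` is a safer choice than `str.lstrip` for removing a directory: `lstrip` removes any leading characters from the given set, not a literal prefix.

```python
import os


def key_from_path(path):
    """Turn '/some/dir/j.jct.2005.03.012.xml' into 'j.jct.2005.03.012'."""
    filename = os.path.basename(path.strip())     # drop the directory part
    key, _extension = os.path.splitext(filename)  # drop only the final '.xml'
    return key


paths = [
    '/some/local/directory/j.jct.2005.03.012.xml',
    '/some/local/directory/acs.example.xml',  # hypothetical file name
]

# str.lstrip treats its argument as a set of characters, so it also eats the
# leading 'a', 'c', and 's' of the second file name.
stripped = paths[1].lstrip('/some/local/directory/')

keys = ['j.jct.2005.03.012']
selected = [p for p in paths if key_from_path(p) in keys]
```

Here `stripped` comes out as `'.example.xml'` rather than `'acs.example.xml'`, while `key_from_path` handles both paths correctly and `selected` keeps only the first file.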

jchodera commented 8 years ago

Some quantities are easier to calculate with NPT simulations, and some are easier to calculate with NVT simulations. There are always ways to get around that (for example, (dP/dT)_V can be calculated as - (dV/dT)_P / (dV/dP)_T ), but that might be something to think about specifying in the calculation.

The user will be isolated from this by having the PropertyCalculator figure out what the best strategy is. The most important thing is to capture the experimental conditions under which the measurement was made.

jchodera commented 8 years ago

@bmanubay: What about retrieval from the ThermoML Archive online directly? Is there a way to look up the URL from the short key alone, or is locating the XML file only possible once you have retrieved a full local copy of the Archive?

bmanubay commented 8 years ago

@jchodera I'm unsure of a way to do it programmatically. I'd have to look into that further. You can certainly find the articles via Google search from the short key.

I'll have to do a little fiddling with the ThermoPyL library to give a satisfying answer though.

davidlmobley commented 8 years ago

@bmanubay - I don't actually think he's asking about the articles; I think he's asking about getting the XML file. Can the XML file be retrieved without having the full ThermoML archive locally? That is, if we already have a list of the IDs we want (say, you hand it to me), do I have to pull the whole archive again, or can I just retrieve the XML files for the specific IDs I want?

bmanubay commented 8 years ago

@davidlmobley - Got it! Still unsure. I'll have to see what I can do with the ThermoPyL library.

jchodera commented 8 years ago

This may be another question for Kroenlein. For now, we'll retrieve the whole ThermoML Archive and search for the articles we want, but it would be useful to be able to compile the subset we want and have the tool download only the necessary data.

Note that if you have the whole DOI, you can easily form the URL as http://trc.boulder.nist.gov/ThermoML/ + DOI, as in:

http://trc.boulder.nist.gov/ThermoML/10.1016/j.jct.2005.03.012
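
A minimal sketch of that URL construction. The base URL is the one quoted above; the function name `thermoml_url` and the check that the input looks like a full DOI are assumptions added for illustration, and the sketch does not test whether the URL still resolves. Short keys are rejected because, as discussed in this thread, no lookup from the short key alone has been established.

```python
THERMOML_BASE_URL = 'http://trc.boulder.nist.gov/ThermoML/'


def thermoml_url(doi):
    """Form the ThermoML Archive URL for a full DOI such as '10.1016/j.jct.2005.03.012'."""
    prefix, separator, suffix = doi.partition('/')
    if not (prefix.startswith('10.') and separator and suffix):
        raise ValueError('expected a full DOI, got %r' % doi)
    return THERMOML_BASE_URL + doi


url = thermoml_url('10.1016/j.jct.2005.03.012')
print(url)
```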

davidlmobley commented 8 years ago

@jchodera - any headway on API draft for this? Is there anything you need before you can do it?

jchodera commented 8 years ago

I've opened a PR to document the proposed API:

https://github.com/open-forcefield-group/open-forcefield-tools/pull/3

davidlmobley commented 8 years ago

I think this is closed by the API draft in https://github.com/open-forcefield-group/open-forcefield-tools/pull/3 ?

jchodera commented 8 years ago

Yep!