Initial Molecule Choices: My Few Thoughts

bmanubay commented 8 years ago

Hey everyone,

I hope you're all not too overwhelmed with all the changes to the GitHub repo lately! I've ran into a lot of extra time to focus on this work and brainstorm with @mrshirts since the end of the semester and I have a few thoughts on which molecules we might want to start with for the parameterization process. I'd like to preface this discussion with, "Please take some time to look through what I've changed in my last few merges to the main repo. Pay close attention to the 'Initial Molecule Choices' directory as there is a lot of new information there which is hopefully explained in a succinct and accessible manner." And with that, please read on with an open (yet critical) mind.

My thoughts on potential choices are binary, depending on what type of diversity in molecular structure we'd like to begin with. I'll split these potential choices into those excluding aromatic bonds (set "XAr") and those including aromatic bonds (set "Ar"). Given the wide spread of data across properties for the highest ranked (highest number of total data points) on 'allcomp_counts_interesting.csv' in the 'Initial Molecule Choices' directory, it should be sufficient to start with as few as 5 molecules for either set.

Set "XAr": water, ethanol, 1-butanol, heptane, methyl tert-butyl ether -Why? -Significant data coverage across all individual species (see 'allcomp_counts_interesting.csv') -Fair data coverage for mixture combinations (see 'mix_counts_interesting.csv')

Set "Ar": water, 1-butanol, heptane, methyl tert-butyl ether, toluene -For the same reasons as the "XAr" set

Note that some of the properties have a very small range of molecules (or combination of molecules) for which there is data. Therefore, it is not possible to cover every property of interest with the suggested sets above. An alternative method to remedy this, depending on how many molecules we would be willing to start with, would be to go through the top pure solvents and mixtures per property and use all molecules in that list. It is very likely that the diversity remains high even then.

davidlmobley commented 8 years ago

I'll be meeting with Christopher Bayly this week to go over your ideas/this set and try and get back to you. Currently we're thinking this is likely to happen tomorrow night.

davidlmobley commented 8 years ago

So, Christopher and I spent a good chunk of time last night looking through what's in the "allcomp_counts_diverse" set last night, and we have some feedback on what we want in the initial set below. For some reason I didn't have this issue up when we were doing it (I think because there are so many different places the data is described, i.e. README.md files, e-mails, issues, etc.) so it was hard to find the best place to start. So, I'll start with what we found last night, then I'll respond to your comment above. I have remarks in a couple specific areas which I'll put in bold.

We think you should go a little further via a programmatic approach to picking data

Ignore sparse data types for now, particularly vmol and cpmol, the latter of which will be harder to calculate and is sparse
Remove/merge all duplicates; as noted by e-mail your 77 compounds in the allcomp_counts_diverse set is really only 66 unique SMILES/compounds from what we can tell
Remove all compounds which do not have (much) prioritized data below, with certain exceptions for prioritized compounds (see below)

We think much of the remaining data will ultimately be interesting, but some can be prioritized

Our prioritized list of data types are (for pure solutions) density and molar enthalpy and (for binary mixtures) density and excess molar enthalpy.
We are also very interested in gathering the dielectric constant data/getting it curated so we can later use it, but we will not use it initially
I want to investigate how the speed of sound is calculated (or someone can explain); it could be interesting but again this is for later.

As noted above, this section on prioritizing could be used to further reduce your set by basically saying we only want compounds (except see below) which have a substantial amount of density data for both pure and mixtures, AND a substantial amount of enthalpy data for mixtures and pure solutions.

There are certain compounds which are particularly high priority to include Some compounds include steric constraints/clashes, excess hydrogen bond donors or acceptors, and other things that make them particularly interesting. Christopher identified a list of compounds he very much wants to make sure we have data on:

2,2,4-trimethylpentane
cycloheptane
diisopropylether/isopropyl_ether
dimethoxymethane
2,3-dimethylbutane and 2,2-dimethylbutane
3-methylpentane
neohexane
4-methyl-2-pentanol and 2-methyl-2-pentanol
1,1-diethoxyethane
tert-butanol (excess donor -- too constrained to accept)
tetrahydrofuran

On mixtures, we agree that alkane-alkane mixtures are not particularly interesting and basically can be excluded.

Some other random requests/notes

Can you check for data on 1,2-dimethoxyethane? Christopher very much wants this but we do not see it in the set. (It is a good model for PEG)
Mixtures - we want to be sure to include some involving tert-butanol because it is an excess donor; mixtures of tert-butanol and THF (which there is not data listed presently) would be something to study eventually, though we might need to do new measurements for that.
We didn't look at how the filtering was done, so it's worth remarking that we don't need a huge amount of independent measurements of the same properties in the same very narrow temperature/pressure range so possibly the number of data points could be filtered down that way?

davidlmobley commented 8 years ago

OK, so given all that, a key question is how much data we can actually ask him for. Maybe we can have him run the "red flag check" (Michael mentions here: https://openforcefieldgroup.slack.com/archives/datasets/p1463580427000061) on everything we have left after the "programmatic approach" discussed above, and then use that to actually pick specific data points to ask for uncertainties in?

We are going to need uncertainties for everything we proceed with, so a key question is exactly how many uncertainties he can get us and how fast, and whether this is a "by individual data point" investment of effort or a "by research paper" or "by research group" effort or something else. ( @mrshirts and @bmanubay ) In other words, we need to know what the uncertainties cost and what our budget is.

davidlmobley commented 8 years ago

@bmanubay to answer your specific questions from the post above:

...in the 'Initial Molecule Choices' directory, it should be sufficient to start with as few as 5 molecules for either set.

Set "XAr": water, ethanol, 1-butanol, heptane, methyl tert-butyl ether -Why? -Significant data coverage across all individual species (see 'allcomp_counts_interesting.csv') -Fair data coverage for mixture combinations (see 'mix_counts_interesting.csv')

Set "Ar": water, 1-butanol, heptane, methyl tert-butyl ether, toluene -For the same reasons as the "XAr" set

Note that some of the properties have a very small range of molecules (or combination of molecules) for > which there is data.

OK, so possibly these sets could be interesting except we'd want to swap in at least a couple of this list instead:

2,2,4-trimethylpentane cycloheptane diisopropylether/isopropyl_ether dimethoxymethane 2,3-dimethylbutane and 2,2-dimethylbutane 3-methylpentane neohexane 4-methyl-2-pentanol and 2-methyl-2-pentanol 1,1-diethoxyethane tert-butanol (excess donor -- too constrained to accept) tetrahydrofuran

But, really, the approach you're describing is NOT what we're envisioning at present. It's going to be much easier for us to do more molecules for few properties than more properties for few molecules. So we'd rather have a larger set of solvents and mixtures and fewer properties (initially) than to have more properties and fewer solvents/mixtures. Later this will change, but our very first tests will be solely density, then we will incorporate enthalpy data next. After those, we will branch in other directions.

So, really, we want more (than five) molecules but less data point types. Make sense?

mrshirts commented 8 years ago

Later this will change, but our very first tests will be solely density, then we will incorporate enthalpy data next. After those, we will branch in other directions.

So, I want to push back on this a little. It's very clear that density is very sensitive to some parameters, and not at all sensitive to other parameters. I think that we really do want to try to look at 2 types of data to start; I think that will give us a much better sense of the constraints on the parameters.

davidlmobley commented 8 years ago

So, I want to push back on this a little. It's very clear that density is very sensitive to some parameters, and not at all sensitive to other parameters. I think that we really do want to try to look at 2 types of data to start; I think that will give us a much better sense of the constraints on the parameters.

This is actually exactly what we want! The first pass here is basically to show how the machinery works and what it does and doesn't do. So we want to show that including the density data does push some of the parameters, and not others, and that it leaves some of the parameters very unconstrained (like what Alan Mark just talked about). Then we want to show how including additional data further constrains parameters and pushes things further from where we started.

So, we deliberately want to start with only one type of data that leaves some parameters unconstrained to make it easier to show this. It's part of ours sales pitch, basically. We want to show how to automatically do what Alan Mark was doing by hand.

davidlmobley commented 8 years ago

To put that another way - if we were initially trying to come up with good constraints on as many parameters as possible, then we would absolutely want to do what you're describing. But instead, our goal at this point is to demonstrate what this can do, how it can automate this, and to set up the framework. We are not trying over the summer to actually come up with better parameters for anything, nor is it really important that we constrain them that tightly.

bmanubay commented 8 years ago

Great, this is exactly the type of feedback that I wanted! I'll get right on looking at the diverse molecule set for the narrower range of properties ASAP! I'll start with the density, enthalpy and dielectric data.

I've gotten pretty proficient with using pandas, so these type of dictionary searches on the compounds you mentioned should be simple. Thanks folks!

One quick question, which I expect to get an unsatisfying/arbitrary answer for now. What do we think would be a "sufficient amount of data" for a compound to make any of these lists?

davidlmobley commented 8 years ago

@bmanubay - I talked to Michael over lunch and we anticipate we would probably use heat capacity data (even though it is more sparse) relatively soon too, i.e. perhaps we'd first do density and then in parallel try enthalpy and heat capacity. I wanted to add that since it's different from what I said above.

One quick question, which I expect to get an unsatisfying/arbitrary answer for now. What do we think would be a "sufficient amount of data" for a compound to make any of these lists?

I don't think it takes much - for example, if we had a single 300K/atmospheric pressure data point for a particular compound for each of pure solvent density, enthalpy, and heat capacity, and for the equivalent binary mixture properties, that would potentially be interesting to me. And I'd even be OK with more sparse than that. Key is having it across properties though.

But a key factor will be what our "budget" is for the uncertiainties -- any data we are going to use, we will need uncertainties for. So, if we can easily get uncertainty estimates for, say, a few hundred data points or less then we will pick those data points very carefully.

bmanubay commented 8 years ago

But a key factor will be what our "budget" is for the uncertiainties -- any data we are going to use, we will need uncertainties for. So, if we can easily get uncertainty estimates for, say, a few hundred data points or less then we will pick those data points very carefully.

Understood! I sent that email to Ken and am awaiting what he has to say.

openforcefield / open-forcefield-data

Initial Molecule Choices: My Few Thoughts #13