openforcefield / open-forcefield-tools

Tools for open forcefield development
MIT License
8 stars 6 forks source link

Discuss: PropertyCalculator and PropertyEstimator frameworks? #1

Closed davidlmobley closed 8 years ago

davidlmobley commented 8 years ago

So, my group could go ahead and start doing things like running density calculations of solutions using OpenMM, but it would be good to cast this into the framework of what we will need for PropertyCalculator and PropertyEstimator, and these two will need to share some sort of data structure (i.e. in that PropertyEstimator will presumably be using information from existing simulations which have already been run in order to make estimates).

What information will these be taking in and returning, and what sort of data structure are we envisioning? We discussed them taking in an OpenMM system, I believe, and computing a specified property (Or would PropertyEstimator take in a perturbation to an existing system?). However, PropertyEstimator will at least need access to stored trajectory information among other things - likely some sort of library of what systems have been run already. Perhaps we'd be providing a set of OpenMM systems and the corresponding property/uncertainty estimates for those systems?

@mrshirts ? @jchodera ?

mrshirts commented 8 years ago

I would imagine that PropertyCalculator would information from the simulation summary data (volume, energies) and not the trajectory. Even the single-parameter reweighting (MBAR to calculate temperature derivatives) could be done with just current simulation. Property Estimator would likely involve recalculating the information from multiple trajectories.

jchodera commented 8 years ago

Apologies for not having a chance to prototype this sooner. I will see if I can put something together today.

The basic framework is just a few dozen lines, but we want to set up all the boilerplate to make it easy for multiple people to contribute.

@maxentile is investigating what is needed to bolt fancier Bayesian and ML sampling methods on to ForceBalance in parallel.

davidlmobley commented 8 years ago

@mrshirts - have you (as the reweighting/efficiency guru) given any thought as to how we would organize the trajectories for PropertyEstimator? I can imagine a range of solutions ranging from having some sort of database structure planned out, to asking the user just to specify paths to a set of trajectories and associated parameters (etc.).

@jchodera - agreed that it should initially be brief but we want something we can expand to cover a lot and have it be easy for many people to contribute.

jchodera commented 8 years ago

For now, I think the simplest approach is to make a new NetCDF or HDF5 file for each simulation we perform, and sequentially number them so that the most recent files have a higher number. Each NetCDF/HDF5 file can cache the parameters that were used along with the System object, energies, snapshot frames, box coordinates, etc. This way, we can also easily run and process in parallel.

Once this gets beyond the simple test stage and gets more serious, we may want to use a parallel key/value database like cassandra, and a distributed task management system like celery with redis. We have a working prototype of that here, but it's likely more complexity than we need at the moment.

mrshirts commented 8 years ago

For now, I think the simplest approach is to make a new NetCDF or HDF5 file for each simulation we perform, and sequentially number them so that the most recent files have a higher number.

One thing to consider is that if we are running a number of simulations, then "most recent" might not be a meaningful measure. The order in which the simulations start might not be the same as the order in which they finish. So perhaps if we just make sure they have a unique identifier, and all of the simulations that are left in a given folder are analyzed. One could also maintain lists of which of them should be identified, which associated each unique ID with the parameters that were used perform runs, and only some subset of the trajectories could be rerun based on various criteria. In the short term, probably rerunning everything would be fine.

Once this gets beyond the simple test stage and gets more serious, we may

want to use a parallel key/value database like cassandra http://cassandra.apache.org, and a distributed task management system like celery http://www.celeryproject.org with redis https://www.google.com/search?client=safari&rls=en&q=redis&ie=UTF-8&oe=UTF-8. We have a working prototype of that here https://github.com/choderalab/gbff, but it's likely more complexity than we need at the moment

Or something like this could provide the information associating each trajectory ID with the metadata.

davidlmobley commented 8 years ago

For now, I think the simplest approach is to make a new NetCDF or HDF5 file for each simulation we perform, and sequentially number them so that the most recent files have a higher number. Each NetCDF/HDF5 file can cache the parameters that were used along with the System object, energies, snapshot frames, box coordinates, etc. This way, we can also easily run and process in parallel.

OK, I agree that this sounds like a great approach for the time being. I haven't used these formats much yet - if they can store the parameters and System object as well as all those other details it will simplify things a lot, as then we have much much less file bookkeeping to do.

I agree with Michael that recency may or may not be that useful, though it will certainly be worth knowing in some cases. We'll probably ultimately end up basing a lot on "closeness of parameters" in some sense - probably more so on that than on recency.

davidlmobley commented 8 years ago

On this, @mrshirts and @jchodera , I understand there was some discussion of whether we need to bake the ability to do things like calculate hydration free energies into the framework, or whether we should focus on properties that can be calculated from single simulation rather than a set. Should we perhaps have:

davidlmobley commented 8 years ago

And perhaps PropertyCalculator would just call out to AlchemicalPropertyCalculator if the property asked for is one that doesn't come from an individual physical system, so the user doesn't have to worry about this?

mrshirts commented 8 years ago

I think it makes sense to have two code paths. Could even have a AlchemicalPropertyCalculator and SimplePropertyCalculator that the PropertCalculator calls, then we can decide later how to implement the AlchemicalPropertyCalculation.

With a density calculator, we can then move forward on implementing other versions.

jchodera commented 8 years ago

Let me briefly talk to Patrick and Josh on Monday about this.

J On May 20, 2016 8:53 PM, "Michael Shirts" notifications@github.com wrote:

I think it makes sense to have two code paths. Could even have a AlchemicalPropertyCalculator and SimplePropertyCalculator that the PropertCalculator calls, then we can decide later how to implement the AlchemicalPropertyCalculation.

With a density calculator, we can then move forward on implementing other versions.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/open-forcefield-group/open-forcefield-tools/issues/1#issuecomment-220749097

davidlmobley commented 8 years ago

@jchodera - Just a reminder to talk to Patrick and Josh on this. I know you were traveling and it's easy to get backlogged and have things slip through the tracks.

jchodera commented 8 years ago

Thanks! I have some notes/questions I will share via a GitHub issue to nucleate discussion.

J

On Mon, May 23, 2016 at 11:40 AM, David L. Mobley notifications@github.com wrote:

@jchodera https://github.com/jchodera - Just a reminder to talk to Patrick and Josh on this. I know you were traveling and it's easy to get backlogged and have things slip through the tracks.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/open-forcefield-group/open-forcefield-tools/issues/1#issuecomment-221013412

John D. Chodera Assistant Faculty Member, Computational Biology Memorial Sloan-Kettering Cancer Center email: j choderaj@mskcc.orgohn.chodera@choderalab.org office: 646.888.3400 fax: 510.280.3760 mobile: 415.867.7384 url: http://www.choderalab.org

davidlmobley commented 8 years ago

@jchodera - just a reminder to share these as soon as you can. Thanks.

jchodera commented 8 years ago

Is it possible for you to compile a list of data we want to include, both in the short term and long term?

For example:

Short term

Intermediate term

Long term

davidlmobley commented 8 years ago

Yes, let's involve @mrshirts and @bmanubay too, but the immediate stuff we have in mind is:

Short term:

Boldface for highest priority.

Intermediate term:

Long term:

jchodera commented 8 years ago

Great. If you can clean up that list a bit---or maybe put it on a github wiki page?---I can work from there on Wed to craft a proposed API.

The dielectric constant reminds me that we are probably missing some other fluctuation properties. Compressibilities? Bulk modulous? Heat capacities?

Do single point energies belong here too?

The general classes seem to be

Am I missing anything?

jchodera commented 8 years ago

Also, are the activity coefficients all at zero mole fraction of one component, or at finite nonzero mole fraction?

davidlmobley commented 8 years ago

The dielectric constant reminds me that we are probably missing some other fluctuation properties. Compressibilities? Bulk modulous? Heat capacities?

I do have heat capacity on the list. And, what specifically did you want cleaned up?

@bmanubay is working on a fairly extensive list of all properties that might be of interest and how they might be calculated, but we should go ahead with a proposed API ahead of that. Would it be helpful if I gave you a full list of what properties are in the ThermoML data we're looking at? I have that handy.

davidlmobley commented 8 years ago

Also, are the activity coefficients all at zero mole fraction of one component, or at finite nonzero mole fraction?

@bmanubay will have to comment on this. I THINK we're looking at infinite dilution activity coefficients.

bmanubay commented 8 years ago

@davidlmobley @jchodera

The activity coefficients are at infinite dilution of one component. Unfortunately, that is all of the activity coefficient data available from ThermoML.

Also, the pure solvent molar enthalpy in ThermoML refers to enthalpy of fusion.

davidlmobley commented 8 years ago

Hmm, then it sounds like we won't be using that in the near future unless I'm mistaken, as that would involve a crystal structure to calculate, yes?

On Tue, May 24, 2016, 10:47 PM bmanubay notifications@github.com wrote:

@davidmobley @jchodera https://github.com/jchodera

The activity coefficients are at infinite dilution of one component. Unfortunately, that is all of the activity coefficient data available from ThermoML.

Also, the pure solvent molar enthalpy in ThermoML refers to enthalpy of fusion.

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/open-forcefield-group/open-forcefield-tools/issues/1#issuecomment-221479028

davidlmobley commented 8 years ago

@jchodera - I see your point now about cleaning it up, since it's in flux here. I'll make a wiki page later and send it to you.

I'll also modify hydration free energy to solvation free energy, and add properties that can be derived from that such as relative solubility (same solute, different solvents) and partition coefficient. Log D would be a longer term option.

On Wed, May 25, 2016, 5:06 AM David Mobley dmobley@gmail.com wrote:

Hmm, then it sounds like we won't be using that in the near future unless I'm mistaken, as that would involve a crystal structure to calculate, yes?

On Tue, May 24, 2016, 10:47 PM bmanubay notifications@github.com wrote:

@davidmobley @jchodera https://github.com/jchodera

The activity coefficients are at infinite dilution of one component. Unfortunately, that is all of the activity coefficient data available from ThermoML.

Also, the pure solvent molar enthalpy in ThermoML refers to enthalpy of fusion.

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/open-forcefield-group/open-forcefield-tools/issues/1#issuecomment-221479028

jchodera commented 8 years ago

Would it be helpful if I gave you a full list of what properties are in the ThermoML data we're looking at? I have that handy.

Yes. If you want to the API to work for something, I need to know what you need it to eventually work for.

jchodera commented 8 years ago

see your point now about cleaning it up, since it's in flux here.

A list on the wiki for this repo would be great. Prioritize by short (immediate), medium, and long-term.

davidlmobley commented 8 years ago

@jchodera - I just gave you the list of everything we think we will calculate, and I'll put that on a wiki in a bit. But I'll also put there a full list of everything we've pulled from the database so you have a more complete idea of everything someone might be interested in.

On Wed, May 25, 2016, 5:50 AM John Chodera notifications@github.com wrote:

see your point now about cleaning it up, since it's in flux here.

A list on the wiki for this repo https://github.com/open-forcefield-group/open-forcefield-tools/wiki would be great. Prioritize by short (immediate), medium, and long-term.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/open-forcefield-group/open-forcefield-tools/issues/1#issuecomment-221565625

jchodera commented 8 years ago

I just gave you the list of everything we think we will calculate,

Where?

jchodera commented 8 years ago

Is that the list here plus the additional modifications from subsequent comments? Or did you go back and edit that to reflect the current list?

davidlmobley commented 8 years ago

@jchodera : https://github.com/open-forcefield-group/open-forcefield-tools/wiki/Physical-Properties-for-Calculation

Sorry, I was cooking breakfast for the kids when we exchanged info earlier and had about one minute to respond.

davidlmobley commented 8 years ago

I'm still taking care of morning logistics here but I'll update that wiki page shotrly w list of everything pulled from thermoml.

On Wed, May 25, 2016, 6:42 AM John Chodera notifications@github.com wrote:

Is that the list here https://github.com/open-forcefield-group/open-forcefield-tools/issues/1#issuecomment-221471357 plus the additional modifications from subsequent comments? Or did you go back and edit that to reflect the current list?

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/open-forcefield-group/open-forcefield-tools/issues/1#issuecomment-221579697

jchodera commented 8 years ago

Some further questions that impact API design:

Data source

Will we use ThermoML as a data source for some, most, or all of our data (that is not QM data)?

Representing molecular species

How did we decide to uniquely represent molecular species? IUPAC, canonical isomeric SMILES, CAS, InCHI, or something else?

Synchronous or asynchronous API?

I presume you want a synchronous API, rather than an asynchronous one, for simplicity. This would mean you call propertyCalculator.computePhysicalProperties(physical_properties, parameter_set) and it will block until all the properties are computed. This way, we can isolate you from whatever is going on in the back-end, which might call out to a cluster or Amazon EC2 to do some reweighting and some new simulations. This is made efficient by virtue of computing all the properties at once, rather than a single property at a time. The alternative is an asynchronous API where you have to keep checking if individual results are ready, which puts much more burden on whoever is using the API.

davidlmobley commented 8 years ago

@jchodera - OK, the wiki page is final as far as I'm concerned.

I reviewed the data from ThermoML listed in https://github.com/open-forcefield-group/open-forcefield-data/blob/master/Pure-Solvents/README.md and https://github.com/open-forcefield-group/open-forcefield-data/blob/master/Binary-Mixtures/README.md and basically the list I already have on the wiki covers everything except a very few things that I believe are straightforward reformulations of properties that are already listed there (i.e. for binary mixtures, excess molar volume should be calculable from the density of the mixture (which we have) and the density of the pure solvents).

davidlmobley commented 8 years ago

@jchodera Answers to your questions from your last post, as best as I can:

Will we use ThermoML as a data source for some, most, or all of our data (that is not QM data)?

Initially, ThermoML will be for all of the non-QM data. Later, I expect it will be for most of the non-QM data (i.e. we might measure our own densities, or start adding log P info), and eventually, only some of our non-QM data.

How did we decide to uniquely represent molecular species? IUPAC, canonical isomeric SMILES, CAS, InCHI, or something else?

The thought was that PropertyCalculator would take an already prepared System, I believe, and we would be using tagged .sdf files for molecular components. However, formats aside, or for the initial starting points - to uniquely identify components we should be using canonical isomeric SMILES. I can explain why if needed, but the short version is that canonical isomeric SMILES is the only reasonable solution aside from InCHI and InCHI is just much harder to work with.

Is the synchronous API reasonable, or do you need asynchronous?

Yes, I think so, though I must admit I've not dealt with the distinction much before. On this end we're initially going to be using our queueing system locally a lot for this, but it seems like that should be able to work fine with this. Is there a major advantage to asynchronous that I'm missing? Otherwise synchronous sounds more straightforward to use.

Would you want to compute multiple properties for multiple parameter sets at once?

I don't see us doing this in the short term, but I'm not seeing much further than that right now so you may actually have better insights into this than me. Remember, you're the "grand vision" guy and I'm the "bite sized pieces" guy. :)

Do we want to eventually be able to compute derivatives of properties with respect to parameters as well? This won't be in the initial implementation, but will impact the design if we allow for this later.

From our discussions, I think people are going to want to do that eventually. I don't see us doing it in the short term but it did come up several times.

mrshirts commented 8 years ago

Sorry for the delay -- home sick today and not quite functioning.

Would you want to compute multiple properties for multiple parameter sets at once?

I think this is something that we may want to.

Do we want to eventually be able to compute derivatives of properties with respect to parameters as well? This won't be in the initial implementation, but will impact the design if we allow for this later.

I think we might want to.

I'll try to respond when needed, though I suspect there will be a lot of sleeping today.

davidlmobley commented 8 years ago

@jchodera - did you want me to Wiki-ify these conclusions as well?

davidlmobley commented 8 years ago

I talked to @bmanubay as there was some confusion about whether enthalpy of vaporization was available. It is, so I've re-added it to the wiki (he had just not pulled it initially) under short-term.

https://github.com/open-forcefield-group/open-forcefield-tools/wiki/Physical-Properties-for-Calculation

I think the list is final.

jchodera commented 8 years ago

@jchodera - did you want me to Wiki-ify these conclusions as well?

No need! This is helpful.

The thought was that PropertyCalculator would take an already prepared System, I believe, and we would be using tagged .sdf files for molecular components.

I'm taking a step back from these implementation details so I can be sure to provide you with the absolute simplest set of functions to do what you need easily, and also make sure we can quickly code up the absolute simplest back-end to get you started but still allow us to scale this up to EC2 or XSEDE resources without you having to change any code.

I think the key here will be to define a few container data structures representing

jchodera commented 8 years ago

Working on drafting some documentation on the proposed API to deliver later today. Let me know if you think of any other properties or system types that aren't currently represented on the wiki.

jchodera commented 8 years ago

(whoops, wrong button!)

davidlmobley commented 8 years ago

@jchodera - thanks. Your involvement here is crucial and I really appreciate it. My big fear was that if we had to go ahead without you I'd do fine with the "code up something simple to get started" end of things but not be able to scale well/handle our larger-scale/longer-term plans. Responses:

I'm taking a step back from these implementation details so I can be sure to provide you with the absolute simplest set of functions to do what you need easily, and also make sure we can quickly code up the absolute simplest back-end to get you started but still allow us to scale this up to EC2 or XSEDE resources without you having to change any code.

Totally agreed, as noted above.

I think the key here will be to define a few container data structures representing

  • the composition of the physical system (neat liquid, mixture, solute in solvent)
  • the thermodynamic conditions at which the measurement was made
  • the type of measurement that was made on a given system at a given set of conditions
  • a set of measurements made on the same system
  • a set of measurements on many systems
  • an interface to compute many properties on many systems using one or more parameter sets that isolates you completely from the back-end reweighting and simulation

I totally agree with all of this, though the last one might be in the "keep it in mind but don't do anything towards that yet".

Another subtlety is that Bryce/Michael tell me the simplest way to calculate some things - like the excess enthalpy of a mixture - is to simulate three systems (pure solution A, pure solution B, and the specified mixture of A and B). So, what would we provide in that case? Specify all three systems but that we want the excess enthalpy of the specified mixture, or provide only the mixture and then PropertyCalculator construct new systems for the pure solutions? Both seem slightly cumbersome - the former requiring that we know how we're going to calculate it, and the latter because now PropertyCalculator has to know something about how the system was constructed or how new systems ought to be constructed. Hmm.

(Let me know if you want me to send the details they just e-mailed on how these would be calculated).

jchodera commented 8 years ago

I totally agree with all of this, though the last one might be in the "keep it in mind but don't do anything towards that yet".

The first version will just run simulations on a cluster and dump the datafiles in a directory, but you will probably be unhappy if you have to rewrite hundreds of lines of code every time we want to make a change to the back-end way we manage data and simulations. The goal here, again, is to make sure the design isolates you from these changes so that you don't have to change anything.

jchodera commented 8 years ago

Another subtlety is that Bryce/Michael tell me the simplest way to calculate some things - like the excess enthalpy of a mixture - is to simulate three systems (pure solution A, pure solution B, and the specified mixture of A and B).

That's OK. Each property will have an idea of the unique system compositions it needs to simulate to compute something. The interface to compute properties will figure out which system compositions are needed and first check which of these have been simulated before doing some reweighting; if more simulation data is required, it will automatically be generated. You won't need to worry about it.

To give you an idea of what I'm thinking of, here's a quick example (subject to change):

# Define the input datasets from ThermoML
thermoml_keys = ['10.1016/j.jct.2005.03.012', ...]
dataset = ThermoMLDataset(thermoml_keys)
# Filter the dataset
dataset.filter(min_temperature=280*unit.kelvin, max_temperature=350.kelvin)
# Load an initial parameter set
parameters = SMARTYParameterSet(initial_parameters_filename)
# Compute physical properties
estimator = PropertyEstimator(nworkers=10)
computed_properties = estimator.computeProperties(dataset, parmeters)
# Write out summary statistics about errors
for (computed, measured) in (computed_properties, dataset):
   property_unit = measured.value.unit
   print('%24s : experiment %8.3f (%.3f) | calculated %8.3f (%.3f) %s' % (measured.value / property_unit, measured.uncertainty / property_unit, computed.value / property_unit, computed.uncertainty / property_unit, str(property_unit))
   # write out error statistics...

This would give us the bare minimum we would need to do Monte Carlo in parameter space, but I'm trying to figure out how sophisticated the computed properties should be (e.g. if we later need to compute derivatives for gradient-based optimization or sampling techniques).

davidlmobley commented 8 years ago

This looks excellent.

One question: On this bit:

thermoml_keys = ['10.1016/j.jct.2005.03.012', ...]
dataset = ThermoMLDataset(thermoml_keys)

does a specific key specify only a specific property? I know a specific ThermoML "data set" in Kroenlein's terminology is a specific data type from a specific publication, but I'm unclear as to whether that's specified by the ThermoML key.

My concern is that we need to retain the ability to:

Perhaps this would all be handled by dataset.filter?

Some publications and data sets for example contain density data for pure solvents AND mixtures, and we will want to be able to test things like seeing how adding mixture data adds additional constraints to the parameters beyond what you get from looking at pure solvents alone. Also, we might want to deliberately look at only a few compounds from a particular data set initially, then include (or test on) others. Or, a data set might consist of some compounds within our initial test set but other compounds which are not, so we need to be able to pick and choose which compounds we look at.

davidlmobley commented 8 years ago

@bmanubay , @mrshirts , other input?

jchodera commented 8 years ago

We can certainly do whatever you feel is most convenient for pulling subsets of data from the ThermoML Archive. The main goal is to make it as convenient as possible to specify the dataset of interest. If you have feedback on how you'd like to specify the dataset, please let me know!

There will be a "low level" API, where you could specify all this data manually, like

composition = BinaryMixture(component1_iupac='water', mole_fraction1=0.2, component2_iupac='methanol')
thermodynamic_state = ThermodynamicState(pressure=500*unit.kilopascals, temperature=298.15*unit.kelvin)
property = ExcessPartialApparentEnergyProp(composition, thermodynamic_state, property_value=83.3863244*unit.kilojoules_per_mole, standard_uncertainty=0.1220794866*unit.kilojoules_per_mole)

but I have a feeling that it would be better to provide you with something more powerful than this if you are going to be ingesting most of your data through the ThermoML Archive.

bmanubay commented 8 years ago

I'd tend to agree with @davidlmobley on the matter of specifying a specific property. The 'filename' keys specify a unique article.

Some articles (luckily) contain data on multiple properties, so I think the dataset.filter would be a good place to handle specification of the property (or properties) as each property has a unique key.

jchodera commented 8 years ago

@bmanubay: Would you prefer to specify the list of properties up front, like

keys = ['10.1016/j.jct.2005.03.012', ...]
dataset = ThermoMLDataset(keys, properties=['ExcessPartialApparentEnergyProp'])

or later in a filtering stage?

keys = ['10.1016/j.jct.2005.03.012', ...]
dataset = ThermoMLDataset(keys)
dataset.filter(properties=['ExcessPartialApparentEnergyProp'])

or perhaps both?

davidlmobley commented 8 years ago

@jchodera : I like the proposed low-level API, that's perfect. And I agree that having something more powerful to pull directly from ThermoML is really going to be closer to what we want at the beginning, as long as we filter appropriately.

As far as filtering up front when we specify the dataset, or via the filter method, it might be helpful to have both options in some cases - i.e. I could imagine using this in a jupyter notebook session to pull up a data set and poke around to see what's in there aside from what I'd intended to look at (what properties or what temperatures) so I might be interested in first specifying the compound and then seeing what temperatures/pressures are there, or in first specifying a temperature and pressure range and seeing what compounds are there. BUT, that of course can be done elsewhere in other ways so it's not strictly necessary. For implementation speed perhaps we should just imagine that filter will take care of everything for now; I think this would still enable the usage I imagine as I could filter either by temperature or by compound or both to narrow in on what I want.

Minor adjustments to/points on the low-level API:

composition = BinaryMixture(component1_iupac='water', mole_fraction1=0.2, component2_iupac='methanol')

(a) Can we just settle on SMILES instead of names? (Or, maybe you're proposing being able to specify by EITHER component1_SMILES or component1_IUPAC, which is OK with me; I just insist on being able to use SMILES because I've wasted too many weeks of my life wrangling non-IUPAC "IUPAC" names) (b) We'll probably go with something more like MixtureSystem rather than BinaryMixture since there's no reason to make this specific to binary mixtures (see for example https://github.com/MobleyLab/SolvationToolkit/issues/53 where you commented previously and I'll be adjusting API as you previously suggested).

bmanubay commented 8 years ago

@jchodera I like the second example with filter as an attribute of dataset. I'll cede to David's suggestion though that having both might be good to begin with.

jchodera commented 8 years ago

(a) Can we just settle on SMILES instead of names?

Take a look at this example ThermoML dataset. Much to my dismay, the system components are specified this way:

  <Compound>
    <RegNum>
      <nOrgNum>1</nOrgNum>
    </RegNum>
    <sCommonName>water</sCommonName>
    <sFormulaMolec>H2O</sFormulaMolec>
    <Sample>
      <nSampleNm>1</nSampleNm>
    </Sample>
  </Compound>
  <Compound>
    <RegNum>
      <nOrgNum>2</nOrgNum>
    </RegNum>
    <sCommonName>methanol</sCommonName>
    <sFormulaMolec>CH4O</sFormulaMolec>
    <Sample>
      <nSampleNm>1</nSampleNm>
    </Sample>
  </Compound>

All we have are (1) the "common names" (water and methanol) and (2) the "molecular formulas" (H2O and CH4O). While we will use OEIUPAC to convert these to OpenEye canonical isomeric SMILES, and we can certainly allow you to specify SMILES inputs, at some point, we have to work with the data we have.

I think this also suggests we should try to convince Ken Kroenlein to augment these datasets with some "safer" ways of specifying compositions using something like canonical isomeric SMILES, CAS numbers, InCHI, or whatnot.

bmanubay commented 8 years ago

@davidlmobley @jchodera Agreed with John. Like I said previously, the component name is the only identifier inherently in the data. Everything else we've been using to sort so far (SMILES, CAS and InChIKey) I had to implement a way to generate.