Initial API outline proposal

jchodera commented 8 years ago

This PR begins the process of documenting the API proposed in https://github.com/open-forcefield-group/open-forcefield-tools/issues/1

See the README.md for the proposed API.

jchodera commented 8 years ago

@pgrinaway and @maxentile: You two will presumably be responsible for implementing the Bayesian posterior sampling techniques. Can you take a look at the proposed API in the README.md in this PR and see if this would work for you?

davidlmobley commented 8 years ago

Minor etiquette question - if I want to propose very minor tweaks, should I be commenting here or should I be submitting a PR to your fork?

In any case, here are my comments.

REGARDING MIXTURES: In general, this looks good. It's very consistent with what we are doing in SolvationToolkit, though this focuses on setting up a SPECIFIC system (a MixtureSystem) with a number of molecules that the user defines or which it estimates. Here, we're thinking about how to define a mixture which as yet has an undefined number of molecules.

You ask:

Should we instead allow the user to specify one molecule of the solute, as in infinite_dilution = Mixture() infinite_dilution.addComponent('phenol', number=1) # one molecule infinite_dilution.addComponent('water')

My answer is "no". In SolvationToolkit, we do allow specification of number but it makes more sense there since the system already consists of a particular number of molecules. Here I don't think it makes sense, as the ONLY number which would make sense is number = 1.

So, my answer is that we would just say that a Mixture consists of a specified set of components which have mole fractions which are floating point numbers, with special cases of 0.0 (meaning infinite dilution) and unspecified, meaning "the remainder of the system".

\ To answer your questions: **

Is the concept of Mixture sufficiently general that we don't need further types of substances?

In my view, yes. The one concern I would have would be if we wanted to look at dimer formation or something. But, I don't see that we would have access to data on that in mixtures, so I'm not worried about that now. Membrane permeability would be another issue, but again, I think those would require a rather different framework. So, this sounds good to me as is.

Is it OK that we don't specify the total number of molecules in the substance, and only specify the fractional composition? We would have to specify the total number of molecules in the PropertyCalculator instead.

This is OK with me. I'm not totally clear on the incentive for splitting things this way, other than that you're relating this to the experiment, so the number of molecules would not yet be relevant to simulation.

* THERMODYNAMIC STATES *

Agreed, and using simtk.unit sounds like the best option for now to me, especially since the other omnia tools we'll be involving seem to mostly use that. Switching would be undue headache for no good reason, IMO.

\ MEASUREMENT METHODS **

I'm not totally sure where this is going, and have some concerns there are going to be too many possibilities here, but this does seem to be a good framework and this info will be good to hang on to.

* PHYSICAL PROPERTY MEASUREMENTS *

This looks beautiful!

Should there also be a .notes property for PhysicalPropertyMeasurement? i.e., one might want to annotate a measurement if questions are raised about it (pointing out an alternate reference which disagrees, etc.). Or maybe .XMLnotes? Then we would have some flexibility to later add tags which would allow for machine parsing of notes so that we could, for example, flag particular measurements as questionable, add references providing contrary data, and so on? My limited experience with FreeSolv and with SAMPL is that people inevitably point out issues with measurements and we need some way of maintaining a connection between these and the measurements themselves to facilitate further study when needed. SOME sort of notes field would at least provide room for us to later systematize how to do this.

(No comments on some intervening sections here)

\ ESTIMATING PROPERTIES **

I thought we said that PropertyCalculator was the main thing that would calculate properties, and it might call out to PropertyEstimator (very similar API) when it decides reweighing is appropriate?

It seems like at this point you're not providing any info about what this estimator/calculator would return, which is probably key. Obviously it would return a value and associated uncertainty, but should it also provide details about how this was obtained, such as what length of simulation was used and so on, i.e. via a CalculationMethod subclass which mirrors ExperimentMethod? And also perhaps details of where the data is stored?

* API USAGE *

Looks very nice.

davidlmobley commented 8 years ago

I was thinking about this more, and had to look back at it to confirm that you'd thought through the case where we might want to group multiple physical property calculations into a single call to PropertyCalculator or PropertyEstimator - but you did, which is excellent.

The only other thing that has come to mind is what about the case where we have archived simulation data that would be utilized by PropertyCalculator and/or PropertyEstimator. What are your thoughts, @jchodera , on how that would fit into this framework? Should there be another class, ArchivedSimulations, or something, that would perhaps be provided when initializing a PropertyEstimator that gives access to any archived results that might or might not be relevant?

davidlmobley commented 8 years ago

@mrshirts and @bmanubay should also have a look since they will certainly be involved.

bmanubay commented 8 years ago

@davidlmobley brings up a good point with the archived simulation data. I personally like the idea of having ArchivedSimulations as a function of whatever might be the most appropriate class, i.e:

import PropertyEstimator

dat = PropertyEstimator.ArchivedSimulations(path, filename, **kwargs)

I guess if the archived simulation would be utilized by multiple classes I could see it being an independent class then.

jchodera commented 8 years ago

I was thinking about this more, and had to look back at it to confirm that you'd thought through the case where we might want to group multiple physical property calculations into a single call to PropertyCalculator or PropertyEstimator - but you did, which is excellent.

@davidlmobley : The bigger question is whether you need to compute the observables with respect to multiple parameter sets in a single call. As of now, this capability is not available in the draft API.

The only other thing that has come to mind is what about the case where we have archived simulation data that would be utilized by PropertyCalculator and/or PropertyEstimator. What are your thoughts, @jchodera , on how that would fit into this framework? Should there be another class, ArchivedSimulations, or something, that would perhaps be provided when initializing a PropertyEstimator that gives access to any archived results that might or might not be relevant?

The API was designed such that you do not need to worry about this. All of this is hidden from you. You will not know (or need to specify) whether new simulations are run or existing simulations are reweighted to give you new property estimates---all that is done under the hood to give us maximum flexibility to experiment with making that efficient and transparent. We may add optional parameters that give you some control over the error tolerance allowed for reweighting before a new simulation is run, but by default, you will not need to worry about it.

The initial implementation will do something very simple:

Store each simulation in a separate NetCDF file
Initially attempt to reweight everything from existing data
For properties where the error is large, new simulations will be run on a cluster

Because you're asking for a bunch of properties at a time, this process can be made reasonably efficient.

Later on, we can get very fancy about how we handle the scheduling, data management, reweighting, cloudbursting, use of XSEDE resources, etc. But for now, I can implement something simple, and the same API will still work later on as we speed things up on the backend.

jchodera commented 8 years ago

Minor etiquette question - if I want to propose very minor tweaks, should I be commenting here or should I be submitting a PR to your fork?

Either way! If you want to submit a PR to my fork, that is a quick way to make changes. But I'm happy to just work through your suggestions today if that is easier.

jchodera commented 8 years ago

@pgrinaway and @maxentile: Last chance for you to give comments here.

davidlmobley commented 8 years ago

@jchodera : Everything sounds great. Just wanted to follow up on this:

The bigger question is whether you need to compute the observables with respect to multiple parameter sets in a single call. As of now, this capability is not available in the draft API.

This sounds good to me as is.

and on this:

The API was designed such that you do not need to worry about this. All of this is hidden from you. You will not know (or need to specify) whether new simulations are run or existing simulations are reweighted to give you new property estimates---all that is done under the hood to give us maximum flexibility to experiment with making that efficient and transparent. We may add optional parameters that give you some control over the error tolerance allowed for reweighting before a new simulation is run, but by default, you will not need to worry about it.

I totally agree that we don't want to know or specify whether more simulations are run. I was just asking whether, in the interests of not worrying about this, we should have a component of the API that tells the interface where it ought to look to check and see what data is already available.

I'm just imagining a scenario where, say, I'd run some big batch of calculations initially using a local cluster and I want to migrate to Amazon or your cluster or a different local cluster. Precisely in the interests of not worrying about this I'd like some way to make sure that any calls I make to the property calculator in the new environment know about the old simulations so I don't end up spending a huge amount of time re-running things that are already available. As you note, I don't want to have to think about it, provide info on what has already been run, etc., but I DO want to have a simple way to *ensure that PropertyEstimator has access.

Maybe in your view migration to a new cluster or the cloud would mean something more like one of the following:

copying a set of files from one cluster to the other in a pre-defined directory structure
applying some other MigrateData tool which would migrate the directory structure or something along those lines?

Summarizing: I'm not after worrying about how data is archived, but I am after ensuring I can migrate existing data between environments without having to worry about how it is archived. It seems like this means that EITHER we have to tell PropertyEstimator where to look, OR we're saying we'll have a predefined directory structure which it will look in. Otherwise portability would be impossible, presumably.

jchodera commented 8 years ago

I totally agree that we don't want to know or specify whether more simulations are run. I was just asking whether, in the interests of not worrying about this, we should have a component of the API that tells the interface where it ought to look to check and see what data is already available.

The PropertyEstimator() constructor will presumably take some arguments that specify what resources to bind to (such as the cluster to connect to, the EC2 account to use, the name of the directory or database holding data, etc.). Those are yet to be worked out. But this information would not be passed during the estimator.computeProperties(dataset, parameters) call, and you would only want to create one PropertyEstimator() to use throughout many such calculations.

'm just imagining a scenario where, say, I'd run some big batch of calculations initially using a local cluster and I want to migrate to Amazon or your cluster or a different local cluster. Precisely in the interests of not worrying about this I'd like some way to make sure that any calls I make to the property calculator in the new environment know about the old simulations so I don't end up spending a huge amount of time re-running things that are already available. As you note, I don't want to have to think about it, provide info on what has already been run, etc., but I DO want to have a simple way to *ensure that PropertyEstimator has access.

I think discussing explicit data migrations strategies from one backend to another is premature. We'll find some way to handle this when we come to it. We could, for example, separate the database server that holds the simulation data from the compute resource that runs simulations or performs reweighting tasks. We could later extend the API to allow multiple additional database servers to be specified, where lots of public simulation could be cached for other people to make use of, e.g.

# Attach to my compute and storage resources
estimator = PropertyEstimator(compute_resource='me@mycluster.myinstitute.org', nworkers=10, storage='my.database.server')
# Add (read-only) access to a public simulation database
estimator.addSimulationDatabase(server='simdata.openforcefieldgroup.org') 
# Now estimate some preoperties:
computed_properties = estimator.computeProperties(dataset, parameters)

Or we could provide another migration API to let you migrate your data to the cloud prior to spinning up an EC2 instance.

But let's not worry about this now. We're getting ahead of ourselves, I think.

davidlmobley commented 8 years ago

But let's not worry about this now. We're getting ahead of ourselves, I think.

This all sounds good. I was just trying to make sure this issue wouldn't require the API to be different in some way, but it sounds like it doesn't. So, great!

jchodera commented 8 years ago

Some other responses:

Measurement methods are just there to help PhysicalPropertyMeasurement decide which error model to use, since this may depend on the measurement technique. We could replace this by a string argument if we want to keep things simpler.

PhysicalPropertyMeasurement notes: How would the API for this look? Where would the notes come from? What would consume this information?

We could allow the original XML data associated with a measurement to be accessed as a dict-of-dicts via measurement.xml or something, if these would be stored in the ThermoML file. Or we could add a catch-all dict called annotations or something. Would this just be a way to annotate information while in memory? Or do you envision this somehow being used in the error model or something?

I thought we said that PropertyCalculator was the main thing that would calculate properties, and it might call out to PropertyEstimator (very similar API) when it decides reweighing is appropriate?

The PropertyCalculator is not exposed because it will likely be hidden within PhysicalPropertyMeasurement, which will define how to compute the measured physical property from a simulation dataset.

It seems like at this point you're not providing any info about what this estimator/calculator would return, which is probably key. Obviously it would return a value and associated uncertainty, but should it also provide details about how this was obtained, such as what length of simulation was used and so on, i.e. via a CalculationMethod subclass which mirrors ExperimentMethod? And also perhaps details of where the data is stored?

Good point. What information is essential here? Can you enumerate the information you would like to access?

jchodera commented 8 years ago

@davidlmobley brings up a good point with the archived simulation data. I personally like the idea of having ArchivedSimulations as a function of whatever might be the most appropriate class, i.e:
import PropertyEstimator
dat = PropertyEstimator.ArchivedSimulations(path, filename, **kwargs)
I guess if the archived simulation would be utilized by multiple classes I could see it being an independent class then.

Is it critical that you be able to access archived simulation data directly? If so, what would you need to obtain?

My goal with the API was to hide all that complexity from you so that we could do that "under the hood" in an efficient manner.

davidlmobley commented 8 years ago

@jchodera, responses follow:

PhysicalPropertyMeasurement notes: How would the API for this look? Where would the notes come from? What would consume this information?

So now that I think about it more, I realize my idea doesn't make sense. Really, what I was describing is something that would make sense if the database were ours and we were accessing it so that we had the ability to deposit notes/annotations back into the database permanently if we wanted to (as in FreeSolv) but we don't. So, really, there should be no "notes" field at present, and this is all up to the user - if we have concerns with the database and we think NIST should add warnings to something or whatever, we will have to follow up with NIST.

So, I retract my suggestion of a "notes" field. It makes no sense.

It seems like at this point you're not providing any info about what this estimator/calculator would return, which is probably key. Obviously it would return a value and associated uncertainty, but should it also provide details about how this was obtained, such as what length of simulation was used and so on, i.e. via a CalculationMethod subclass which mirrors ExperimentMethod? And also perhaps details of where the data is stored?

Good point. What information is essential here? Can you enumerate the information you would like to access?

Probably we want to start with something minimal (text string describing basic protocol, simulation length, etc.) but have this extensible as Bruce Ross develops his "yaml for preparation/simulation protocols" idea. Ideally you might want something that would let you read the whole simulation and analysis protocol and then hand it off to someone who might want to try a variant of it (i.e., "look, undergrad in the lab - we got this value using this protocol, but what would happened if you tried that with enhanced sampling method XYZ", or the same property with an alternate forcefield functional form, parameter set, or whatever).

So, in summary: Right now it just should exist and be super minimal, probably containing a plain text protocol description or something like that, but ultimately it would provide a protocol input that could be used to reproduce the exact calculation/analysis or, by editing, to tweak/modify the exact calculation.

davidlmobley commented 8 years ago

Information I would like to access at this point would primarily be: Simulation length, length of equilibration, integrator, timestep, etc. Basically just the types of things reviewers would ask you if you write a paper. :) So, a human-readable description of the protocol.

davidlmobley commented 8 years ago

My goal with the API was to hide all that complexity from you so that we could do that "under the hood" in an efficient manner.

I actually don't think this is necessary beyond what you've described. The complexity can be hidden as you propose.

jchodera commented 8 years ago

L

Information I would like to access at this point would primarily be: Simulation length, length of equilibration, integrator, timestep, etc. Basically just the types of things reviewers would ask you if you write a paper. :) So, a human-readable description of the protocol.

I need to push you a bit more on this to figure out how to meet your needs. (Not trying to be difficult!)

Is this information all required? For example, if this property was estimated partly entirely from reweighting, do you want information about all of the simulations used to generate the estimate? How would you want that information organized? Perhaps there should be a SimulationData container that holds this information? What if some new data was generated? How should this be denoted?
What will you do with this information? Does it need to be human-readable, computer-readable, or both?
How important is this? Do we need a solution for this now, or can we define how this information is accessed later?

jchodera commented 8 years ago

Sorry, just saw the previous comment!

I think we are both thinking about simulations very differently here. We need to move away from the idea that a simulation is an end product produced by someone's labor. Instead, a simulation is an ephemeral intermediate that is used to produce a property estimate. The simulations may be cached for efficiency, but if you blow them away, PropertyEstimator won't care---it will just regenerate the simulation data it needs to tackle the job at hand.

It really makes it a lot easier if you never know or need to know what simulation data is available or how it was generated. We can add that for diagnostic and tuning purposes, but imposing the need to retrieve very specific information about incidental data is like insisting that Keynote give you a ton of information about the intermediate files it generates as you create a presentation; it just isn't needed (though it may help someone optimize Keynote under the hood).

mrshirts commented 8 years ago

It really makes it a lot easier if you never know or need to know what simulation data is available or how it was generated.

But for the near future, we do need to be able to know how it was generated. What if a certain simulation was generated with a version of OpenMM that was later found to have a bug? Being able to see which stored simulations had that bug and erase those would be important . . .

But it's not the need to do it, it's the ability to do it if desired.

jchodera commented 8 years ago

But for the near future, we do need to be able to know how it was generated.
What if a certain simulation was generated with a version of OpenMM that was later found to have a bug?

We would delete all the data and start over. Again, the simulation data is ephemeral. It may grow to many terabytes in size, and most of the data may be irrelevant for the region of parameter space that we eventually converge to. Regenerating data of relevance to the current or final parameters may not take that long since it may represent only a very small fraction of the total dataset size.

But it's not the need to do it, it's the ability to do it if desired.

We can certainly add to the API to expose more information about the data later. I'm asking what do we need to build in now?

Remember, anything we add at this point will slow us down in producing a first prototype that actually does something useful. I'm asking for precise details on

Exactly what information do you need access to now?
How do you want to access it?
What will you do with it?

If the answer is "we may need this data at some point, but I can't answer these three questions in detail now", that's fine---we can extend the API later to let you get the info you need.

jchodera commented 8 years ago

Otherwise we have to wait for you to detail the answers to those questions, and nothing gets done until that's finalized...

davidlmobley commented 8 years ago

@jchodera - I'm sort of losing track of which comment is a response to what, since the one beginning:

Sorry, just saw the previous comment! didn't explain which previous comment.

But let me take these out of order and quote. I THINK the following two comments were talking about whether the API lets us deal with the "what simulations do we have and where" issue, so I'll respond to them in that vein:

I think we are both thinking about simulations very differently here. We need to move away from the idea that a simulation is an end product produced by someone's labor. Instead, a simulation is an ephemeral intermediate that is used to produce a property estimate. The simulations may be cached for efficiency, but if you blow them away, PropertyEstimator won't care---it will just regenerate the simulation data it needs to tackle the job at hand.

I agree with that for most of what we're thinking about at present, yes. However, I want to remain sensitive to the fact that the simulations may require a lot of simulation time. But as I said - you've convinced me the issue of what simulation data we have and where it lives all should be hidden from the user and not something we're having to deal with through the API.

It really makes it a lot easier if you never know or need to know what simulation data is available or how it was generated. We can add that for diagnostic and tuning purposes, but imposing the need to retrieve very specific information about incidental data is like insisting that Keynote give you a ton of information about the intermediate files it generates as you create a presentation; it just isn't needed (though it may help someone optimize Keynote under the hood).

This I generally agree with, though I note that we are also in the business of "optimizing Keynote" as it were.

Now, back to the SimulationData issue. I'm really just asking for something that reports how the property was calculated. If I imagine being a user of this but not a developer, there are probably all kinds of problems I can get into if I have no way of figuring out how a property was calculated/estimated. So:

Information I would like to access at this point would primarily be: Simulation length, length of equilibration, integrator, timestep, etc. Basically just the types of things reviewers would ask you if you write a paper. :) So, a human-readable description of the protocol.

I need to push you a bit more on this to figure out how to meet your needs. (Not trying to be difficult!)

To some degree, I think the issue I'm raising is the same one you're raising with Bryce and Michael relating to how to estimate different properties. There are a variety of ways of estimating some of these properties. Can we really expect people to use this framework and publish with it if we give them no way (aside from reading the source code) of determining how the properties they calculate were actually obtained? And would the reviewers be happy with that? I'm guessing the answer is "no".

I think we can mostly bypass this in the "early days", i.e. this summer and soon thereafter, since it will be just us and we'll know what's going on. But eventually, we hope, people will be using it just as a black box.

Is this information all required? For example, if this property was estimated partly entirely from reweighting, do you want information about all of the simulations used to generate the estimate? How would you want that information organized? Perhaps there should be a SimulationData container that holds this information? What if some new data was generated? How should this be denoted?

I addressed the short-term/long-term aspects in an earlier note where I talked about the Greg Ross/YAML type thing for reproducibility. I think short-term it could be just a set of text that states even something as simple as what functions were executed, and if archived data was used, what data was used. Yes, a SimulationData container holding this info would be fine, but it could be super minimal (basically just a doc string as far as I'm concerned!) initially.

What will you do with this information? Does it need to be human-readable, computer-readable, or both?

Initially, human-readable. Long term as described above (see YAML/Greg Ross comments) I could see having it be both.

How important is this? Do we need a solution for this now, or can we define how this information is accessed later?

I think right now it will be helpful for us to have something like a text log coming back to us with info on what was run, and that it will help with debugging, etc. I'm imagining cases where we might see that something works super efficiently as long as we stay within some particular temperature range, and that it takes way longer if we move outside that, and we wonder why... being able to inspect a log of how the property was calculated would answer that. Otherwise we know nothing other than that it just took longer.

If you think it's a terrible idea, I can certainly be talked out of it. I just don't see the harm in returning some info on how the properties were calculated, and I don't see it being a huge headache for the functions called to basically return some basic "logging" type info (initially, that's what this would mean) that would ultimately get passed back.

I don't imagine initially using it for anything programmatically, just looking at it from time to time when I want to understand what happened. But again as noted above I can see long-term there would be some programmatic uses for something more detailed.

davidlmobley commented 8 years ago

OK, so let me summarize that by responding just to this one thing that came in while I was writing:

If the answer is "we may need this data at some point, but I can't answer these three questions in detail now", that's fine---we can extend the API later to let you get the info you need.

1) Exactly what information do you need access to now? 2) How do you want to access it? 3) What will you do with it?

1) Just a plain text log of what functions were executed on what data would be fine for now. Later we could extend that. 2) Returned text would be fine 3) Look at it if I need to understand how something was calculated and it's difficult to tell (i.e, if I have no idea what data is archived, there's no way I could figure out, if I needed to, whether a property is coming for reweighting or not. I won't ordinarily need to know, but what if something goes wrong and I DO need to know?)

jchodera commented 8 years ago

1) Just a plain text log of what functions were executed on what data would be fine for now. Later we could extend that. 2) Returned text would be fine 3) Look at it if I need to understand how something was calculated and it's difficult to tell (i.e, if I have no idea what data is archived, there's no way I could figure out, if I needed to, whether a property is coming for reweighting or not. I won't ordinarily need to know, but what if something goes wrong and I DO need to know?)

None of these responses are anywhere close to being detailed enough for me to know what to provide. All you're saying is "I want some information on how every thing was calculated". How much information do you want? What kind specifically? What are you going to do with it?

Can you really "look at it" if I give you 14GB of data? Or the raw XML files for the serialized OpenMM System objects used to generate the data? Or the serialized Integrator that was used to sample from it?

The request to retrieve information---especially as a text log---is just not well-formulated.

I think the problem here is that I'm trying to lay down a division between things inside and outside of PropertyEstimator. Inside PropertyEstimator, I have access to all sorts of stuff---even the raw NetCDF simulation files sitting on the filesystem for the initial implementation---that help me optimize execution and catch bugs. We'll deal with this later.

What I'm asking right now is what do you need to access on the outside of PropertyEstimator to do the science? I've given you what I think is needed---and what I think @maxentile and @pgrinaway can construct a sampler with. But you're telling me that you want to peer under the hood of PropertyEstimator and poke around at the innards.

We're trying to avoid "spaghetti code" here, so this isn't allowed. Unless you can articulate the specific information you want the API to provide you---information that will not depend on the implementation details we end up using "under the hood" if we switch around MD engines or change data file format or use cloud servers or store things in a database---then I can't actually expose this information without causing us huge problems from the start.

davidlmobley commented 8 years ago

OK. Just declare it fine as is and we can figure out how to get at this later if we need it.

jchodera commented 8 years ago

To some degree, I think the issue I'm raising is the same one you're raising with Bryce and Michael relating to how to estimate different properties. There are a variety of ways of estimating some of these properties. Can we really expect people to use this framework and publish with it if we give them no way (aside from reading the source code) of determining how the properties they calculate were actually obtained? And would the reviewers be happy with that? I'm guessing the answer is "no".

Bryce and Michael are doing exactly the right thing by documenting the "best practices" that we will adhere to in the property calculations. One we fix the thermodynamic relationships we will estimate, we will also fix the behavior (what integrators will be used, how we will reweight to estimate properties, what error threshold triggers new simulations)---this will all become part of the specification to which the code will adhere. We just can't guarantee where the data will live, what format it will be in, how much there is, or many other implementation details.

If you guys end up developing a solid idea of precisely which pieces of information are needed, then we can expose this information as we go along. This would have to be concrete questions like:

How many simulations using a particular MixtureSystem object are there?
What ThermodynamicStates were those simulations run at?
What were the simulation lengths for simulations using a particular MixtureSystem and ThermodynamicState?

These are all things that are specific enough pieces of information---but general enough that they will not change when the backend is changed around---that they could be added to the API.

davidlmobley commented 8 years ago

So it sounds like you really feel strongly that there should be no CalculationMethod subclass which would mirror MeasurementMethod until we can enumerate exactly what pieces of information would be contained in this and how. That's fine, and we can proceed without that for now.

I'm just slightly confused as to why it's acceptable to leave the MeasurementMethod subclass defined in a rather similar way to what I'm proposing:

A MeasurementMethod subclass has information specific to the particular method used to measure a property (such as experimental uncertainty guidance).

Some examples:

FlowCalorimetry for HeatCapacity or ExcessMolarEnthalpy VibratingTubeMethod for MassDensity

I was really just proposing that we have something mirroring that here - i.e. if we calculate a MassDensity we get a EquilibriumMDSimulation or ReweightedMDSimulation set of info back, which would contain details we would define when we get to them in just the same way you've treated the MeasurementMethod.

If all of this has to be completely defined in advance, then I don't understand why you don't have dramatically more details for the ExperimentMethod subclass. Should we be removing that as well? You've not enumerated what details these would contain about the relevant experiments.

davidlmobley commented 8 years ago

In other words, I was just trying to say that if you're preparing to store ExperimentMethod details in this way in case we need them in the future, then it seems like we should be doing the same with CalculationMethod details, and I see myself having more immediate use of the latter (though as you note, I don't have a specific use case in mind right now!)

Anyway, your call. If you think it makes sense to have ExperimentMethod vaguely defined for now but keep it, and NOT have CalculationMethod, then that's fine. The important thing at this point is that we finalize this and get going.

jchodera commented 8 years ago

So it sounds like you really feel strongly that there should be no CalculationMethod subclass which would mirror MeasurementMethod until we can enumerate exactly what pieces of information would be contained in this and how. That's fine, and we can proceed without that for now.

There might be a CalculationMethod, but I don't know what its API should look like yet, so I'm hesitant to expose any of it yet. Again, if you can enumerate precisely what its API should look like (or what information specifically it needs to provide), I can do that! I just don't know what it should do, and asking it to "return some text describing what was done" is essentially useless to work from.

I'm just slightly confused as to why it's acceptable to leave the MeasurementMethod subclass defined in a rather similar way to what I'm proposing:

I haven't defined any methods for these classes yet. They're just objects whose class type tells PhysicalPropertyMeasurement what kind of measurement techniques was used at this point. They could just as easily be an enumerated type (if those existed in Python) or a string (which is always a bit more dangerous). They might have useful methods later, but I'm not sure what those would be.

We could similarly have a HeatCapacityViaEnergyVariance method, or a HeatCapacityViaTemperatureAnalyticalDerivative method, but my understanding of the Best Practices document was to pick only one best practice and adhere to that specification in our code.

I was really just proposing that we have something mirroring that here - i.e. if we calculate a MassDensity we get a EquilibriumMDSimulation or ReweightedMDSimulation set of info back, which would contain details we would define when we get to them in just the same way you've treated the MeasurementMethod.

Each ComputedProperty may have used a single simulation or hundreds of simulations. The property will always be computed by reweighting some set of simulations. I can give you a list somehow representing the simulations that were reweighted, but I don't know what information you want to be able to access from them, so they would be empty objects.

More importantly, we could always add ways to get information about these simulations to ComputedProperty later. It's a container, and we can add to its API so it contains more things. If you can think of specific things it should contain, I can add them later! But "returned text" does not tell me anything about what you want to know, and "I need to understand how something was calculated" is not going to be helpful because we will have a specification document that describes the unique exact procedure how physical properties are estimated via reweighting; beyond that, I'd need to now more specifically what you mean by "how something was calculated".

jchodera commented 8 years ago

I'll add that we may eventually want to have multiple types of simulations that are used---you mentioned the possibility of alchemical simulations in addition to equilibrium simulations. But the property calculator would always use only one type or only another type, presumably, so it may not be useful to return an set of empty AlchemicalSimulation objects vs a set of empty EquilibriumSimulation objects at the moment.

jchodera commented 8 years ago

An example of what I'm suggesting you guys give me (if you need/want it) is enough for me to give you a specific API like

# Attach to my compute and storage resources
estimator = PropertyEstimator(...)
# Estimate some properties
computed_properties = estimator.computeProperties(dataset, parameters)
# Get statistics about simulation data that went into each property
for property in computed_properties:
   # Get statistics about simulation data that was reweighted to estimate this property
   for simulation in property.simulations:
      print('The simulation was %.3f ns long' % (simulation.length / unit.nanoseconds))
      print('The simulation was run at %.1f K and %.1f atm' % (simulation.thermodynamic_state.temperature / unit.kelvin, simulation.thermodynamic_state.pressure / unit.atmospheres))
      # Get the ParameterSet that was used for this simulation
      parameters = simulation.parameters
      # what else do you want...?

davidlmobley commented 8 years ago

OK, so your response just solved my problem and now we're on the same page. The issue was that you were imagining the computeProperties function of the PropertyEstimator returning a ComputedProperty object, but your API spec does not say this! The posts you JUST made were the first time you ever mentioned a ComputedProperty object. So this:

More importantly, we could always add ways to get information about these simulations to ComputedProperty later. It's a container, and we can add to its API so it contains more things. If you can think of specific things it should contain, I can add them later! But "returned text" does not tell me anything about what you want to know, and "I need to understand how something was calculated" is not going to be helpful because we will have a specification document that describes the unique exact procedure how physical properties are estimated via reweighting; beyond that, I'd need to now more specifically what you mean by "how something was calculated".

completely solves my problem! Now you've finally made clear that this returns a ComputedProperty object, which could be extended if needed to give access to things about the calculation which I might need to know.

I'll remind you, this originally came up because I pointed out that your API draft doesn't say anything about what computeProperties would return:

It seems like at this point you're not providing any info about what this estimator/calculator would return, which is probably key. Obviously it would return a value and associated uncertainty, but should it also provide details about how this was obtained, such as what length of simulation was used and so on, i.e. via a CalculationMethod subclass which mirrors ExperimentMethod? And also perhaps details of where the data is stored?

My concern was that if you only return a value and an uncertainty, then there is no way to extend this to give access to any additional information we might ever decide we need without breaking everything. But instead you're saying it's not returning a value and an uncertainty, it's returning a ComputedProperty.

So, OK, great! Now we're in agreement. The draft API just needs updating so you make this clear. :)

jchodera commented 8 years ago

OK, so your response just solved my problem and now we're on the same page. The issue was that you were imagining the computeProperties function of the PropertyEstimator returning a ComputedProperty object, but your API spec does not say this!

Sorry for being unclear here. estimator.computeProperties(dataset, parameters) returns a list of container objects that have---at minimum---a .value and .uncertainty property, as listed in the example:

# Write out statistics about errors in computed properties
for (computed, measured) in (computed_properties, dataset):
   property_unit = measured.value.unit
   print('%24s : experiment %8.3f (%.3f) | calculated %8.3f (%.3f) %s' % (measured.value / property_unit, measured.uncertainty / property_unit, computed.value / property_unit, computed.uncertainty / property_unit, str(property_unit))

We can always add other properties to these objects, like information about the simulations they were derived from. I'll make that explicit in the API docs.

jchodera commented 8 years ago

Updated! Changes are here.

My concern was that if you only return a value and an uncertainty, then there is no way to extend this to give access to any additional information we might ever decide we need without breaking everything. But instead you're saying it's not returning a value and an uncertainty, it's returning a ComputedProperty.

Yup! This is clarified now.

Two remaining questions:

What do you think of the names PhysicalPropertyMeasurement and ComputedPhysicalProperty? Should these instead be MeasuredPhysicalProperty and ComputedPhysicalProperty for symmetry?
We're 100% sure about not needing to compute the properties for multiple parameter sets in a single call, is that right @maxentile and @pgrinaway? It's harder to go back if we change our minds later.

davidlmobley commented 8 years ago

As far as I'm concerned the API can now be finalized given the changes you just made.

My only concern was just that we not get stuck in a world where computeProperties ONLY gives us back a measured value and an uncertainty and we can't extend it without breaking everything when, down the line somewhere, I run into a scenario where for whatever reason I really need to be able to access some other set of information to do something else I'm not yet anticipating. When you started asking for me what info I needed right now I got really confused (and it was hard to be specific!) since I was just trying to argue that we need to allow it ultimately to be able to return more than just a value and an uncertainty. Sorry for all the confusion!

davidlmobley commented 8 years ago

What do you think of the names PhysicalPropertyMeasurement and ComputedPhysicalProperty? Should these instead be MeasuredPhysicalProperty and ComputedPhysicalProperty for symmetry?

I'd prefer the symmetry of the latter. Physicists are big on symmetry. :)

Regarding the issue of multiple parameter sets - I don't have enough expertise here. But how were you imagining making the MC sampling efficient? Wasn't there something where you can make it much more efficient if you can evaluate acceptance probabilities to multiple states all at once? Just asking - not pushing one way or another, as this is not my expertise.

maxentile commented 8 years ago

Hi!

Regarding support for multiple simultaneous evaluations, I don't have a strong opinion, since I don't have a good sense of how this choice will influence subsequent design decisions. From what John's mentioned though, it seems like it will be easy to implement up front, but hard to implement later, so it may be worth including initially for flexibility.

There are a few possible algorithms on our radar that could exploit this feature to achieve at best a linear speed-up in the number of simultaneous calls.

If we use Multiple-Try Metropolis, all the proposals within a single step could be parallelized.
If we use the affine-invariant ensemble sampler (implemented in emcee), we can make use of multiple simultaneous property queries.
If we are ever interested in globally optimizing the likelihood (or posterior) function to get a point estimate, we can use a batch bayesian optimization algorithm. A sequential bayesian optimization algorithm will make fewer property queries, but a batch algorithm may take less wall time, if each property query is expensive.
If we want to be able to update our current posterior estimate with a new type of physical property / experimental data, maybe a convenient way to do that will be to have a separate sub-posterior for each property and merge them later (as in http://arxiv.org/abs/1605.08576). I think this would already be supported by the current API proposal, but might be faster if multiple simultaneous evaluations were supported.
Also, this feature would be useful for few-parameter model systems where exhaustive evaluation of a grid of parameter values is feasible.

jchodera commented 8 years ago

Thanks, @maxentile! This is very helpful!

mrshirts commented 8 years ago

We're 100% sure about not needing to compute the properties for multiple parameter sets in a single call, is that right @maxentile and @pgrinaway? It's harder to go back if we change our minds later.

We do want to be clear about the distinction between "compute" and "estimate". IIRC, we definitely want to be able to estimate multiple parameters at once, and have always planned to. However, @maxentile brings up a good point -- if we run simulations that sample in parameters themselves during the simulation and not just sample configurations, we would want to be able to compute properties at all of those parameters.

jchodera commented 8 years ago

I'm not making a distinction between "compute" and "estimate". When you call

computed_properties = estimator.computeProperties(dataset, parameters)

it will first try to estimate all properties from existing cached simulation data. If it fails to meet error thresholds for some properties, new simulations will be triggered, and the call will block until the error thresholds can be met. It will block even if just one simulation needs to be run, or a hundred.

By allowing more work to be done at this time---by computing more properties or at more parameter sets---you can avoid the latency of multiple calls because more parallelization can occur.

davidlmobley commented 8 years ago

I agree with John here - we'll just ask it to compute, and then it will decide whether that means calculate fresh, estimate, or something in between.

Seems like given the comments regarding multiple parameter sets, we want to include that option.

On Wed, Jun 8, 2016 at 11:09 AM, John Chodera notifications@github.com wrote:

I'm not making a distinction between "compute" and "estimate". When you call

computed_properties = estimator.computeProperties(dataset, parameters)

it will first try to estimate all properties from existing cached simulation data. If it fails to meet error thresholds for some properties, new simulations will be triggered, and the call will block until the error thresholds can be met. It will block even if just one simulation needs to be run, or a hundred.

By allowing more work to be done at this time---by computing more properties or at more parameter sets---you can avoid the latency of multiple calls because more parallelization can occur.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/open-forcefield-group/open-forcefield-tools/pull/3#issuecomment-224678672, or mute the thread https://github.com/notifications/unsubscribe/AGzUYa5RXriFvTj1-2fDbqZ4bdK4gBroks5qJwU_gaJpZM4IrLrC .

David Mobley dmobley@gmail.com 949-385-2436

mrshirts commented 8 years ago

OK, I was under impression PropertyCalculator and PropertyEstimator were both still in play, sorry for any confusion.

jchodera commented 8 years ago

That is all hidden now to simplify the API.

davidlmobley commented 8 years ago

@jchodera - this is ready to be finalized once you make a couple of final modifications discussed above:

allow support for evaluation of multiple parameter sets in a single call
The names PhysicalPropertyMeasurement and ComputedPhysicalProperty will be replaced by MeasuredPhysicalProperty and ComputedPhysicalProperty

If it's easier, I can submit a PR to your fork which would implement the necessary changes.

davidlmobley commented 8 years ago

As discussed in today's call, we will also envision addition of another dataset type in addition to ThermoMLDataset which would be something like a GasPhaseMDDataset or EquilibriumExpectationDataset or something else that I will think through. I'll bring this up as a separate issue and (later) PR to the README.md

mrshirts commented 8 years ago

A few comments (part 1 of several - posting as separate comments for simplicity)

Would it be clearer, rather than create a mixture and addComponents, just have the mixture defined by arrays of compounds and mole fractions when it's initalized The reason being that if you start addComponent()-ing, the mole fractions aren't being enforced as you go -- you could say that the molefraction of the first is 0.2, the second is 0.2 -- then the last could be automatically be set to 0.6. But what if there are only two, and they are set to something that doesn't add to 1? At what point do you throw a flag? Not until the computation is done? And how do you know if the previous components had mole fractions defined?

So instead, why not just require on creation of a mixture that an array of components and array of mole fractions is passed in, and have the error checking at init time that the arrays are the same length, and the sums of the mole fractions are exactly 1?

mrshirts commented 8 years ago

Posted question is:

Is it OK that we don't specify the total number of molecules in the substance, and only specify the fractional composition? We would have to specify the total number of molecules in the

One issue is that there is not an exact mapping between mole fraction and number. If the mole fraction from experiment is 0.2237, will we need to simulate 10,000 molecules to be able to have exactly that mole fraction (yes, 2237 is a prime)? These might need to be quantities that are maintained separately. If you specify 0.2237, there are 500 total molecules, does it round to 112 molecules and change the concentration to 0.224? How do we deal with the mismatch? Note that I'm not totally sure here how to handle it -- other than carrying out experiments on both sides of the fraction, and linearly interpolating (i.e. do simulations at both 111 and 112, and weighting by 85% 112 and 15% 111.

It seems that we will likely need to decide how much approximation we are wiling to accept in mole fractions.

Incidentally, infinite dilution can be done perfectly well no matter what, since if there is only one solute molecule, then it really is good even if the actual mole fraction is more like 0.01, since we can't have aggregation. Properties of strongly associating molecules are more complicated, though, because when we are at mole fractions that are equivalent to, say, two molecules in the box of water, there's no possibility of higher order associations that might be occur in experiment.

mrshirts commented 8 years ago

Should make it clear that:

.value - the unit-bearing measurement value .uncertainty - the standard uncertainty of the measurement .reference - the literature reference (if present) for the measurement .DOI - the literature reference DOI (if available) for the measurement

Do not need to be defined in a dataset in order for property calculations to be performed.

mrshirts commented 8 years ago

Bryce, can you generate a list of the tags that we might use?

davidlmobley commented 8 years ago

@mrshirts - relating to mole fraction issues, I think the horse is already (in/out of) the barn on this one, in that we already handle this in SolvationToolkit with basically this API and plan to do something similar here. We check that the mole fractions sum to 1 when we get ready to build the mixture. So, in this API, a Mixture would just be something that stores the arrays of component mole fractions.

One issue is that there is not an exact mapping between mole fraction and number. If the mole fraction from experiment is 0.2237, will we need to simulate 10,000 molecules to be able to have exactly that mole fraction (yes, 2237 is a prime)? These might need to be quantities that are maintained separately. If you specify 0.2237, there are 500 total molecules, does it round to 112 molecules and change the concentration to 0.224? How do we deal with the mismatch? Note that I'm not totally sure here how to handle it -- other than carrying out experiments on both sides of the fraction, and linearly interpolating (i.e. do simulations at both 111 and 112, and weighting by 85% 112 and 15% 111.

It seems that we will likely need to decide how much approximation we are wiling to accept in mole fractions.

These are good points, but basically the way the API is constructed you get a chance to decide this for yourself later when doing the property calculation - specifically, the property calculation will get the specification of a mixture, and it gets to decide what size system to use, how many of each type of molecules, etc. Initially, presumably I/we will handle this as in SolvationToolkit, which is to say pick some size box we think is adequate, and get as close to the specified mole fraction as we can without being off "too far", i.e. by simple rounding. But there is room for more complicated treatment of these issues if/when it becomes necessary.

John has framed the API in a way that isolates us from these issues, as I think it should. Basically it allows us to do all that sort of science later if/when we need to.

openforcefield / open-forcefield-tools

Initial API outline proposal #3