ChemicalEnvironment sampling API

jchodera commented 8 years ago

I just wanted to make a brief proposal for a chemical environment sampling API.

The ChemicalEnvironment API (along with subclasses AtomEnvironment, BondEnvironment, AngleEnvironment, and TorsionEnvironment) is looking good. With a few additions, we will be able to utilize the tools from smarty to actually sample over different environments in the physical property sampler in https://github.com/open-forcefield-group/open-forcefield-tools.

I propose we encapsulate the generation of new ChemicalEnvironment proposals in subclasses of a ChemicalEnvironmentProposalEngine, with a very simple API, like:

# Create a new chemical environment proposal engine, which may take some implementation-specific parameters
engine = ChemicalEnvironmentProposalEngine(...parameters...)
# Propose a new atom chemical environment from an existing one
[proposed_child_atom_environment, logP] = engine.propose(parent_atom_environment)
# The same function would be used for bonds, angles, etc.
# The type (atom, bond, angle, torsion) is automatically detected from the number of SMIRKS atom labels.
[proposed_child_bond_environment, logP] = engine.propose(parent_bond_environment)
[proposed_child_angle_environment, logP] = engine.propose(parent_angle_environment)

where logP is the log probability of generating this particular child ChemicalEnvironment given the parent ChemicalEnvironment, which will be used in computing the Metropolis-Hastings acceptance probability.

davidlmobley commented 8 years ago

This looks reasonable to me. @bannanc should have a look, and we can bring up with @cbayly13 this afternoon as he probably won't have time before then.

(I'm about to step out until roughly the time of our call.)

davidlmobley commented 8 years ago

@jchodera - I went over this with Christopher and what you have written here looks reasonable to both of us. However, there are a variety of aspects we're not clear about yet. One of these is just "what happens between this line":

engine = ChemicalEnvironmentProposalEngine(...parameters...) "and this line?" [proposed_child_atom_environment, logP] = engine.propose(parent_atom_environment)

Something has to pick which parent atom environment to modify, get it into a chemical environment (or maybe we have chemical environments for everything already stored in memory?) and then feed it into the proposal engine. How do you imagine that working, or is that something we should be laying out at this stage as well?

Another issue is that the chemical environment objects are associated with particular parameters; are we expecting to be able to transition back and forth between these environment objects and SMIRKS/associated parameters as part of the move proposal process? I am imagining there are at least two general ways of proceeding:

SMIRKS roundrips: We use SMIRKS as the link between ForceField and ChemicalEnvironment objects in going both directions. For example, we might have a ForceField object which contains all of our chemical environments (specified by SMIRKS) and parameters, and when we prepare a move (or initialize our code) the SMIRKS are mapped to ChemicalEnvironment objects and then move construction consists of creating new ChemicalEnvironments which can produce SMIRKS when needed. In this framework, we would basically see the ForceField as our major data structure and we do roundtrips from ForceField -> ChemicalEnvironment objects -> ForceField, so we need SMIRKS -> ChemicalEnvironment -> SMIRKS capability
Producing SMIRKS only: We link ForceField and ChemicalEnvironment in some other way. For example, we might have a ForceField object which contains the current contents of the XML, PLUS additional specification of a ChemicalEnvironment for each force term, so we only need to go from ChemicalEnvironment objects to SMIRKS and not vise versa.

It seems like it should be possible to achieve the same end goal via both means, and I'm not sure if one is more general or preferable.

Note that we are thinking that, for other reasons, we are going to need the ability to go from SMIRKS to ChemicalEnvironment objects at some point (one reason may be for initialization, i.e. in point 2 above); right now we can only go from ChemicalEnvironment objects to SMIRKS and not vise versa.

jchodera commented 8 years ago

Something has to pick which parent atom environment to modify, get it into a chemical environment (or maybe we have chemical environments for everything already stored in memory?) and then feed it into the proposal engine. How do you imagine that working, or is that something we should be laying out at this stage as well?

There will be a parameter sampling engine that sits on top of this and has "plugin" methods that are incorporated into the parameter sampling framework, but that can stay pretty invariant if we have the ability to plug in different child atom type proposal schemes. We should spec out an API for that separately.

Another issue is that the chemical environment objects are associated with particular parameters;

The child parameter proposal scheme can be decoupled from the child ChemicalEnvironment proposal scheme. We'll want to have "plug-in" ways of treating these too, but this will be a separate API.

are we expecting to be able to transition back and forth between these environment objects and SMIRKS/associated parameters as part of the move proposal process?

It would be ideal if there was an invertible one-to-one mapping between ChemicalEnvironment objects and SMIRKS strings. How hard do you think that is?

If it is very difficult, we'll have to maintain the ChemicalEnvironment objects in the ParameterSet during sampling, and this may accelerate the need to develop a ParameterSet API.

bannanc commented 8 years ago

It would be ideal if there was an invertible one-to-one mapping between ChemicalEnvironment objects and SMIRKS strings. How hard do you think that is?

SMIRKS to Chemical Environments is going to take some time to sort out, but I believe it's doable. I've spent some time thinking about it, but haven't drafted any formal code yet. It's really just slightly complicated string parsing.

[proposed_child_atom_environment, logP] = engine.propose(parent_atom_environment)

Do we already know how to calculate a logP value with just one one parent and one child parameter? Or would the engine know about all of the chemical environments currently describing that set of parameters? Maybe that's a discussion about internal details and not the API.

davidlmobley commented 8 years ago

Do we already know how to calculate a logP value with just one one parent and one child parameter? Or would the engine know about all of the chemical environments currently describing that set of parameters? Maybe that's a discussion about internal details and not the API.

@bannanc - the logP here is the move proposal probability if I understand correctly; it allows you to introduce bias into your move proposals without breaking the balance needed (i.e. detailed balance). Specifically, imagine you want to propose moves which do X more often than moves which do Y because you think that X moves are more likely to be productive (maybe, make change sigma and epsilon in a coupled way more often than you change them in opposition or something). To be "allowed" to do this you have to know the probability that you proposed that specific move - i.e. if I propose X moves 90% of the time and Y moves 10% of the time then I have to correct for this in my acceptance criteria so that I still sample the right distribution at the end.

So, this is something which has to be built into the move proposal engine if it is biased, if I understand correctly.

bannanc commented 8 years ago

@davidlmobley that does make sense, I'm sure we'll have to discuss it in more detail later.

Also I realized I didn't comment on this in general, I think the API for the engine is completely reasonable, we should be able to translate the moves part of the "smirky sampler" @cbayly13 and I are working on directly into it.

jchodera commented 8 years ago

Just to kick of some brainstorming about the sampler API level above this, we might have something like this:

# Define a Bayesian posterior model
model = BayesianForcefieldModel(thermoml_dataset)
# Create a parameter sampler
sampler = ParameterSampler(model, initial_parameter_set)
# Register some proposal types, along with weights corresponding to the relative probability for selecting each move type
# Nonbonded type creation/destruction proposal
sampler.registerProposal(NonbondedTypeProposal(nboptions), weight=1.0)
# Nonbonded parameter sampling proposal
sampler.registerProposal(NonbondedParameterProposal(), weight=1.0)
# Bond type creation/destruction proposal
sampler.registerProposal(BondTypeProposal(bondoptions), weight=1.0)
# Bond parameter sampling proposal
sampler.registerProposal(BondParameterProposal(bondoptions), weight=1.0)
..
# BCC sampling proposal
sampler.registerProposal(BondChargeCorrectionProposal(bccoptions), weight=0.1)

These ...Proposal objects would all be subclasses of a ParameterSetProposal object, which has a simple API like

proposal_engine = ParameterSetProposal(options,...)
[proposed_parameters, logP] = proposal_engine.propose(current_parameters, model)

and could create/delete a parameter, propose a modification to a parameter, propose a change of functional form, or even propose coupled moves between multiple parameters.

For example, the NonbondedTypeProposal object would cache a ChemicalEnvironmentProposalEngine and use that to create/destroy types, using another scheme to propose new parameters from the parent or from the prior.

davidlmobley commented 8 years ago

@jchodera - the above looks good to me.

So then the sampler would be using the ChemicalEnvironmentProposalEngine discussed above internally once we actually get to proposing moves?

And the sampler would handle actual move acceptance/rejection, i.e. sampler.makeMove() or something along those lines?

jchodera commented 8 years ago

All of the ProposalEngine subclasses will have the same interface. They will need to

Propose a new forcefield from an old one (propose), returning the log ratio of proposal probabilities log P(old|new)/P(new|old)

That may be it for now.

The sampler will have a 'run' method that runs a specified number of iterations of the sampler. The 'update' method will generate a single sample according to the probability assigned to each move type.

The sam

cbayly13 commented 8 years ago

@jchodera @davidlmobley @bannanc I just realized something important that I want to embody here. It is in the spirit and direction of the ProposalEngine, and it is simply this: choosing decorators needs to be weighted differently for different atomic elements. Guessed examples based on experience with organic chemistry:

Base [Decorators with Odds]

C [ ('X4', 20), ('X3', 20), ('X2', 5), ('X1', 1), ] N [ ('X4', 10), ('X3', 10), ('X2', 10), ('X1', 1), ] O [ ('X4', 0), ('X3', 1), ('X2', 20), ('X1', 20), ] H,F,Cl,Br,I [ don't use this decorator on this atom ]

C [ ('H0', 1), ('H1', 1), ('H2', 1), ('H3', 1) ] N [ ('H0', 1), ('H1', 1), ('H2', 1), ('H3', 1) ] O [ ('H0', 20), ('H1', 20), ('H2', 1), ('H3', 0) ] H,F,Cl,Br,I [ don't use this decorator on this atom ]

C [ ('-1', 1), ('+0', 50), ('+1', 0) ] N [ ('-1', 1), ('+0', 20), ('+1', 5) ] O [ ('-1', 10), ('+0', 50), ('+1', 1) ] H,F,Cl,Br,I [ don't use this decorator on this atom ]

I cannot think of an instance where elements H,F,Cl,Br,I need a decorator. Substituents, yes... substituents with decorators... yes.

davidlmobley commented 8 years ago

I totally agree with this idea, @cbayly13 .

jchodera commented 8 years ago

I certainly agree with the principle that adjusting the probabilities with which different decorators are proposed based on element type could lead to large sampling efficiency games, but I'd prod you to see if you can think of a simple way we could have the sampling scheme discover or tune these probabilities itself. Downloading your chemical intuition is a great first step, @cbayly13, but it would be even better if we could come up with a general principle and have the sampling scheme discover the intuition using this simple principle.

For example, what if, during "equilibration", we adjusted the weights of each decorator that was accepted by +1? Or alternatively, just counted the number of atoms each element & decorator pair matched in a large molecule database? Would that give us the same kind of weighting?

davidlmobley commented 8 years ago

@jchodera - this is interesting. I'd been struggling to think of how we could "learn" what types of moves are productive without breaking the entire scheme, but I think your point about "equilibration" might be well-taken... We could run for some period with "adaptive" probabilities where we learn what types of moves seem to do any good and what don't and have that provide some feedback for how we fix probabilities for actually collecting data that samples from the correct distribution.

davidlmobley commented 8 years ago

That said, I think we should actually have @cbayly13 continue along the same route he's on right now; he's not really an expert on machine learning/Monte Carlo/sampling schemes, so what will probably work best is for him to encode what he thinks the probabilities ought to be based on his chemical intuition, then when we can figure out a suitable principle/sampling scheme we can see what we discover and how well it matches with his intuition.

To put it another way -- we've got him for ~2 more weeks and there is no way we'll be sampling SMIRKS patterns in combination with property calculations in the next two weeks, so if we wait to be able to learn this, then we'll have no way of comparing what we get with what he might have expected. On the other hand, if he goes ahead and builds these weights/probabilities himself manually, then when we get there we'll be able to compare with what he expected.

cbayly13 commented 8 years ago

I agree with the above discussion: with @jchodera on the goal of getting BaFFLE to learn good moves, and with @davidlmobley that I could not wisely undertake this in my remaining time (which is a huge bummer for me because this is getting more and more exciting and I would actually want to learn these adaptive methods if I had time).

As Chemistry Guy, my main concern is that I have not yet succeeded in communicating how my "seasoned" human chemical perspective translates into the science we are doing. Hence my crude attempts, like couching expertise in terms of "...choosing decorators needs to be weighted differently for different atomic elements." At least it beats communicating expertise by pronouncing new atom types...

davidlmobley commented 8 years ago

I think this is actually a perfect way to maximize the use of your remaining time before switching to Parm@frosst -> SMIRFF, @cbayly13 .

jchodera commented 8 years ago

I definitely think @cbayly13 should forge ahead with his thoughtfully selected weights! I just don't want to lose the opportunity to get him to think about the "meta learning rule" that would be able to learn this knowledge on the fly: Can he come up with ideas that might be able to roughly approximate this knowledge using a simpler rule?

Sent from my iPhone

cbayly13 commented 8 years ago

I will try to think of formalizing higher-level rules as @jchodera suggests.

davidlmobley commented 7 years ago

@maxentile - this is also a related issue.

bannanc commented 7 years ago

@jchodera and @davidlmobley I think I missed this discussion, or at least the finer details in the whirlwind that was mid to late August.

My instincts on a "meta learning rule" here is to think about what decorators do. They are supposed to be used to distinguish elements from one another. So a [#6X4] would be a subset of [#6]. However for the atoms where @cbayly13 said he never uses decorators, primarily hydrogens and halides. Most decorators will either match all of that element or none of them so you can only have [#1X1] and it would match all of the [#1] atoms. However, This doesn't really get to the probability of different decorators as much as the question of is this decorator useful for this element number.

Or alternatively, just counted the number of atoms each element & decorator pair matched in a large molecule database? Would that give us the same kind of weighting?

It think this is sort of what I'm getting at, but removing any decorator that matches all of a certain element number, such as #1X1 or #17R0

openforcefield / smarty

ChemicalEnvironment sampling API #110