Discussion of Smarty objective function

cbayly13 commented 8 years ago

Label: Real Issue probably needing John Chodera's attention

Following on from Slack/OpenForcefield/general post from this morning, which is cut and pasted here: ------------- begin paste One of the key issues is the objective function used in the smarty sampler: how does the Bayesian sampler elucidate if something is better (does "better"="more probable"?). For example, in the AlkEthOH world, there are exactly two oxygen atom types: OH (for alcohols) and OS (for ethers). Yet in the smarty output, a completely general oxygen type [#8] is only associated with one atom type, maybe OS or maybe OH. I think the behavior needs to be: a) all atom types matching the smarts need to be collected. b) each atom type needs to be identified as a separate population for scoring purposes. c) scoring needs to include a notion that, while the existing "good" of having a higher count of matches is retained, having more than one population is "bad". That way, when a new child smarts correctly identifies an atom-type, it results in a pure-population parent and a pure-population child, which is identified as clearly better. Thinking up such a scoring function I can do, but how it fits into the Bayesian procedure in general and smarty in particular is not clear to me. ------------- end paste

Attached is a log file from a 200-iteration run on the AlkEthOH set: smartytest_filt1.200.txt

Consider Iterations 110 and 111. By iteration 110 (line 1605), all the unused atom types are gone and smarty is back down to the simple 3 general atom types for C, O, and H, being [#1], [#6], and [#8]. Hydrogen [#1] should match HC and HO, and it appears in line 1610 it has found 464 matches (field 2) which I think should correspond to HC and HO. But when it comes to scoring Hydrogen in fields 6 and 7 of the same line, it only uses HC with a count of 244. Likewise Oxygen (line 1612) should match OH and OS, finds 107 matches (field 2) but appears to score only with OH with 68 matches. I think this is bug 1: scoring should use all the matches, both HC and HO for Hydrogen, and both OH and OS for Oxygen. I think bug 2 is that the objective function needs, at the same time, to recognize that capturing two reference populations (HC and HO for Hydrogen) is bad compared to recognizing a single population. This is exemplified in the next iteration (iteration 111): a child H-smarts is created resulting in the parent now capturing only a single population (HC) and the child capturing only a single population (HO); IMHO this is really good and should be rewarded somehow apart from the count. What I think is bug 3 is that somehow the acceptance/scoring/MCMC process seems to ultimately throw all these good discoveries away; over and over again I see a run discovering good child parameters that essentially solve the problem, but then it throws them all away and finishes with C, O, and H. I know the properly functioning MCMC can destroy good things but here "Not converging" is an understatement; is this properly functioning MCMC?

davidlmobley commented 8 years ago

@jchodera - this could use your attention when you have time, as it's shortly going to stall @cbayly13 (though I may be able to provide some help myself as well).

I did look at the code (had to deal with a deadline so I'm just digging in) and the acceptance criterion looks correct to me, so it seems like to verify "bug 3" (above) is a bug we'd need to actually look at whether the different possible scenarios are being occupied with the correct probabilities. But superficially, from looking at the output on AlkEthOH, it certainly does look like we're finding things we think are "good" and losing them too easily.

jchodera commented 8 years ago

Thanks for the detailed issue report, and to @davidlmobley for reviewing the relevant code already!

I will read through the code manually and add some unit tests to see if I can identify a bug, but I wonder if the issue isn't simply my having misunderstood @cbayly13's original description of his desired scoring function.

The original scoring function proposal---as I understood it---was this:

Identify the one-to-one correspondence between SMARTY and parm@frosst atomtypes that maximizes the number of atoms that are typed the same way by both atomtype sets
Return the number of atoms typed the same way

This has the desirable property that only when all atoms are typed into different classes by SMARTY in a manner that matches parm@frosst is the scoring function maximized. If a SMARTY atomtype matches atoms that are subdivided into multiple parm@frosst types, the score will be low, but splitting off a child that matches one of the parm@frosst types would increase the score.

@cbayly13 notes that his desired scoring function is actually different:

I think this is bug 1: scoring should use all the matches, both HC and HO for Hydrogen, and both OH and OS for Oxygen. I think bug 2 is that the objective function needs, at the same time, to recognize that capturing two reference populations (HC and HO for Hydrogen) is bad compared to recognizing a single population.

This suggests (1) we should allow one-to-many relationships between SMARTY and parm@frosst, and (2) we should have a penalty for one-to-many relationships. While I understand how we could do (1), this sounds undesirable without (2), since a single SMARTY type that matches any atom would then match every atom and would rely entirely on the penalty in (2). I'm not sure how we would implement (2), though---how would we penalize multiple populations? Some sort of entropy metric? What would the penalty factor scaling the entropy metric be in order to balance (1) and (2)?

I'm happy to implement alternative scoring functions, but I think we'd need to come up with something more concrete.

In the meantime, this also highlights the need for better tools to examine

the atomtype-to-atomtype correspondences
molecules that are poorly typed by SMARTY

If you have any ideas on those, please suggest in separate threads how I could implement them or expose things to make it easier for you to implement them!

davidlmobley commented 8 years ago

@jchodera - We need to think through the scoring function more, though it may not be the main issue at present, as until right now I don't think we understood how the scoring function was working. @cbayly13 is going to be offline for the weekend so I'll have to discuss this with him Monday morning.

But, your point here:

This has the desirable property that only when all atoms are typed into different classes by SMARTY in a manner that matches parm@frosst is the scoring function maximized. If a SMARTY atomtype matches atoms that are subdivided into multiple parm@frosst types, the score will be low, but splitting off a child that matches one of the parm@frosst types would increase the score.

is basically what we want in a scoring function. @cbayly13 says it should be the scoring analog of "The truth, the whole truth, and nothing but the truth" - SMARTS matching the atom types, all the atom types, and nothing but those atom types. It sounds like what you have in place will do that already (with an important exception I'll discuss below!).

EXCEPTION: In @cbayly13 's view, one should be rewarded for matching all of the atom types even via generics - i.e. you should get a decent score by just having a single elemental type for each relevant element, even though it covers multiple parm@frosst types. There should be a penalty for matching multiple types - but it clearly should be better to have a generic carbon which covers all carbons, even though it matches multiple parm@frosst types, than to only have a specialized carbon which matches one specific parm@frosst type. Am I understanding right that the current scoring function would treat these two as identical? Let's think of AlkEthOH - I should ultimately end up with two oxygen types, OH and OS. But it should be better to have a single generic oxygen that covers everything than to have only OH, otherwise we'll never be able to learn parsimony when needed (i.e. if we need to be able to propose moves which destroy specialized atom types in favor of generics). Obviously it ought to be best to have OH and OS, but having "generic oxygen" should be better than OH or OS alone. If I understand right, the current scoring function doesn't recognize that. Is that the case?

@cbayly13 's answer to this is to reward things for covering all atom types, even if the correspondence is not one SMARTS to one atom type, but to have a lesser penalty for SMARTS which cover multiple atom types.

The larger issue that led to this discussion was what we see now on the AlkEthOH test, where we're seeing "good" new atom types/SMARTS being found, but they seem to be getting destroyed too often rather than retained (though we haven't computed the relevant statistics yet). Since we didn't totally understand the scoring function yet we thought that might be the issue. This discussion leads me to believe that the current scoring function ought to be working better than we're seeing (though we need to compute those statistics to confirm -- we need to get at something like the probability of having OH- and OS-equivalent types, versus the probability of having only a single oxygen type) which may suggest a problem aside from the scoring function.

Let me post this to get it out there and then think for a bit about whether I can generate a concrete proposal for what the penalty ought to be if we do modify the scoring function.

jchodera commented 8 years ago

but it clearly should be better to have a generic carbon which covers all carbons, even though it matches multiple parm@frosst types, than to only have a specialized carbon which matches one specific parm@frosst type.

Can you elaborate on the rationale here? I don't quite understand the reasoning.

By construction, the sampling algorithm will ensure that the typing scheme is ALWAYS complete. Any typing scheme that would result in untyped atoms is rejected. The typing is also hierarchical, so we have general parent atomtypes and some more specific child atomtypes. So the scenario you suggest---only having a specific carbon atomtype that would leave some carbon atoms untyped---is simplest impossible.

Let's think of AlkEthOH - I should ultimately end up with two oxygen types, OH and OS. But it should be better to have a single generic oxygen that covers everything than to have only OH, otherwise we'll never be able to learn parsimony when needed (i.e. if we need to be able to propose moves which destroy specialized atom types in favor of generics). Obviously it ought to be best to have OH and OS, but having "generic oxygen" should be better than OH or OS alone. If I understand right, the current scoring function doesn't recognize that. Is that the case?

Remember that we are exclusively working with HIERARCHICAL typing schemes here. If parm@frosst is a COMPLETE typing scheme instead, where the union of OH and OS is meant to be complete, this is incompatible with the hierarchical atom typing. Instead, we would have something like a parent atom type and a child that splits off either OH or OS.

Parsimony is partly driven by the entropic penality not currently in the sampling scheme because we don't have priors over parameters, so that won't quite be something we can assess at this stage. We are just looking for whether the typing scheme is able to mostly match the atom typing of the database by a greedy-ish optimization scheme.

jchodera commented 8 years ago

The larger issue that led to this discussion was what we see now on the AlkEthOH test, where we're seeing "good" new atom types/SMARTS being found, but they seem to be getting destroyed too often rather than retained

This may be an issue with the temperature being too high. The statistic of interest is how big the increase in fraction of atoms matched is compared to the input fractional temperature. If the number of typed atoms increases, the proposed atomtypes will be accepted. But the temperature will allow backtracking---loss of typed atoms---fraction on the scale of the fractional temperature. If the fractional temperature is 0.05, you an easily accept a proposal that will LOSE 5% of the typed atoms. This might be what is going on. Play with he temperature first!

davidlmobley commented 8 years ago

By construction, the sampling algorithm will ensure that the typing scheme is ALWAYS complete. Any typing scheme that would result in untyped atoms is rejected. The typing is also hierarchical, so we have general parent atomtypes and some more specific child atomtypes. So the scenario you suggest---only having a specific carbon atomtype that would leave some carbon atoms untyped---is simplest impossible.

Ah, OK, sorry - I was forgetting that this insists that everything be able to be typed (probably because it's not obvious from the README.md and I haven't read the code fully enough yet) which does mean that we're going to start off with and retain generics.

It sounds like this does have the features we're looking for though, so we may be OK as is.

davidlmobley commented 8 years ago

This may be an issue with the temperature being too high. The statistic of interest is how big the increase in fraction of atoms matched is compared to the input fractional temperature. If the number of typed atoms increases, the proposed atomtypes will be accepted. But the temperature will allow backtracking---loss of typed atoms---fraction on the scale of the fractional temperature. If the fractional temperature is 0.05, you an easily accept a proposal that will LOSE 5% of the typed atoms. This might be what is going on. Play with he temperature first!

Yes, I suggested that to @bannanc and @cbayly13 last night after reading the relevant code. I was going to try cranking it down to something very low to see if it acts the way I would expect in that limit.

jchodera commented 8 years ago

If you sent the temperature to zero, it will just behave like a greedy algorithm, with the real temperature allowing backtracking of only one or two atoms. Something close to zero may be what you want.

If we end up getting stuck in minima, we can add something like simulated tempering or replica exchange, but let's see how far you get with this.

Starting with manually-construxted SMARTS strings for AlkEthOH would be a good way to ensure that what you consider optimal typing actually achieves a high or near perfect score. This might be the most important control experiment to try first.

davidlmobley commented 8 years ago

Starting with manually-construxted SMARTS strings for AlkEthOH would be a good way to ensure that what you consider optimal typing actually achieves a high or near perfect score. This might be the most important control experiment to try first.

That's a good idea, yes. @bannanc - this would be something for you or @cbayly13 to try Monday or thereabouts. I'll see if I can try the temperature thing today between chores/kid duties. :)

jchodera commented 8 years ago

Great! And again, I'm happy to implement any concrete alternative scoring schemes or diagnostics you guys come up with, or help make changes that make it easier for you to plug in new scoring functions yourself.

davidlmobley commented 8 years ago

OK, @jchodera - I'm starting to get it, especially now that I'm looking at the output/code again. Now I see that the SMARTS match multiple atom types, but the scoring function only rewards matches to one atom type, which seems reasonable. This is not totally clear from the output (we can work on that) but now it makes sense.

One thing I'm still not clear on (but presumably I can figure out once I finish looking at the code): How does it decide which parm@frosst type to score? Is it the one yielding the largest number of matches?

jchodera commented 8 years ago

One thing I'm still not clear on (but presumably I can figure out once I finish looking at the code): How does it decide which parm@frosst type to score? Is it the one yielding the largest number of matches?

Correct! The current algorithm tries to find the one-to-one matching of SMARTY types to parm@frosst types that maximize the number of atoms in the molecule set typed in common.

Is there some visualization that could better help you visualize this? We could draw the matrix of SMARTY and parm@frosst to show how many atoms match each pairing, and then highlight the cells for the maximum bipartite marching correspondences, for example.

davidlmobley commented 8 years ago

I've verified that after the fix I just submitted ( #5 ) this behaves appropriately when T=0 or T=small (i.e. T=0.001), in that destruction of "good" atom types gets a lot less common. It appears that T=0.1 is too high for the AlkEthOH example, at least. Here's the final output from the AlkEthOH test at T=0.001 (T=0 is virtually identical):

INDEX        ATOMS  MOLECULES                                                          TYPE NAME                           SMARTS REF TYPE        FRACTION OF REF TYPED MOLECULES MATCHED
    1 :        396         42 |                                                         hydrogen                             [#1]       HC              244 /              244 (100.000%)
    2 :         68         42 |                                         hydrogen oxygen-adjacent                   [#1&$(*~[#8])]       HO               68 /               68 (100.000%)
    3 :        232         42 |                                                           carbon                             [#6]       CT              232 /              232 (100.000%)
    4 :         39         30 |                                                           oxygen                             [#8]       OS               39 /               39 (100.000%)
    5 :         68         42 |                                         oxygen hydrogen-adjacent                   [#8&$(*~[#1])]       OH               68 /               68 (100.000%)
TOTAL :        803         42 |                                                                                                         651 /      803 match (81.071 %)

davidlmobley commented 8 years ago

Is there some visualization that could better help you visualize this? We could draw the matrix of SMARTY and parm@frosst to show how many atoms match each pairing, and then highlight the cells for the maximum bipartite marching correspondences, for example.

Yes, something like this sounds useful. Definitely this is something to think about though. One possibility is just to wait til Monday when I can discuss with the group, then get back to you. I am not TOTALLY sure I know what you mean though - maybe you want to draw a cartoon example of this for a simple case on a sheet of paper and attach? ;)

Now that I just fixed the "T=0.1 no matter what" bug, everything seems to be behaving properly as far as I can tell. So I think we're in business again.

jchodera commented 8 years ago

Great! Suggest you try a hand coded em set of SMARTY types for reference on Monday.

I will mock up the matrix I am suggesting.

davidlmobley commented 8 years ago

Chris is going to try the hand coded set of types.

We've done a bunch more testing and identified a series of problems and/or changes we believe are needed that I'll be trying to describe shortly in a set of individual issues. I'll leave this one open as it contains a lot of useful discussion/info.

jchodera commented 8 years ago

Chris is going to try the hand coded set of types.

Sounds like a good positive control!

We've done a bunch more testing and identified a series of problems and/or changes we believe are needed that I'll be trying to describe shortly in a set of individual issues.

Sounds great!

davidlmobley commented 8 years ago

This issue is resolved in that starting with hand-coded atom types does result in a 100% score (and as of #32 is now a test), though the discussion here may still be interesting down the line. Marking as closed for now.

openforcefield / smarty

Discussion of Smarty objective function #4