openforcefield / smarty

Chemical perception tree automated exploration tool.
http://openforcefield.org
MIT License
19 stars 8 forks source link

Plan/implement chemical object infrastructure for SMIRKS-ifying SMARTY #49

Closed davidlmobley closed 8 years ago

davidlmobley commented 8 years ago

We've realized that to go past AlkEthOH without trying to handle WAY TOO MANY special cases in construction of decorator proposals, we are going to need to move past treating SMARTS as simple strings to have the SMARTS proposal engine actually understand these. And since we have to move to SMIRKS anyway, we should really be doing this for SMIRKS and not for SMARTS.

Thus, @cbayly13 is attempting to plan the infrastructure that is going to be needed for this.

I'll update this issue (or one of us will) as we have more details.

jchodera commented 8 years ago

This sounds like exactly what we need. We can consider chemical graphs, where each node is annotated with SMARTS strings, and a Monte Carlo proposal process that can create and destroy connected nodes or modify the SMARTS strings on one or more nodes. networkx may be useful in representing these graphs since we already use it.

This might also call for an additional tool where you apply a set of SMIRKS to a huge druglike chemical dataset to see if we can type everything. We might be able to learn rules about which Monte Carlo moves are efficient (in the sense of proposing chemically reasonable SMIRKS when starting from chemically reasonable parent SMIRKS) using this.

davidlmobley commented 8 years ago

The immediate problem we are focused on is how to even propose sensible SMARTS (or here, SMIRKS) which are suitably decorated to cover what we want, EVEN IF most of the time they don't make chemical sense. Having them mostly make chemical sense will also be an important optimization.

I'll mention the "molecules as graphs" idea to Christopher right now to see if he thinks it helps with this at all.

davidlmobley commented 8 years ago

@jchodera : On discussion with Christopher I realized I'm confused. What are the chemical graphs you're envisioning? Are these graphs molecules, or the bipartite graph currently in Smarty where we're using atom types for graph coloring, or something else?

Maybe this will be a topic for our catch-up call tomorrow.

jchodera commented 8 years ago

Update: @cbayly13 will present some slides on how he thinks this might work at our Wednesday call, and then we will turn this into a detailed proposal for implementation.

bannanc commented 8 years ago

@cbayly13 We discussed this a fair amount during the call today, I'm working on organizing my notes and writing up how we plan to implement at ChemicalEnvironment object which will generally use networkx graph to help organize the chemical environments we have been talking about. I'm going to work on detailed proposal and include it in a comment here when it is more organized.

davidlmobley commented 8 years ago

@cbayly13 's slides for the call, for the record.

smirks_bondsanglestorsions.pdf

bannanc commented 8 years ago

CHEMICAL ENVIRONMENT OBJECT

This is a first draft at what I think needs to be included in the chemical environment object. @davidlmobley and @jchodera your comments would be appreciated. I think I will just keep editing this comment from the discussion I'm sure is to follow.

I am working on developing a chemical environment object. As discussed, it will use networkx graph objects to organize the chemical environment. It will have nodes that are atoms and edges that are bonds, and will include methods for making "moves" in the chemical environment space.

It will be initialized with a "connection-type" such as atom, bond, angle, torsion along with a list of basetypes and decorators.

This is the list I have of methods we will want:

Implement chemical environment:

basetypes = AtomTyper.read_typelist(basetype_filename)
decorators = AtomTyper.read_typelist(decorator_filename)
torsion = chemicalEnvironment('torsion', basetypes, decorators)

Some things to consider for this implementation:

bannanc commented 8 years ago

HOW TO STRUCTURE ATOMS (nodes):

Same as the comment above, I think it will be easiest to edit this as we establish things in the discussion I am sure will follow

I have been thinking about how to structure the nodes (atoms), I'm still learning how to use the graph objects, so maybe their are properties that already exist in the graph objects we can use, maybe they need to be a separate object, or can be a dictionary. There are some considerations below for how to use the list of basetypes and decorators.

Right now we know at least three properties an atom will need:

I have a syntax question that might need @cbayly13 to answer. To give some context:

However, there are still two ways to do this,

jchodera commented 8 years ago

Thanks for taking a stab at this, @bannanc!

You might check out the PEP8 style guide for Python coding conventions. If we mostly stick to this (e.g. regarding method names), it will help keep coding styles consistent. (I realize I probably screwed this up already with ForceField since I started from simtk.openmm.app.ForceField, which uses CamelCase convention instead of understrike_naming convention, but this was done for consistency with the OpenMM version.)

I like this list of methods. Here's what I would do if I was drafting the API:

# Use NetworkX for handling graphs
import networkx as nx

class ChemicalEnvironment(object):
   """Chemical environment abstract base class that matches an atom, bond, angle, etc.
   """
   class Atom(object):
      """Atom representation, which may have some base (logical OR) and decorator (logical AND) properties.

      Type will be OR(basetypes) AND AND(decorators).

      Properties
      -----------
      basetypes : set of string
         The base types that will be combined with logical OR
      decorators : set of string
         The decorators that will be combined with logical AND
      """
      def __init__(self):
         """Initialize an Atom object with empty basetypes and decorators.
         """
         self.basetypes = set()
         self.decorators = set()

      def asSMARTS(self, index=None):
         """Return the atom representation as SMARTS/SMIRKS.

         Parameters
         ----------
         index : int, optional, default=None
            If not None, the specified index will be attached as a SMIRKS index (e.g. '[#6:1]')

         Returns
         --------
         smarts : str
             The SMARTS/SMIRKS string
         """
         pass

   class Bond(object):
      """Bond representation, which may have base (OR) and decorator (AND) types.
      """
      # implementation similar to Atom
      pass

   def __init__(self):
      """Initialize a chemical environment abstract base class.
      """
      # Create an empty graph which will store Atom objects.
      self._graph = nx.Graph()

   def asSMARTS(self):
      """Return a SMARTS/SMIRKS representation of the chemical environment.
      """
      pass

   def selectAtom(self):
      """Select an atom with uniform probability.
      """
      pass

   def selectBond(self):
      """Select a bond with uniform probability.
      """
      pass

   def addAtom(self, atom, bond, bondedAtom):
      """Add an atom to the specified target atom.
      """
      pass

   def removeAtom(self, atom):
      """Remove the specified atom from the chemical environment, along with its associated bond.
      """
      pass

class AtomChemicalEnvironment(ChemicalEnvironment):
      """Chemical environment matching a single atom.
      """
      def __init__(self):
         """Initialize a chemical environment corresponding to matching a single atom.
         """
         # Initialize base class
         super(AtomChemicalEnvironment,self).__init__()
         # Add a labeled atom corresponding to :1
         # TODO

# and so on for BondChemicalEnvironment, AngleChemicalEnvironment, TorsionChemicalEnvironment
bannanc commented 8 years ago

Thanks @jchodera I'll take a look at the PEP8. I'll take your suggested API as a starting point and work on this, I'll post with any clarifying questions.

davidlmobley commented 8 years ago

OK, @bannanc , so I'm looking at these proposals in a bit more detail. It looks like a great first pass from both you and John.

Before getting to questions which have been raised, I want to give some random thoughts:

1) Since this is a total rework of everything, we probably do want to take advantage of this chance in order to build in some chemistry at the atomic level to make it easier for ourselves if possible. For example, what if Atom has a GetElement() method which can return the element if it is NOT already a combination of multiple types via an "or"? The reason this would be helpful is that it will allow us to do things like say that a carbon can't be connected to six things, etc. 2) Item 1 may be stretching the API above slightly. Shouldn't we also use this opportunity to build in chemistry more thoroughly? What you're calling Atom in the above API is really AtomCombo as it's one or more atoms which will be at a particular node in the graph. It seems like it should, itself, consist of several Atoms, each of which has an element and is possibly decorated. 3) Right now there's a small documentation/clarity problem, as we have the Atom (AtomCombo in my terminology) taking:

    basetypes : set of string
         The base types that will be combined with logical OR
      decorators : set of string
         The decorators that will be combined with logical AND

but we haven't said whether the basetypes are individually being decorated by the decorators or are collectively being decorated by the decorators - i.e. if I have basetypes ['#6', '#8'] and decorators ['X2'] is that even valid, or do I need two decorators? And if it's valid, do I end up with the SMARTS [#6X2,#8], [#6X2,#8X2] or something else? It seems to me this is taking us too far down the road of "I have to already know about SMARTS syntax in order to create an Atom", which is partly why I think this should be an AtomCombo which itself is composed of Atoms which contain an actual chemical element, an associated individual base type, a decorator, and the names for the base type + decorator, as in item 3 above. And a Bond is between an AtomCombo, etc. Alternatively, if this notation is too confusing, I suppose there are some sensible alternative namings, such as Atom and Element for the nodes and basic building blocks, respectively, or Node and Atom for the nodes and basic building blocks. I actually think Atom and Element may make the most sense, since we do add bonds between Atoms so it makes sense to retain that notation, even though Atoms are combinations of things. 4) @bannanc , make sure you include the notion of which atoms are labeled in the framework. 5) Minor niggly detail, but for clarity I think we should have the method asSMARTS be renamed asSMIRKS since we're going to be using SMIRKS everywhere and the downstream use of these is subtly different, so reminding people that they are SMIRKS rather than SMARTS when they call the function is probably a good idea. 6) (Minor detail we can address later) In John's framework, do we also want changeBond in addition to addAtom and removeAtom? It could be a way to discover derivative types without having to delete and insert atoms - i.e. if we have a [#6:1]-[#6:2](-[#8])-[#8] then maybe [#6:1]=[#6:2](-[#8])-[#8] will also be interesting on its own, and it could be a lot faster to get there that way. 7) Terminology: Maybe it's time to get rid of the basetypes terminology entirely for clarity; everything will consist of elements which may get ORd together to construct more complicated patterns and there are no basetypes.

I'll get to your questions, @bannanc , momentarily in a separate comment.

davidlmobley commented 8 years ago

@bannanc , question responses:

  • should we have an optional list of basetypes or bondtypes that will be used to initialize the chemical environment object?

I think at this point the idea is to construct the objects; we'll get to initialization when we start playing with them. That said we will want to initialize with what elements we want to consider and what decorators they can have. Likewise for bond types, I suppose. Initially I wanted to say we should just build these in, but I suppose if someone wants to explore a forcefield where it only makes sense to sample across single bonds we should let them. :)

There are probably two different initializations which need to happen: 1) An initialization of everything which is within our overall scope, i.e. what elements/decorators/bonds can possibly be considered (and their names) 2) An initialization of a set which matches everything, i.e. a starting point which covers everything. Presumably for torsions, this would just be that we have a single torsion which matches X-X-X-X, and for bonds a single bond which looks like X-X, etc.

  • Should the chemical Environment object read the basetype files or decorators? I think I like it being general enough that we could feed it lists instead?

You want to separate the objects from the file parsing, IMO, otherwise you lose generality. So initialization happens by passing in arguments. Elements/decorators/bonds can be read from files and then passed in, or input by human and passed in. (Here, since we know what bonds we're interested in, I see no point in reading these from a file.)

  • Should we continue trying to make these have a human readable format? As we've seen the ones for H1, H2, H3 are not really understandable anymore.

I would not work that hard at this, but it could be something like outputting the element types for the labeled atoms ONLY along with connectivity info, i.e. [#1:1]-[#6:2](-[#8])=[#6:3] is "hydrogen singly bonded to carbon doubly bonded to carbon"; maybe we should ditch the English and go with simplified SMARTS for just the labeled atoms, i.e. [#1]-[#6]=[#6]. English gets too complicated because of grammar considerations.

In other words, I'm sort of undecided. Skip this in your initial implementation, as it will be hard to make it worthwhile, I think.

  • Should bond types or bond decorators be an input?

Addressed above I think. Probably we should have these all be inputs just in case someone wants to limit sampling by narrowing them or do something else funky we don't envision.

  • We will still need something similar to the sample_atomtypes method, but this seems like it should be in a sampler that will use the methods listed above to make changes?

Yes. First you have to have the way to represent everything, then you can worry about how to make changes to it, but it's a separate tool.

  • Is a string the best option to indicate the connection-type?

Hmm. You could also represent this (at least for certain bond types) in your graph by having multiple connections, i.e. a double bond would be two edges connecting atoms, a single bond one, triple bond three, etc. But that seems like too much work for not much gain. I'd think you'd represent it as just a label on the edges that you do have.

  • How to combine the "and" and "or" atom decorators, discussed in the next comment in more detail. List of atoms with shared decorators or decorated atoms that are listed together?

add decorators to each base type then "or" with other decorated base types. For example: [#7&+1,#6] which is in the OpenEye SMARTS tutorial. decorators to a list of "or'd" atoms, such as [#7,#8;-1] which prioritizes the , between the atoms first.

I'm not sure whether there is functionally a difference. You probably need to play around with these on some examples and see. I can see it being easier to find groups like the second one by chance since then we get to decorate a group of elements at once, which is good if the decorator is charge. But I could see this being bad if the decorator is connectivity, i.e. connectivity probably only matters one element at a time.

One other thing I thought of when writing this that I didn't mention in my earlier comment which seems important. It's perhaps obvious. But, labeled atoms are always consecutive (adjacent in the graph), right? i.e. we're not going to end up with [#6:1]-[#8]-[#6:2], though we might have [#6:1](-[#8])-[#6:2]. Presumably this means you take care of labeling at initialization of a particular pattern (i.e. an angle starts off with three labeled but empty atoms, like X-X-X, and then additional detail gets added by adding unlabeled branches or specifying what elements are in each position as we make copies and begin refining them).

bannanc commented 8 years ago

1) and 3) I think moving to Atoms which are made of (possibly decorated) Elements would be fine. I don't like the key word AtomCombo in our chemical environment a node is a single atom (who's type could be something like [#6,#7,#8] which is carbon or nitrogen, or oxygen.

4) atoms will have a attribute atom.label that will be the number if it is labeled and None otherwise.

5) I have AsSMIRKS in my notes, but didn't comment here. Certainly, a method for writing SMARTS strings could be available for the atomic chemical environment, but I'm not sure it would be worth the effort.

6) I included ChangeABond in my initial set which I imagined doing something similar to this.

7) I agree with moving away from the word basetype, it's not always clear what it means.

davidlmobley commented 8 years ago

1) and 3) I think moving to Atoms which are made of (possibly decorated) Elements would be fine. I don't like the key word AtomCombo in our chemical environment a node is a single atom (who's type could be something like [#6,#7,#8] which is carbon or nitrogen, or oxygen.

OK, yes, I agree with this 100%. (As I read further I got more and more clear why it made since to have an Atom be an Atom even if it was composed of an OR of atoms. But, it does make it more confusing what element is there, so MAYBE it is worth having a separate Element container, or at least it is worth tracking what element the Atom is unless it is OR-d.

bannanc commented 8 years ago

Should we continue trying to make these have a human readable format? As we've seen the ones for H1, H2, H3 are not really understandable anymore.

simplified SMARTS for just the labeled atoms, i.e. [#1]-[#6]=[#6]

So I should probably have been more specific about why this is something to address early. Right now everything that is in .smarts files has a SMARTS and associated name. My question here also included, should I continue to incorporate these "names".

For example, if we want to maintain english we need input in the form that includes it so a set of decorators could be:

[ ['X2', 'connectivity 2'], 
['H0', 'zero neighboring hydrogens'] ]

Where if we aren't tracking english names the input lists would only be of the form ['X2', 'H0']

davidlmobley commented 8 years ago

@bannanc - I think the files should still be formatted as now since it's still useful to explain in the files what the different things mean. However, I think that we shouldn't bother constructing descriptive names as we WANT to discover patterns that are too complicated to easily and automatically describe in English.

bannanc commented 8 years ago

I'm not sure whether there is functionally a difference. You probably need to play around with these on some examples and see. I can see it being easier to find groups like the second one by chance since then we get to decorate a group of elements at once, which is good if the decorator is charge. But I could see this being bad if the decorator is connectivity, i.e. connectivity probably only matters one element at a time.

I think this is a place we will need to consult with @cbayly13 we can address it when him and I are back in the office. However, I would argue charge also needs to be associated with the individual atom. For example, a +1 nitrogen and a neutral carbon should have similar angles right?

Once again, we should probably wait to discuss this, but I think we should be able to rework fairly easily regardless of what is in this first mock-up.

jchodera commented 8 years ago

I wasn't able to catch up on this whole conversation, but want to argue the opposite of what @davidlmobley is arguing: What if we kept this as simple as possible---with zero, one, or more totally arbitrary SMARTS attributed allowed in the AND and OR slots----and ran with that to see how far we get?

I am worried that if we build in too much chemistry up front---however well-informed----we will MISS OUT on the opportunity to see whether it is really necessary or not. We may never have another chance to find out.

If we build in the idea that each Atom object can only match a specific element, we miss out on wildcard torsions and condemn ourselves to parameter hell right from the start. If we allow exceptions, we have to deal with exceptions, making everything more complicated.

What if we just do the minimal thing, focus on crafting the most useful decorators, and see what happens?

I'm all in favor of building in EXTENSIBILITY (eg the ability to add constraints as we find they are necessary) and FUNCTIONALITY (eg convenience functions to return atom elemental identities if needed), but these should be things we add as we find they are useful in simplifying the code, rather than chemical constraints on what we can do.

Most of the efficiency will come in good design of the proposal moves anyway.

bannanc commented 8 years ago

If we build in the idea that each Atom object can only match a specific element, we miss out on wildcard torsions and condemn ourselves to parameter hell right from the start. If we allow exceptions, we have to deal with exceptions, making everything more complicated.

@jchodera , I don't think @davidlmobley was suggesting we limit the atoms to be elements, but that atoms are made up of elements. I think that this might be an interesting thing to consider later, especially if (well I think when) we find that we need to teach it more chemistry so we efficiently explore "real" chemical space.

What if we kept this as simple as possible---with zero, one, or more totally arbitrary SMARTS attributed allowed in the AND and OR slots----and ran with that to see how far we get?

I had still been thinking about this from the chemistry side, that I needed to know what was being ANDed or ORed together, but from the coding side, I will implement both list and join them with ',' or ';' in AsSMIRKs then we can make decisions later about what was going into them.

I wasn't able to catch up on this whole conversation

I'm working out of office for a week so there was a bit more on here than usual since we normally talk together and then post what we discussed to github. I think I have a good enough picture of what we want to make a draft of the API, I'll make a WIP pull request when I have something worth sharing, probably the end of next week.

davidlmobley commented 8 years ago

@jchodera

I wasn't able to catch up on this whole conversation, but want to argue the opposite of what @davidlmobley is arguing: What if we kept this as simple as possible---with zero, one, or more totally arbitrary SMARTS attributed allowed in the AND and OR slots----and ran with that to see how far we get?

This is misunderstanding my point slightly. I'm not saying that we don't want to allow totally arbitrary combinations of things in the AND and OR slots, but we DO want the infrastructure in place so that IF AND WHEN we want to facilitate chemistry we don't have to totally rewrite the whole thing. In other words, if I have the capability to know what element is in a particular place if an element is there then it will make using chemistry a lot easier later.

Particularly, if we don't have this capability in place, then we will never be able to do things like limit a carbon atom to only making four bonds, and so on.

I am worried that if we build in too much chemistry up front---however well-informed----we will MISS OUT on the opportunity to see whether it is really necessary or not. We may never have another chance to find out.

We agree here. We don't want to build this in unless we need it. We only want the infrastructure in place so that we don't have to rewrite the whole thing again if we DO need it.

If we build in the idea that each Atom object can only match a specific element, we miss out on wildcard torsions and condemn ourselves to parameter hell right from the start. If we allow exceptions, we have to deal with exceptions, making everything more complicated.

I wasn't proposing each Atom object can only match a specific element (but maybe I wasn't sufficiently clear on this); wildcards should definitely be allowed. But, if an atom DOES contain only a specific element, I want to be able to track what element it is.

I'm all in favor of building in EXTENSIBILITY (eg the ability to add constraints as we find they are necessary) and FUNCTIONALITY (eg convenience functions to return atom elemental identities if needed), but these should be things we add as we find they are useful in simplifying the code, rather than chemical constraints on what we can do.

We're not proposing adding constraints at this point. We just want to track elemental identity when applicable (i.e. when it's not a wildcard) as all of us on this end are convinced we're going to eventually need to have some chemistry in place to avoid too big a combinatorial explosion. We don't want to build the chemistry in until that happens, and we, like you, would rather not build it in unless we have to.

Most of the efficiency will come in good design of the proposal moves anyway.

We agree with this, but that's also almost certainly going to involve an element of not proposing too many things which are chemically impossible (carbons with ten bonds, hydrogens with three bonds, etc.). Again, we absolutely plan to proceed in as simple a way as possible until we have to do otherwise, but it's fairly clear that it will.

Think of a torsion, if you will. We have four labeled atoms, and each involves an environment with alpha substituents and beta substituents. Let's not work with SMIRKS/SMARTS for a moment and just write this this way: X1 ~ X2 ~ X3 ~ X4 for the torsion Now, if each atom has up to N alpha substituents and each alpha substituent has up to N beta substituents, well, the possibilities explode pretty quickly. Let's just think of position X1 for now, and imagine we have N=4 (maximum of four connected atoms) and we have an "alphabet" of just five elements (carbon, hydrogen, oxygen, nitrogen and, let's say, sulfur). How many possible combinations are there here for just the X1 position?

Well, X1 can have N-1=3 other connected atoms, so three different alpha positions, each of which has five possibilities, so we're up to 5*(5)^3=625 possibilities at X1 just considering alpha substituents. But each alpha substituent (of which there are three) each has up to N-1=3 beta substituents, each of which has five possibilities, so considering beta substituents we pick up another (5^3)^3=15625 possibilities, so we're already at 9.7 million possible combinations at the X1 position alone. If you take (9.7e6)^4, well, you don't want to. (Update: Maybe my math here is wrong as we have nine beta substituents so I think it should be 5^9 for the number of possible beta substituents, which is 1.9e6 combinations at the beta position alone, which is even worse than I wrote. But the point is still the same.)

And this is using ONLY five undecorated elemental types and only allowing each atom to have up to four bonds, which will be impossible unless we keep track of elemental types when appropriate.

Hopefully this makes clear why we think we're going to need somewhat more chemistry eventually in order to design good move proposals and narrow the search space.

davidlmobley commented 8 years ago

@jchodera , I don't think @davidlmobley was suggesting we limit the atoms to be elements, but that atoms are made up of elements. I think that this might be an interesting thing to consider later, especially if (well I think when) we find that we need to teach it more chemistry so we efficiently explore "real" chemical space.

Yes, atoms are made up of Elements which are sometimes (often) specific elements, but can also be wildcards which correspond to no specific element (i.e. bonds, angles, torsions would initially all start off as wildcards which match everything).

jchodera commented 8 years ago

What about cases where we want to allow several elements?

And why make elements privileged over, say, electronegative atoms? Or aromatic atoms?

davidlmobley commented 8 years ago

@jchodera :

What about cases where we want to allow several elements?

I'm saying that the basic building blocks of an Atom in the graph are units which themselves might or might not be these particular things which we happen to call elements. They might be "any element", or they might be "this specific element" (carbon for example), and if the latter, that implies certain rules about how they can connect to one another.

And why make elements privileged over, say, electronegative atoms? Or aromatic atoms?

I have no strong feeling about making them privileged. I'm just saying that when we HAVE info about allowed connectivity, it will narrow our search dramatically, and in many cases we DO have that information. This is not just a string language where all of the strings are interchangeable; we know a lot more about what is possible for the particular string #6 than we do for the string * or for !#1

Would you be happier with this variant -- an Atom is composed of one or more Unit objects; if more than one, they are OR'd together. A Unit object carries an .MaxConnections which can be an integer, or it can be None, meaning that it can have any number of connections.

Allowed Units could be defined rather similarly to existing basetypes, but they also would include * which represents any atom (and has .MaxConnections = None) and they could also include inverses such as !#1, again with .MaxConnections = None.

Does that help it be more palatable? Again, I am NOT saying that we want to use this information NOW, I'm just saying that if we have the framework for it in place, it will make it a lot easier to use this info when (or in your view, if) we need it.

Again, we're totally agreed we don't want to use this info unless we need it. It's just that we believe we're going to need it, so if we're writing this infrastructure now we might as well make it so it can accommodate connectivity info.

jchodera commented 8 years ago

I don't think we should be trying to optimize this until we know it is going to be a problem. I don't believe we yet have evidence this will be a problem.

Also, in the specific example you give regarding the maximum number of connections, this is totally unnecessary---the additional infrastructure you propose to add will just add technical debt but won't make it more efficient. If the proposal scheme adds more connections than a node can support (e.g. because the node is labeled with #6 and a fifth connection is proposed), it will be rejected because it does not type any molecules in the set. This may take a few milliseconds longer than checking against the computed .MaxConnections, but this is likely not the rate-limiting step right now, and the simple rule (rejecting because it does not type any molecules in the set) gives us the same behavior you suggest from the more complex system.

I think we should keep it as simple as possible and only aim for more complexity if it is going to bring us order-of-magnitude speedups.

davidlmobley commented 8 years ago

@jchodera : I was thinking about this a lot to try and figure out why you are so convinced this is not an issue (i.e. a few milliseconds delay over each of a billion attempts, which is the ballpark we'll be in when we start looking at atom typing/chemical environment exploration for substantial numbers of real molecules, is a pretty significant amount of time) and I think I finally get what you're saying. It's not that inefficiency here is no problem, but it's that (once this is hooked to a parameter exploration framework) any amount of computer time we use sampling over SMARTS/SMIRKS to find potentially interesting new patterns is going to be basically negligible compared to what happens after we DO propose a new SMARTS/SMIRKS and associated parameters - specifically, we'll be recomputing some set of properties which will certainly involve substantial computer time. Even if only 1 in 100 or 1 in 1000 SMIRKS proposals types anything, then we'll go off and run a rather expensive set of calculations on the result and any expense of generating the proposal will be tiny compared to the cost of evaluating whether introducing a new chemical environment and parameters is worthwhile.

Is that what you're saying?

I think we should keep it as simple as possible and only aim for more complexity if it is going to bring us order-of-magnitude speedups.

I do think there is at least an order of magnitude in SMIRKS sampling speed (which may be irrelevant) to be had here if we track the number of possible connections (and things like which decorators can go with which element - i.e. a variety of decorators don't make sense for hydrogen), because the number of possible SMIRKS patterns which are valid but make no chemical sense is SO VAST compared to the number of patterns which actually make chemical sense (the difference is more than an order of magnitude certainly). But as noted above, once SMIRKS sampling is coupled to parameter exploration we're going to spend a ton of CPU time there so SMIRKS sampling speed just may not matter.

jchodera commented 8 years ago

Is that what you're saying?

Yes!

I do think there is at least an order of magnitude in SMIRKS sampling speed (which may be irrelevant) to be had here if we track the number of possible connections (and things like which decorators can go with which element - i.e. a variety of decorators don't make sense for hydrogen),

I think there are lots of things we can do to speed things up! But there may be a simpler approach that is more broadly applicable than focusing on whether and which elements imply limits on numbers of edges incident to each node.

That's something we can certainly add---reject immediately if more edges than are chemically feasible are present---but we may be able to make this logic more easily extensible to build in more sophisticated fast rejection.

As an example, the current SMARTY will keep track of atom types that don't match anything in the dataset and rapidly reject if one of these is proposed in the future. This is a very simple rule that learns a lot about what kinds of chemistries are not possible (at least in our dataset).

We could probably come up with a more extensible way to "plug in" multiple evaluations that quickly evaluate our chemical graph before going on to type molecules---allowing the incident edges thing to be one of these---but for now, I don't think it's going to be rate limiting since the SMIRKS matching is pretty fast compared to any property computations or even the maximum subgraph computation.

davidlmobley commented 8 years ago

OK, so I think I agree that it doesn't make sense to do much/anything about this now. Except I still worry about making it difficult for ourselves later.

Is there anything you think it would make sense to put into the API to make it easier to deal with these issues in an elegant way in the future if we need to without adding "technical debt" now? I still have some reservations about thinking of the basic building block in our graph (the things we use to build up an Atom) as just a string rather than as something that might be a chemical object (an element, a wildcard matching any element, or a wildcard matching a subset of elements). Is there a way we could get the best of both worlds?

As an example, the current SMARTY will keep track of atom types that don't match anything in the dataset and rapidly reject if one of these is proposed in the future. This is a very simple rule that learns a lot about what kinds of chemistries are not possible (at least in our dataset).

On this - do you have a sense of how well lists (which are what are currently used for this) handle memory? I've been a bit worried that these will blow up at some point. Certainly it seems clear we wouldn't want to store a list of what torsions don't match any torsions in our set, as that would surely cause memory problems, but I'm not sure how much of a concern this could get to be for strings for individual Atoms (potentially decorated ORs of pattern+decorator). There are still a lot of combinations there, but I don't have any sense of how efficient/inefficient lists are in terms of memory usage.

jchodera commented 8 years ago

I still have some reservations about thinking of the basic building block in our graph (the things we use to build up an Atom) as just a string rather than as something that might be a chemical object (an element, a wildcard matching any element, or a wildcard matching a subset of elements). Is there a way we could get the best of both worlds?

As I understand it, an Atom holds a set of strings that are ORed together (base types), and this set gets ANDed with another set of strings that are ANDed together themselves (decorators). I think that may be more complicated than we need, but the SMARTS patterns in @cbayly13's slides suggested this complexity was necessary. The Atom then has a atom.get_SMARTS() function that compiles these into a SMARTS/SMIRKS string, and the whole chemical graph has a environment.get_SMARTS() function that assembles the whole pattern.

What if we do the following?

This would allow us to add a whole host of methods to the chemical graph environment to try out various simple ideas for adding chemical intuition to speed things up.

jchodera commented 8 years ago

On this - do you have a sense of how well lists (which are what are currently used for this) handle memory?

set uses a hashing scheme to deliver O(1) performance for checking if something is a member of the set.

If the set gets larger than memory can handle, we can start to be more clever about handling things, or migrate to a data store that uses fast lookups to disk.

bannanc commented 8 years ago

I've been trying to follow this back and forth. So far, I like @jchodera 's idea of an is_valid() method for moves in the chemicalEnvironment class that for now can just always return True. It would allow us to build in more complexity later.

@davidlmobley currently the atom class is very simple, basically just things that can be AND'd or OR'd into a SMARTS string and it stores the label number if there is one. I am not convinced that adding an extra class is necessary for elements. I'm thinking it will add too much complexity. When (If) we decide we need information about the element numbers I think we can build it into the Atom class instead.

The current conundrum I'm facing is what to call the things we are ANDing and ORing together in the Atom class, right now they are called "basetypes" and "decorators", but since we want to move away from the word basetype I sort of want something different here. Also, I think there is a chance that the things being OR'd together might need the option to be decorated elements. What do you guys think?

davidlmobley commented 8 years ago

@bannanc - I'm quite sick today so responses are slow. Sorry.

I've been trying to follow this back and forth. So far, I like @jchodera 's idea of an is_valid() method for moves in the chemicalEnvironment class that for now can just always return True. It would allow us to build in more complexity later.

Yes, I like this a lot. It will allow us to extend in ways that make sense later if we want to, and we don't have to have extra baggage now. So that's what we should be shooting for I think.

@davidlmobley currently the atom class is very simple, basically just things that can be AND'd or OR'd into a SMARTS string and it stores the label number if there is one. I am not convinced that adding an extra class is necessary for elements. I'm thinking it will add too much complexity. When (If) we decide we need information about the element numbers I think we can build it into the Atom class instead.

Yes, I think I'm on board with this now. One other point about this (tangentially) is that as we move to bonds, angles, and torsions we'll be using wildcards a lot more often than we were for "atom types". (But we should probably also run, as a check, the atom type tests we have now if we start everything with a wildcard such as *).

The current conundrum I'm facing is what to call the things we are ANDing and ORing together in the Atom class, right now they are called "basetypes" and "decorators", but since we want to move away from the word basetype I sort of want something different here. Also, I think there is a chance that the things being OR'd together might need the option to be decorated elements. What do you guys think?

Yes, I don't like the "basetypes" name here partly because it imagines a particular usage and suggests the types are special somehow. Remember these are just representing atoms. So maybe we should just be calling them "atom descriptor" plus "decorator" (so an Atom would be a combination of atom descriptors and decorators). Or "building block" plus decorator.

And yes, as noted above there are two possible OR routes we might go if I remember correctly, where we either OR together decorated descriptors, or we OR together descriptors and decorate them. I think this can only be resolved by talking to Christopher, so for now I'd just implement your favorite of these options (i.e. the one we use the most in AlkEthOH's SMIRFF implementation would be a good choice) and we can switch or implement both if needed after talking to Chris.

bannanc commented 8 years ago

I'm quite sick today so responses are slow. Sorry.

I hope you're feeling better, but don't feel pressured to get back to us.

start everything with a wildcard such as *

At the moment, I've set it so that if the "basetypes" set is empty then a * is used for the atom when converting to SMARTS/SMIRKS, but isn't actually stored in the set of strings.

(i.e. the one we use the most in AlkEthOH's SMIRFF implementation would be a good choice)

We can get AlkEthOH with just neighboring atoms and not descriptors, sine there aren't different types of connections or aromaticity for any of the atoms yet.

davidlmobley commented 8 years ago

Closed by #72