Closed davidlmobley closed 8 years ago
This sounds like exactly what we need. We can consider chemical graphs, where each node is annotated with SMARTS strings, and a Monte Carlo proposal process that can create and destroy connected nodes or modify the SMARTS strings on one or more nodes. networkx
may be useful in representing these graphs since we already use it.
This might also call for an additional tool where you apply a set of SMIRKS to a huge druglike chemical dataset to see if we can type everything. We might be able to learn rules about which Monte Carlo moves are efficient (in the sense of proposing chemically reasonable SMIRKS when starting from chemically reasonable parent SMIRKS) using this.
The immediate problem we are focused on is how to even propose sensible SMARTS (or here, SMIRKS) which are suitably decorated to cover what we want, EVEN IF most of the time they don't make chemical sense. Having them mostly make chemical sense will also be an important optimization.
I'll mention the "molecules as graphs" idea to Christopher right now to see if he thinks it helps with this at all.
@jchodera : On discussion with Christopher I realized I'm confused. What are the chemical graphs you're envisioning? Are these graphs molecules, or the bipartite graph currently in Smarty where we're using atom types for graph coloring, or something else?
Maybe this will be a topic for our catch-up call tomorrow.
Update: @cbayly13 will present some slides on how he thinks this might work at our Wednesday call, and then we will turn this into a detailed proposal for implementation.
@cbayly13 We discussed this a fair amount during the call today, I'm working on organizing my notes and writing up how we plan to implement at ChemicalEnvironment
object which will generally use networkx graph to help organize the chemical environments we have been talking about. I'm going to work on detailed proposal and include it in a comment here when it is more organized.
@cbayly13 's slides for the call, for the record.
CHEMICAL ENVIRONMENT OBJECT
This is a first draft at what I think needs to be included in the chemical environment object. @davidlmobley and @jchodera your comments would be appreciated. I think I will just keep editing this comment from the discussion I'm sure is to follow.
I am working on developing a chemical environment object. As discussed, it will use networkx graph objects to organize the chemical environment. It will have nodes that are atoms and edges that are bonds, and will include methods for making "moves" in the chemical environment space.
It will be initialized with a "connection-type" such as atom, bond, angle, torsion along with a list of basetypes and decorators.
This is the list I have of methods we will want:
PickAnAtom(self)
: pick an atom (node) from the graphPickABond(self)
: pick a bond (edge) from the graphAddAnAtom(self, atom)
: create a new node attached to the specified atom, I think this will have a generic bond, but the new atom will be from the basetypeChangeABond(self, bond)
: add a bond type ('-','=',':','#') or decorate the bond type ('@') are there other bond decorators? should that be two methods? ChangeAnAtom(self, atom, newdecorator = None, newBaseType = None)
: depending on how we decide to format atoms (see comments below). It will add the specified decorator and/or basetype to the specified atom. MakeSMIRKString(self)
: converts the graph object into a SMIRKS string (should we also have a method for MakeSMARTSString
?Implement chemical environment:
basetypes = AtomTyper.read_typelist(basetype_filename)
decorators = AtomTyper.read_typelist(decorator_filename)
torsion = chemicalEnvironment('torsion', basetypes, decorators)
Some things to consider for this implementation:
basetypes
or bondtypes
that will be used to initialize the chemical environment object?sample_atomtypes
method, but this seems like it should be in a sampler that will use the methods listed above to make changes?connection-type
?HOW TO STRUCTURE ATOMS (nodes):
Same as the comment above, I think it will be easiest to edit this as we establish things in the discussion I am sure will follow
I have been thinking about how to structure the nodes (atoms), I'm still learning how to use the graph objects, so maybe their are properties that already exist in the graph objects we can use, maybe they need to be a separate object, or can be a dictionary. There are some considerations below for how to use the list of basetypes and decorators.
Right now we know at least three properties an atom will need:
basetypes
: I'm imagining these as a list of strings that will be "or'd" together so ['#6','#8']
would be "[# 6,# 8]"
decorators
: these are a list of strings for things like connectivity or number of hydrogens. label
: ":#" for the SMIRK, or None (or "") if it's not a marked atom. I have a syntax question that might need @cbayly13 to answer. To give some context:
basetypes
which could be elemental or a list of elements (such as the list of electron withdrawing groups). decorators
for things like X4, H0, A, a, maybe the Ring bond count, which we discussed "anding" together. However, there are still two ways to do this,
"[#7&+1,#6]"
which is in the OpenEye SMARTS tutorial. "[#7,#8;-1]"
which prioritizes the ,
between the atoms first.Thanks for taking a stab at this, @bannanc!
You might check out the PEP8 style guide for Python coding conventions. If we mostly stick to this (e.g. regarding method names), it will help keep coding styles consistent. (I realize I probably screwed this up already with ForceField
since I started from simtk.openmm.app.ForceField
, which uses CamelCase
convention instead of understrike_naming
convention, but this was done for consistency with the OpenMM version.)
I like this list of methods. Here's what I would do if I was drafting the API:
# Use NetworkX for handling graphs
import networkx as nx
class ChemicalEnvironment(object):
"""Chemical environment abstract base class that matches an atom, bond, angle, etc.
"""
class Atom(object):
"""Atom representation, which may have some base (logical OR) and decorator (logical AND) properties.
Type will be OR(basetypes) AND AND(decorators).
Properties
-----------
basetypes : set of string
The base types that will be combined with logical OR
decorators : set of string
The decorators that will be combined with logical AND
"""
def __init__(self):
"""Initialize an Atom object with empty basetypes and decorators.
"""
self.basetypes = set()
self.decorators = set()
def asSMARTS(self, index=None):
"""Return the atom representation as SMARTS/SMIRKS.
Parameters
----------
index : int, optional, default=None
If not None, the specified index will be attached as a SMIRKS index (e.g. '[#6:1]')
Returns
--------
smarts : str
The SMARTS/SMIRKS string
"""
pass
class Bond(object):
"""Bond representation, which may have base (OR) and decorator (AND) types.
"""
# implementation similar to Atom
pass
def __init__(self):
"""Initialize a chemical environment abstract base class.
"""
# Create an empty graph which will store Atom objects.
self._graph = nx.Graph()
def asSMARTS(self):
"""Return a SMARTS/SMIRKS representation of the chemical environment.
"""
pass
def selectAtom(self):
"""Select an atom with uniform probability.
"""
pass
def selectBond(self):
"""Select a bond with uniform probability.
"""
pass
def addAtom(self, atom, bond, bondedAtom):
"""Add an atom to the specified target atom.
"""
pass
def removeAtom(self, atom):
"""Remove the specified atom from the chemical environment, along with its associated bond.
"""
pass
class AtomChemicalEnvironment(ChemicalEnvironment):
"""Chemical environment matching a single atom.
"""
def __init__(self):
"""Initialize a chemical environment corresponding to matching a single atom.
"""
# Initialize base class
super(AtomChemicalEnvironment,self).__init__()
# Add a labeled atom corresponding to :1
# TODO
# and so on for BondChemicalEnvironment, AngleChemicalEnvironment, TorsionChemicalEnvironment
Thanks @jchodera I'll take a look at the PEP8. I'll take your suggested API as a starting point and work on this, I'll post with any clarifying questions.
OK, @bannanc , so I'm looking at these proposals in a bit more detail. It looks like a great first pass from both you and John.
Before getting to questions which have been raised, I want to give some random thoughts:
1) Since this is a total rework of everything, we probably do want to take advantage of this chance in order to build in some chemistry at the atomic level to make it easier for ourselves if possible. For example, what if Atom
has a GetElement()
method which can return the element if it is NOT already a combination of multiple types via an "or"? The reason this would be helpful is that it will allow us to do things like say that a carbon can't be connected to six things, etc.
2) Item 1 may be stretching the API above slightly. Shouldn't we also use this opportunity to build in chemistry more thoroughly? What you're calling Atom
in the above API is really AtomCombo
as it's one or more atoms which will be at a particular node in the graph. It seems like it should, itself, consist of several Atoms
, each of which has an element and is possibly decorated.
3) Right now there's a small documentation/clarity problem, as we have the Atom
(AtomCombo
in my terminology) taking:
basetypes : set of string
The base types that will be combined with logical OR
decorators : set of string
The decorators that will be combined with logical AND
but we haven't said whether the basetypes are individually being decorated by the decorators or are collectively being decorated by the decorators - i.e. if I have basetypes ['#6', '#8']
and decorators ['X2']
is that even valid, or do I need two decorators? And if it's valid, do I end up with the SMARTS [#6X2,#8]
, [#6X2,#8X2]
or something else? It seems to me this is taking us too far down the road of "I have to already know about SMARTS syntax in order to create an Atom
", which is partly why I think this should be an AtomCombo
which itself is composed of Atoms
which contain an actual chemical element, an associated individual base type, a decorator, and the names for the base type + decorator, as in item 3 above. And a Bond
is between an AtomCombo
, etc. Alternatively, if this notation is too confusing, I suppose there are some sensible alternative namings, such as Atom
and Element
for the nodes and basic building blocks, respectively, or Node
and Atom
for the nodes and basic building blocks. I actually think Atom
and Element
may make the most sense, since we do add bonds between Atoms
so it makes sense to retain that notation, even though Atoms
are combinations of things.
4) @bannanc , make sure you include the notion of which atoms are labeled in the framework.
5) Minor niggly detail, but for clarity I think we should have the method asSMARTS
be renamed asSMIRKS
since we're going to be using SMIRKS everywhere and the downstream use of these is subtly different, so reminding people that they are SMIRKS rather than SMARTS when they call the function is probably a good idea.
6) (Minor detail we can address later) In John's framework, do we also want changeBond
in addition to addAtom
and removeAtom
? It could be a way to discover derivative types without having to delete and insert atoms - i.e. if we have a [#6:1]-[#6:2](-[#8])-[#8]
then maybe [#6:1]=[#6:2](-[#8])-[#8]
will also be interesting on its own, and it could be a lot faster to get there that way.
7) Terminology: Maybe it's time to get rid of the basetypes
terminology entirely for clarity; everything will consist of elements which may get ORd together to construct more complicated patterns and there are no basetypes.
I'll get to your questions, @bannanc , momentarily in a separate comment.
@bannanc , question responses:
- should we have an optional list of basetypes or bondtypes that will be used to initialize the chemical environment object?
I think at this point the idea is to construct the objects; we'll get to initialization when we start playing with them. That said we will want to initialize with what elements we want to consider and what decorators they can have. Likewise for bond types, I suppose. Initially I wanted to say we should just build these in, but I suppose if someone wants to explore a forcefield where it only makes sense to sample across single bonds we should let them. :)
There are probably two different initializations which need to happen: 1) An initialization of everything which is within our overall scope, i.e. what elements/decorators/bonds can possibly be considered (and their names) 2) An initialization of a set which matches everything, i.e. a starting point which covers everything. Presumably for torsions, this would just be that we have a single torsion which matches X-X-X-X, and for bonds a single bond which looks like X-X, etc.
- Should the chemical Environment object read the basetype files or decorators? I think I like it being general enough that we could feed it lists instead?
You want to separate the objects from the file parsing, IMO, otherwise you lose generality. So initialization happens by passing in arguments. Elements/decorators/bonds can be read from files and then passed in, or input by human and passed in. (Here, since we know what bonds we're interested in, I see no point in reading these from a file.)
- Should we continue trying to make these have a human readable format? As we've seen the ones for H1, H2, H3 are not really understandable anymore.
I would not work that hard at this, but it could be something like outputting the element types for the labeled atoms ONLY along with connectivity info, i.e. [#1:1]-[#6:2](-[#8])=[#6:3]
is "hydrogen singly bonded to carbon doubly bonded to carbon"; maybe we should ditch the English and go with simplified SMARTS for just the labeled atoms, i.e. [#1]-[#6]=[#6]
. English gets too complicated because of grammar considerations.
In other words, I'm sort of undecided. Skip this in your initial implementation, as it will be hard to make it worthwhile, I think.
- Should bond types or bond decorators be an input?
Addressed above I think. Probably we should have these all be inputs just in case someone wants to limit sampling by narrowing them or do something else funky we don't envision.
- We will still need something similar to the sample_atomtypes method, but this seems like it should be in a sampler that will use the methods listed above to make changes?
Yes. First you have to have the way to represent everything, then you can worry about how to make changes to it, but it's a separate tool.
- Is a string the best option to indicate the connection-type?
Hmm. You could also represent this (at least for certain bond types) in your graph by having multiple connections, i.e. a double bond would be two edges connecting atoms, a single bond one, triple bond three, etc. But that seems like too much work for not much gain. I'd think you'd represent it as just a label on the edges that you do have.
- How to combine the "and" and "or" atom decorators, discussed in the next comment in more detail. List of atoms with shared decorators or decorated atoms that are listed together?
add decorators to each base type then "or" with other decorated base types. For example:
[#7&+1,#6]
which is in the OpenEye SMARTS tutorial. decorators to a list of "or'd" atoms, such as[#7,#8;-1]
which prioritizes the , between the atoms first.
I'm not sure whether there is functionally a difference. You probably need to play around with these on some examples and see. I can see it being easier to find groups like the second one by chance since then we get to decorate a group of elements at once, which is good if the decorator is charge. But I could see this being bad if the decorator is connectivity, i.e. connectivity probably only matters one element at a time.
One other thing I thought of when writing this that I didn't mention in my earlier comment which seems important. It's perhaps obvious. But, labeled atoms are always consecutive (adjacent in the graph), right? i.e. we're not going to end up with [#6:1]-[#8]-[#6:2]
, though we might have [#6:1](-[#8])-[#6:2]
. Presumably this means you take care of labeling at initialization of a particular pattern (i.e. an angle starts off with three labeled but empty atoms, like X-X-X, and then additional detail gets added by adding unlabeled branches or specifying what elements are in each position as we make copies and begin refining them).
1) and 3) I think moving to Atoms
which are made of (possibly decorated) Elements
would be fine. I don't like the key word AtomCombo
in our chemical environment a node is a single atom (who's type could be something like [#6,#7,#8]
which is carbon or nitrogen, or oxygen.
4) atoms will have a attribute atom.label that will be the number if it is labeled and None otherwise.
5) I have AsSMIRKS
in my notes, but didn't comment here. Certainly, a method for writing SMARTS strings could be available for the atomic chemical environment, but I'm not sure it would be worth the effort.
6) I included ChangeABond
in my initial set which I imagined doing something similar to this.
7) I agree with moving away from the word basetype, it's not always clear what it means.
1) and 3) I think moving to Atoms which are made of (possibly decorated) Elements would be fine. I don't like the key word AtomCombo in our chemical environment a node is a single atom (who's type could be something like [#6,#7,#8] which is carbon or nitrogen, or oxygen.
OK, yes, I agree with this 100%. (As I read further I got more and more clear why it made since to have an Atom
be an Atom
even if it was composed of an OR of atoms. But, it does make it more confusing what element is there, so MAYBE it is worth having a separate Element
container, or at least it is worth tracking what element the Atom
is unless it is OR-d.
Should we continue trying to make these have a human readable format? As we've seen the ones for H1, H2, H3 are not really understandable anymore.
simplified SMARTS for just the labeled atoms, i.e. [#1]-[#6]=[#6]
So I should probably have been more specific about why this is something to address early. Right now everything that is in .smarts files has a SMARTS and associated name. My question here also included, should I continue to incorporate these "names".
For example, if we want to maintain english we need input in the form that includes it so a set of decorators could be:
[ ['X2', 'connectivity 2'],
['H0', 'zero neighboring hydrogens'] ]
Where if we aren't tracking english names the input lists would only be of the form ['X2', 'H0']
@bannanc - I think the files should still be formatted as now since it's still useful to explain in the files what the different things mean. However, I think that we shouldn't bother constructing descriptive names as we WANT to discover patterns that are too complicated to easily and automatically describe in English.
I'm not sure whether there is functionally a difference. You probably need to play around with these on some examples and see. I can see it being easier to find groups like the second one by chance since then we get to decorate a group of elements at once, which is good if the decorator is charge. But I could see this being bad if the decorator is connectivity, i.e. connectivity probably only matters one element at a time.
I think this is a place we will need to consult with @cbayly13 we can address it when him and I are back in the office. However, I would argue charge also needs to be associated with the individual atom. For example, a +1 nitrogen and a neutral carbon should have similar angles right?
Once again, we should probably wait to discuss this, but I think we should be able to rework fairly easily regardless of what is in this first mock-up.
I wasn't able to catch up on this whole conversation, but want to argue the opposite of what @davidlmobley is arguing: What if we kept this as simple as possible---with zero, one, or more totally arbitrary SMARTS attributed allowed in the AND and OR slots----and ran with that to see how far we get?
I am worried that if we build in too much chemistry up front---however well-informed----we will MISS OUT on the opportunity to see whether it is really necessary or not. We may never have another chance to find out.
If we build in the idea that each Atom object can only match a specific element, we miss out on wildcard torsions and condemn ourselves to parameter hell right from the start. If we allow exceptions, we have to deal with exceptions, making everything more complicated.
What if we just do the minimal thing, focus on crafting the most useful decorators, and see what happens?
I'm all in favor of building in EXTENSIBILITY (eg the ability to add constraints as we find they are necessary) and FUNCTIONALITY (eg convenience functions to return atom elemental identities if needed), but these should be things we add as we find they are useful in simplifying the code, rather than chemical constraints on what we can do.
Most of the efficiency will come in good design of the proposal moves anyway.
If we build in the idea that each Atom object can only match a specific element, we miss out on wildcard torsions and condemn ourselves to parameter hell right from the start. If we allow exceptions, we have to deal with exceptions, making everything more complicated.
@jchodera , I don't think @davidlmobley was suggesting we limit the atoms to be elements, but that atoms are made up of elements. I think that this might be an interesting thing to consider later, especially if (well I think when) we find that we need to teach it more chemistry so we efficiently explore "real" chemical space.
What if we kept this as simple as possible---with zero, one, or more totally arbitrary SMARTS attributed allowed in the AND and OR slots----and ran with that to see how far we get?
I had still been thinking about this from the chemistry side, that I needed to know what was being ANDed or ORed together, but from the coding side, I will implement both list and join them with ',' or ';' in AsSMIRKs then we can make decisions later about what was going into them.
I wasn't able to catch up on this whole conversation
I'm working out of office for a week so there was a bit more on here than usual since we normally talk together and then post what we discussed to github. I think I have a good enough picture of what we want to make a draft of the API, I'll make a WIP pull request when I have something worth sharing, probably the end of next week.
@jchodera
I wasn't able to catch up on this whole conversation, but want to argue the opposite of what @davidlmobley is arguing: What if we kept this as simple as possible---with zero, one, or more totally arbitrary SMARTS attributed allowed in the AND and OR slots----and ran with that to see how far we get?
This is misunderstanding my point slightly. I'm not saying that we don't want to allow totally arbitrary combinations of things in the AND and OR slots, but we DO want the infrastructure in place so that IF AND WHEN we want to facilitate chemistry we don't have to totally rewrite the whole thing. In other words, if I have the capability to know what element is in a particular place if an element is there then it will make using chemistry a lot easier later.
Particularly, if we don't have this capability in place, then we will never be able to do things like limit a carbon atom to only making four bonds, and so on.
I am worried that if we build in too much chemistry up front---however well-informed----we will MISS OUT on the opportunity to see whether it is really necessary or not. We may never have another chance to find out.
We agree here. We don't want to build this in unless we need it. We only want the infrastructure in place so that we don't have to rewrite the whole thing again if we DO need it.
If we build in the idea that each Atom object can only match a specific element, we miss out on wildcard torsions and condemn ourselves to parameter hell right from the start. If we allow exceptions, we have to deal with exceptions, making everything more complicated.
I wasn't proposing each Atom object can only match a specific element (but maybe I wasn't sufficiently clear on this); wildcards should definitely be allowed. But, if an atom DOES contain only a specific element, I want to be able to track what element it is.
I'm all in favor of building in EXTENSIBILITY (eg the ability to add constraints as we find they are necessary) and FUNCTIONALITY (eg convenience functions to return atom elemental identities if needed), but these should be things we add as we find they are useful in simplifying the code, rather than chemical constraints on what we can do.
We're not proposing adding constraints at this point. We just want to track elemental identity when applicable (i.e. when it's not a wildcard) as all of us on this end are convinced we're going to eventually need to have some chemistry in place to avoid too big a combinatorial explosion. We don't want to build the chemistry in until that happens, and we, like you, would rather not build it in unless we have to.
Most of the efficiency will come in good design of the proposal moves anyway.
We agree with this, but that's also almost certainly going to involve an element of not proposing too many things which are chemically impossible (carbons with ten bonds, hydrogens with three bonds, etc.). Again, we absolutely plan to proceed in as simple a way as possible until we have to do otherwise, but it's fairly clear that it will.
Think of a torsion, if you will. We have four labeled atoms, and each involves an environment with alpha substituents and beta substituents. Let's not work with SMIRKS/SMARTS for a moment and just write this this way: X1 ~ X2 ~ X3 ~ X4 for the torsion Now, if each atom has up to N alpha substituents and each alpha substituent has up to N beta substituents, well, the possibilities explode pretty quickly. Let's just think of position X1 for now, and imagine we have N=4 (maximum of four connected atoms) and we have an "alphabet" of just five elements (carbon, hydrogen, oxygen, nitrogen and, let's say, sulfur). How many possible combinations are there here for just the X1 position?
Well, X1 can have N-1=3 other connected atoms, so three different alpha positions, each of which has five possibilities, so we're up to 5*(5)^3=625 possibilities at X1 just considering alpha substituents. But each alpha substituent (of which there are three) each has up to N-1=3 beta substituents, each of which has five possibilities, so considering beta substituents we pick up another (5^3)^3=15625 possibilities, so we're already at 9.7 million possible combinations at the X1 position alone. If you take (9.7e6)^4, well, you don't want to. (Update: Maybe my math here is wrong as we have nine beta substituents so I think it should be 5^9 for the number of possible beta substituents, which is 1.9e6 combinations at the beta position alone, which is even worse than I wrote. But the point is still the same.)
And this is using ONLY five undecorated elemental types and only allowing each atom to have up to four bonds, which will be impossible unless we keep track of elemental types when appropriate.
Hopefully this makes clear why we think we're going to need somewhat more chemistry eventually in order to design good move proposals and narrow the search space.
@jchodera , I don't think @davidlmobley was suggesting we limit the atoms to be elements, but that atoms are made up of elements. I think that this might be an interesting thing to consider later, especially if (well I think when) we find that we need to teach it more chemistry so we efficiently explore "real" chemical space.
Yes, atoms are made up of Elements
which are sometimes (often) specific elements, but can also be wildcards which correspond to no specific element (i.e. bonds, angles, torsions would initially all start off as wildcards which match everything).
What about cases where we want to allow several elements?
And why make elements privileged over, say, electronegative atoms? Or aromatic atoms?
@jchodera :
What about cases where we want to allow several elements?
I'm saying that the basic building blocks of an Atom
in the graph are units which themselves might or might not be these particular things which we happen to call elements. They might be "any element", or they might be "this specific element" (carbon for example), and if the latter, that implies certain rules about how they can connect to one another.
And why make elements privileged over, say, electronegative atoms? Or aromatic atoms?
I have no strong feeling about making them privileged. I'm just saying that when we HAVE info about allowed connectivity, it will narrow our search dramatically, and in many cases we DO have that information. This is not just a string language where all of the strings are interchangeable; we know a lot more about what is possible for the particular string #6
than we do for the string *
or for !#1
Would you be happier with this variant -- an Atom
is composed of one or more Unit
objects; if more than one, they are OR'd together. A Unit
object carries an .MaxConnections
which can be an integer, or it can be None
, meaning that it can have any number of connections.
Allowed Unit
s could be defined rather similarly to existing basetypes
, but they also would include *
which represents any atom (and has .MaxConnections = None
) and they could also include inverses such as !#1
, again with .MaxConnections = None.
Does that help it be more palatable? Again, I am NOT saying that we want to use this information NOW, I'm just saying that if we have the framework for it in place, it will make it a lot easier to use this info when (or in your view, if) we need it.
Again, we're totally agreed we don't want to use this info unless we need it. It's just that we believe we're going to need it, so if we're writing this infrastructure now we might as well make it so it can accommodate connectivity info.
I don't think we should be trying to optimize this until we know it is going to be a problem. I don't believe we yet have evidence this will be a problem.
Also, in the specific example you give regarding the maximum number of connections, this is totally unnecessary---the additional infrastructure you propose to add will just add technical debt but won't make it more efficient. If the proposal scheme adds more connections than a node can support (e.g. because the node is labeled with #6
and a fifth connection is proposed), it will be rejected because it does not type any molecules in the set. This may take a few milliseconds longer than checking against the computed .MaxConnections
, but this is likely not the rate-limiting step right now, and the simple rule (rejecting because it does not type any molecules in the set) gives us the same behavior you suggest from the more complex system.
I think we should keep it as simple as possible and only aim for more complexity if it is going to bring us order-of-magnitude speedups.
@jchodera : I was thinking about this a lot to try and figure out why you are so convinced this is not an issue (i.e. a few milliseconds delay over each of a billion attempts, which is the ballpark we'll be in when we start looking at atom typing/chemical environment exploration for substantial numbers of real molecules, is a pretty significant amount of time) and I think I finally get what you're saying. It's not that inefficiency here is no problem, but it's that (once this is hooked to a parameter exploration framework) any amount of computer time we use sampling over SMARTS/SMIRKS to find potentially interesting new patterns is going to be basically negligible compared to what happens after we DO propose a new SMARTS/SMIRKS and associated parameters - specifically, we'll be recomputing some set of properties which will certainly involve substantial computer time. Even if only 1 in 100 or 1 in 1000 SMIRKS proposals types anything, then we'll go off and run a rather expensive set of calculations on the result and any expense of generating the proposal will be tiny compared to the cost of evaluating whether introducing a new chemical environment and parameters is worthwhile.
Is that what you're saying?
I think we should keep it as simple as possible and only aim for more complexity if it is going to bring us order-of-magnitude speedups.
I do think there is at least an order of magnitude in SMIRKS sampling speed (which may be irrelevant) to be had here if we track the number of possible connections (and things like which decorators can go with which element - i.e. a variety of decorators don't make sense for hydrogen), because the number of possible SMIRKS patterns which are valid but make no chemical sense is SO VAST compared to the number of patterns which actually make chemical sense (the difference is more than an order of magnitude certainly). But as noted above, once SMIRKS sampling is coupled to parameter exploration we're going to spend a ton of CPU time there so SMIRKS sampling speed just may not matter.
Is that what you're saying?
Yes!
I do think there is at least an order of magnitude in SMIRKS sampling speed (which may be irrelevant) to be had here if we track the number of possible connections (and things like which decorators can go with which element - i.e. a variety of decorators don't make sense for hydrogen),
I think there are lots of things we can do to speed things up! But there may be a simpler approach that is more broadly applicable than focusing on whether and which elements imply limits on numbers of edges incident to each node.
That's something we can certainly add---reject immediately if more edges than are chemically feasible are present---but we may be able to make this logic more easily extensible to build in more sophisticated fast rejection.
As an example, the current SMARTY will keep track of atom types that don't match anything in the dataset and rapidly reject if one of these is proposed in the future. This is a very simple rule that learns a lot about what kinds of chemistries are not possible (at least in our dataset).
We could probably come up with a more extensible way to "plug in" multiple evaluations that quickly evaluate our chemical graph before going on to type molecules---allowing the incident edges thing to be one of these---but for now, I don't think it's going to be rate limiting since the SMIRKS matching is pretty fast compared to any property computations or even the maximum subgraph computation.
OK, so I think I agree that it doesn't make sense to do much/anything about this now. Except I still worry about making it difficult for ourselves later.
Is there anything you think it would make sense to put into the API to make it easier to deal with these issues in an elegant way in the future if we need to without adding "technical debt" now? I still have some reservations about thinking of the basic building block in our graph (the things we use to build up an Atom
) as just a string rather than as something that might be a chemical object (an element, a wildcard matching any element, or a wildcard matching a subset of elements). Is there a way we could get the best of both worlds?
As an example, the current SMARTY will keep track of atom types that don't match anything in the dataset and rapidly reject if one of these is proposed in the future. This is a very simple rule that learns a lot about what kinds of chemistries are not possible (at least in our dataset).
On this - do you have a sense of how well lists (which are what are currently used for this) handle memory? I've been a bit worried that these will blow up at some point. Certainly it seems clear we wouldn't want to store a list of what torsions don't match any torsions in our set, as that would surely cause memory problems, but I'm not sure how much of a concern this could get to be for strings for individual Atoms
(potentially decorated ORs of pattern+decorator). There are still a lot of combinations there, but I don't have any sense of how efficient/inefficient lists are in terms of memory usage.
I still have some reservations about thinking of the basic building block in our graph (the things we use to build up an Atom) as just a string rather than as something that might be a chemical object (an element, a wildcard matching any element, or a wildcard matching a subset of elements). Is there a way we could get the best of both worlds?
As I understand it, an Atom
holds a set
of strings that are ORed together (base types), and this set gets ANDed with another set
of strings that are ANDed together themselves (decorators). I think that may be more complicated than we need, but the SMARTS patterns in @cbayly13's slides suggested this complexity was necessary. The Atom
then has a atom.get_SMARTS()
function that compiles these into a SMARTS/SMIRKS string, and the whole chemical graph has a environment.get_SMARTS()
function that assembles the whole pattern.
What if we do the following?
is_valid()
to the graph that calls any registered methods ti check to see if this can be rejected quickly.Atom
objects that simplify the above checks (e.g. if there is a limit on the number of edges because an element is specified, return that; otherwise None
or some large number)This would allow us to add a whole host of methods to the chemical graph environment to try out various simple ideas for adding chemical intuition to speed things up.
On this - do you have a sense of how well lists (which are what are currently used for this) handle memory?
set
uses a hashing scheme to deliver O(1) performance for checking if something is a member of the set.
If the set
gets larger than memory can handle, we can start to be more clever about handling things, or migrate to a data store that uses fast lookups to disk.
I've been trying to follow this back and forth. So far, I like @jchodera 's idea of an is_valid() method for moves in the chemicalEnvironment class that for now can just always return True. It would allow us to build in more complexity later.
@davidlmobley currently the atom class is very simple, basically just things that can be AND'd or OR'd into a SMARTS string and it stores the label number if there is one. I am not convinced that adding an extra class is necessary for elements. I'm thinking it will add too much complexity. When (If) we decide we need information about the element numbers I think we can build it into the Atom class instead.
The current conundrum I'm facing is what to call the things we are ANDing and ORing together in the Atom class, right now they are called "basetypes" and "decorators", but since we want to move away from the word basetype I sort of want something different here. Also, I think there is a chance that the things being OR'd together might need the option to be decorated elements. What do you guys think?
@bannanc - I'm quite sick today so responses are slow. Sorry.
I've been trying to follow this back and forth. So far, I like @jchodera 's idea of an is_valid() method for moves in the chemicalEnvironment class that for now can just always return True. It would allow us to build in more complexity later.
Yes, I like this a lot. It will allow us to extend in ways that make sense later if we want to, and we don't have to have extra baggage now. So that's what we should be shooting for I think.
@davidlmobley currently the atom class is very simple, basically just things that can be AND'd or OR'd into a SMARTS string and it stores the label number if there is one. I am not convinced that adding an extra class is necessary for elements. I'm thinking it will add too much complexity. When (If) we decide we need information about the element numbers I think we can build it into the Atom class instead.
Yes, I think I'm on board with this now. One other point about this (tangentially) is that as we move to bonds, angles, and torsions we'll be using wildcards a lot more often than we were for "atom types". (But we should probably also run, as a check, the atom type tests we have now if we start everything with a wildcard such as *
).
The current conundrum I'm facing is what to call the things we are ANDing and ORing together in the Atom class, right now they are called "basetypes" and "decorators", but since we want to move away from the word basetype I sort of want something different here. Also, I think there is a chance that the things being OR'd together might need the option to be decorated elements. What do you guys think?
Yes, I don't like the "basetypes" name here partly because it imagines a particular usage and suggests the types are special somehow. Remember these are just representing atoms. So maybe we should just be calling them "atom descriptor" plus "decorator" (so an Atom
would be a combination of atom descriptors and decorators). Or "building block" plus decorator.
And yes, as noted above there are two possible OR routes we might go if I remember correctly, where we either OR together decorated descriptors, or we OR together descriptors and decorate them. I think this can only be resolved by talking to Christopher, so for now I'd just implement your favorite of these options (i.e. the one we use the most in AlkEthOH's SMIRFF implementation would be a good choice) and we can switch or implement both if needed after talking to Chris.
I'm quite sick today so responses are slow. Sorry.
I hope you're feeling better, but don't feel pressured to get back to us.
start everything with a wildcard such as *
At the moment, I've set it so that if the "basetypes" set is empty then a *
is used for the atom when converting to SMARTS/SMIRKS, but isn't actually stored in the set of strings.
(i.e. the one we use the most in AlkEthOH's SMIRFF implementation would be a good choice)
We can get AlkEthOH with just neighboring atoms and not descriptors, sine there aren't different types of connections or aromaticity for any of the atoms yet.
Closed by #72
We've realized that to go past AlkEthOH without trying to handle WAY TOO MANY special cases in construction of decorator proposals, we are going to need to move past treating SMARTS as simple strings to have the SMARTS proposal engine actually understand these. And since we have to move to SMIRKS anyway, we should really be doing this for SMIRKS and not for SMARTS.
Thus, @cbayly13 is attempting to plan the infrastructure that is going to be needed for this.
I'll update this issue (or one of us will) as we have more details.