openforcefield / smarty

Chemical perception tree automated exploration tool.
http://openforcefield.org
MIT License
19 stars 8 forks source link

How to treat base types in smarty #222

Closed bannanc closed 7 years ago

bannanc commented 7 years ago

We've realized one reason smarty cannot get 100% on DrugBank is due to the inability to "OR" element numbers together. This stops us from finding electron withdrawing groups for example. My first instinct is to put $ewg groups into the base types as those are groups of atoms we have decided are worth keeping grouped together. These are already in the substitutions SMARTS file.

However simply including $ewg in the input file won't be enough. We only keep populated basetypes. We do this by assigning the basetypes to the molecules in the same way the parameters are assigned. The hierarchy means you can't have two base or initial types that match the same atom.

This means if you had the input list:

[#1]      H
[#6]      C
[#8]      O
$ewg1 ewg

where $ewg1 is [#7!-1,#8!-1,#16!-1,#9,#17,#35,#53] The electron withdrawing group would overwrite oxygen. A similar issue came up where it would be worthwhile to have [*] as a base type. That will allow things like [#6$(=)] as a bonding option.

I think the easiest fix for this is to check if base types can type an atom without overwriting, I do something similar in smirky and will add a code snippet later today.

davidlmobley commented 7 years ago

@bannanc - I'm not totally clear on this phrasing:

I think the easiest fix for this is to check if base types can type an atom without overwriting,...

By "type without overwriting", you mean just check if base types can type anything, without checking to see whether they would empty the other base types, right?

Maybe there should be a warning printed or logged in this case, though, as in some cases this will clearly be a good thing (e.g. as in the case of macros) but in others it would be a mistake.

bannanc commented 7 years ago

Sorry, I should have been more clear here.

"type" was the wrong word. I think we should check if a SMARTS pattern matches any atoms in the set. This is different from assigning a type each atom in the set.

Maybe there should be a warning printed or logged in this case, though, as in some cases this will clearly be a good thing (e.g. as in the case of macros) but in others it would be a mistake.

This could be good, thought it will require a more complicated code solution I think.

In existing methods we have two ways of checking smarts matching:

  1. the atomtyper method AtomTyper.assignTypes(molecules, element) this assigns each atom StringData at the typetag matching the current atom type name.

  2. I have a code snippet in smirky that keeps any SMIRKS pattern that matches any set of atoms, it has no memory of if the same set of atoms have matched another pattern before.

bannanc commented 7 years ago

@camizanette

Let me know if you need me to point to specific code snippets

jchodera commented 7 years ago

Shouldn't "electron withdrawing group" be a decorator, rather than a base type?

bannanc commented 7 years ago

@jchodera

Shouldn't "electron withdrawing group" be a decorator, rather than a base type?

Currently SMARTY essentially have two options, add a decorator to the current atomtype or bind it to a new alpha or beta substituent atom. I suppose we could treat "singly bound to an electron group" as a decorator... In that case the decorator would have to be $(*-$ewg).

@davidlmobley and @camizanette what are your thoughts here.

I will admit that every time these issues come up it seems like it would just be easier to use the Chemical Environment objects inside SMARTY the same way we use them in SMIRKY. Then it could add basetypes by ORing them. However, this would require a complete rework of SMARTY.

davidlmobley commented 7 years ago

@bannanc - I don't have a strong preference one way or the other, though it seems simpler to me to just think of it as a base type here.

I agree that it would be easier to use the Chemical Environment objects, but since smarty is close to dying anyway (that is to say, it only exists til we finish this paper then everything moves in the direction of smirky) so it seems unnecessary to me. SMIRKY already has chemistry built in and samples in a better way and, effectively, SMARTY is a subset of SMIRKY except that SMIRKY uses SMIRKS. So it just seems like a lot of wasted effort to me...

bannanc commented 7 years ago

I agree that it would be easier to use the Chemical Environment objects, but since smarty is close to dying anyway (that is to say, it only exists til we finish this paper then everything moves in the direction of smirky) so it seems unnecessary to me. SMIRKY already has chemistry built in and samples in a better way and, effectively, SMARTY is a subset of SMIRKY except that SMIRKY uses SMIRKS. So it just seems like a lot of wasted effort to me...

That's my feeling.

I'm also concerned that treating "bonded to a..." as a decorator will require making more changes inside the smarty string manipulations so I would lean more toward treating it as a base type instead.

davidlmobley commented 7 years ago

I'm also concerned that treating "bonded to a..." as a decorator will require making more changes inside the smarty string manipulations so I would lean more toward treating it as a base type instead.

Yes, totally agree. The string manipulations here are not easy (especially handling of alpha and beta substituents) and I worry this would send of us off into regex hell.

bannanc commented 7 years ago

@camizanette The other thing to think about here is we also add basetypes to the initial types so we might want to think about how that is done. Specifically, I think if we're going to put basetypes into the initial types list they need to go in at the beginning and not the end so they can be overwritten by user specified initial types. I don't think this is something we thought about before with regard to adding basetypes to the initial types.

bannanc commented 7 years ago

Checking if a SMARTS pattern is in a molecule. I realized the section in SMIRKY uses a few things that don't necessarily need to be transferred, so here is the method I would suggest adding:

def smarts_matches(self, smarts):
    """
    This method returns true if the provided SMARTS pattern is in 
    at least one molecule
    Parameters
    ----------
    smarts: str, SMARTS pattern
    Returns
    -------
    matched: boolean, True=smarts matches a molecule, False has no matches
    """
    # Create substitution list 
    substitutions = list()
    for [smarts,shortname] in self.replacements:
         substitutions.append( (shortname, smarts) )

    # Update input smarts with substitutions
    smarts = OESmartsLexReplace(input_smarts, substitutions)

    # create query
    qmol = OEQMol()
    if not OEParseSmarts(qmol, smarts):
        raise Exception("Error parsing SMARTS %s" % smarts)
    ss = OESubSearch(qmol)
    for mol in self.molecules:
        if ss.SingleMatch(mol):
            return True
    return False

Then in the section where we check if a basetype matches the molecules you could call this instead. Note - self.replacements in the smarty input are stored in a list of tuples (SMARTS, shortname), but OESmartsLexReplace expects a list in the opposite direction (shortname, SMARTS).

bannanc commented 7 years ago

@camizanette That ^ is what I would do

camizanette commented 7 years ago

Thanks, @bannanc

bannanc commented 7 years ago

Please see discussion in pull request #234 for more details about why this is a problem and how to fix it.

@camizanette I updated the code snippet method above with how to use the replacements list.

bannanc commented 7 years ago

Closed with #234