openforcefield / smarty

Chemical perception tree automated exploration tool.
http://openforcefield.org
MIT License
19 stars 8 forks source link

Chemical Environments handling replacement strings incorrectly #248

Closed bannanc closed 7 years ago

bannanc commented 7 years ago

I'm not sure why this hasn't come up before. I opened the error output from one of my recent SMIRKY simulations and found a number of the following warnings:

Warning: Unable to parse SMARTS
Warning: [$([$([#9,#17]),$([#35,#53])])&;!X4:1]~[$([#7!-1,#8!-1,#16!-1,$([$([#9,#17]),$([#35,#53])])])&r3,#7X3:2]

I know this is a complex looking smirks pattern, specifically the warning is pointing at the & in the first atom [$([$([#9,#17]),$([#35,#53])])&;!X4:1]. The sort of crazy looking $([$([#9,#17]),$([#35,#53])]) corresponds to the replacement $halogens.

That '&' is added in chemical environments when there are ORdecorators for a "replacement" ORBase (such as $ewg1 or $halogen). Most of the time ORdecorators are "AND'd" to an ORBase implicitly, for example [#7X3+0:1] has an ORBase #7 and a list of ORdecorators ['X3', '+0']. Here there is an implicit & between the #7 and the X3 and before the +0. When you have a replacement string you can't use an implicit & because it won't recognize something like $ewg1X1 but it will recognize $ewg1&X1.

I think there is problem in Chemical Environments I'll try to get a pull request open later today with a suggested fix.

davidlmobley commented 7 years ago

Probably we weren't using the $ewg macro until recently so we didn't run into this...?

Thanks, @bannanc .

bannanc commented 7 years ago

We were always using $ewg1 and $ewg2 in smirky. I think what causes this problem is a case where an ORdecorator is set as "" which is used to indicate no decorator. I thought the Chemical Environments were handling that case correctly (that is removing blank decorators), but I guess not.

In testing this, I've run into a few other cases that they don't quite handle the way I would like. Right now, when creating an environment you can give it a list of replacements. This was intended so SMIRKS patterns could be tested as valid even if they had replacement strings (i.e. $ewg1). I hadn't tested this thoroughly.

I would like to update the SMIRKS parsing to better handle replacement strings. My feeling is that if a user is using something like $ewg1 they are thinking of that as one piece (typically an ORbase but I guess you could make one that was a decorator too). I think it would be preferable for Chemical Environments to keep these together. So if you gave Chemical Environments the following SMIRKS and replacement, it would be able to know that the SMIRKS pattern is valid, but wouldn't break up the string starting with a $.

s = '[$ewg1&X1+0:1]'
r = [  ('ewg1', '[#7!-1,#8!-1,#16!-1,#9,#17,#35,#53]')
env = ChemicalEnvironment(smirks = s, replacements = r)
env.isValid() * # returns True using the replacements

a1 = Env.selectAtom(1) # gets atom with index 1
a1.getORtypes() # returns ORtypes for atom 1 with the form [ ('$ewg', ['X1', '+0'] ) ] 

The alternative I suppose is to include parsing the smirks after replacing the ewg1 so instead you would get separated ORtypes for atom1 would have the form:

[ ('#7', ['!-1', 'X1', '+0'] ),  ('#8', ['!-1', 'X1', '+0'] ),  ('#16', ['!-1', 'X1', '+0'] ),  ('#9', ['X1', '+0'] ), 
('#17', ['X1', '+0'] ), ('#35', ['X1', '+0'] ), ('#53', ['X1', '+0'] ) ]

My feeling is that if a user is using the replacement strings, they are thinking of them as a group not as the separate pieces.

Right now the parsing doesn't quite handle the $ewg1 type input correctly either way so I'm working on updating environments and performing more inclusive tests with them. I'll most likely also update the test script.

* - Note isValid doesn't exist right now, I'm going to include it in the pull request for this issue.

@davidlmobley when you have time I'd like your feedback here, but I will continue working on fixing the parsing so it isn't urgent.

davidlmobley commented 7 years ago

@bannanc - I'm not sure if I totally understand this part:

I would like to update the SMIRKS parsing to better handle replacement strings. My feeling is that if a user is using something like $ewg1 they are thinking of that as one piece (typically an ORbase but I guess you could make one that was a decorator too). I think it would be preferable for Chemical Environments to keep these together. So if you gave Chemical Environments the following SMIRKS and replacement, it would be able to know that the SMIRKS pattern is valid, but wouldn't break up the string starting with a $.

(and the example which follows). You haven't explained in your example what r is there - is that just explaining what $ewg represents?

If I understand right, you're asking whether I agree that the way of parsing as illustrated in the first box of your comment is what we would expect/want the behavior to be? I agree that the first box makes more sense to me than the second and is much more compact. Though, it's possible that long term we might want an option of expanding these out, e.g.: a1.getORtypes() # returns ORtypes for atom 1 with the form [ ('$ewg', ['X1', '+0'] ) ] but a1.getORtypes(expand=True) # returns ORtypes for atom 1 with the form [ ('#7', ['!-1', 'X1', '+0'] ), ('#8', ['!-1', 'X1', '+0'] ), ('#16', ['!-1', 'X1', '+0'] ), ('#9', ['X1', '+0'] ), ('#17', ['X1', '+0'] ), ('#35', ['X1', '+0'] ), ('#53', ['X1', '+0'] ) ]

But, honestly, this doesn't seem like a functionally important difference to me (they're going to mean the same thing!) so I don't have that strong an opinion on it.

Note also we should probably move discussion of these issues to the openforcefield repo, as per discussion elsewhere.

bannanc commented 7 years ago

@davidlmobley I didn't put a comment describing r because I thought it would be obvious from the environment call that the r is a list of replacements/substitutions.

a1.getORtypes(expand=True) ...

Doing this as a separate option doesn't actually make much sense to me, and would make the internal code on environment much more complicated I think. The get function getORtypes is just a getter (Christopher's request). It returns the same thing as a1.ORtypes which stores the already parsed SMIRKS string.

I'm guessing that as the future of SMIRNOFF goes forward we won't actually want to use the substitution/replacement strings because the forcefield functionality doesn't support them. If we did use them you would have to include something in the ffxml files that told the forcefield scripts how to parse them which we don't right now.

I'm going to continue with the plan to support patterns like $ewg1 in environments, but to not separate them into their components, I think its unnecessarily complicated.

When this code actually works all the way I'll move everything to the openforcefield repo.


The other parsing issue has to do with environment assuming all atoms are specified with square brackets. However, SMIRKS/SMARTS build on smiles strings so they can also use elemental symbols. For example, this is a reasonable smirks string for an H1:

[#1:1]-C-$ewg1

Right now (before pull request) the environment would look for atoms inside square brackets [stuff] and would only find the hydrogen and would loose all the other information.

I am adding some sort of weird RegEx patterns that fix this problem so that it can correctly parse the carbon and $ewg1. It still has a few bugs, but I should be able to get them ironed out in the morning.

davidlmobley commented 7 years ago

@bannanc :

Doing this as a separate option doesn't actually make much sense to me, and would make the internal code on environment much more complicated I think. The get function getORtypes is just a getter (Christopher's request). It returns the same thing as a1.ORtypes which stores the already parsed SMIRKS string.

OK, this makes sense to me. We don't want that, so we should just proceed without separating it out.

I am adding some sort of weird RegEx patterns that fix this problem so that it can correctly parse the carbon and $ewg1. It still has a few bugs, but I should be able to get them ironed out in the morning.

THis sounds good. Thanks.

bannanc commented 7 years ago

This is all handled in openforcefield PR #7

When issue #244 is addressed this issue will also be closed.

bannanc commented 7 years ago

Fixed in openforcefield PR #7