mosdef-hub / foyer

A package for atom-typing as well as applying and disseminating forcefields
https://foyer.mosdef.org
MIT License

Plugin system for grammar styles #377

Open justinGilmer opened 3 years ago

justinGilmer commented 3 years ago

Describe the behavior you would like added to Foyer

After discussions with @umesh-timalsina and @daico007, it occurred to me that we might be able to include various grammars as plug-ins for foyer to use when graph matching.

Describe the solution you'd like

Assuming the various chemical perception grammars could be represented as some type of compatible grammar for the lark-parser, this could enable different types of graph matching like SMIRKS, SMARTS, etc.
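One way to picture the plug-in idea is a small registry that maps a grammar name to a parser callable. This is only a sketch under stated assumptions: `register_grammar`, `get_parser`, and the placeholder parsers are hypothetical names, not part of foyer, and a real plug-in would presumably build a lark parser from its own grammar instead of returning a tuple.

```python
# Hypothetical registry of grammar plug-ins; none of these names exist in foyer.
_GRAMMARS = {}


def register_grammar(name):
    """Decorator that registers a parser callable under a grammar name."""
    def decorator(parser):
        if name in _GRAMMARS:
            raise ValueError(f"grammar {name!r} is already registered")
        _GRAMMARS[name] = parser
        return parser
    return decorator


def get_parser(name):
    """Look up a registered parser, failing loudly for unknown grammars."""
    try:
        return _GRAMMARS[name]
    except KeyError:
        raise ValueError(
            f"unknown grammar {name!r}; available: {sorted(_GRAMMARS)}"
        ) from None


@register_grammar("smarts")
def parse_smarts(pattern):
    # Placeholder: a real plug-in would parse the pattern with its grammar
    # and return a parse tree for graph matching.
    return ("smarts", pattern)


@register_grammar("smirks")
def parse_smirks(pattern):
    # Placeholder, same caveat as above.
    return ("smirks", pattern)
```

The graph-matching code would then call `get_parser("smarts")` (or whichever style the forcefield declares) without knowing which grammars are installed.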

Describe alternatives you've considered

N/A

Additional context

This would be a tremendous amount of work, and is not ready for actual implementation until much later, after #358. This will also require expert-level domain knowledge of grammar parsing and development.

bkpgh commented 3 years ago

I've been working on something that overlaps with this issue for the past few days. Because of some future goals I have with FF in foyer, I wanted to be able to use a broader set of SMARTS features than is currently implemented. I played with generalizing the GRAMMAR, but didn't think myself up to making those changes and all the changes elsewhere in the code that would be needed. Instead, focusing on atom types for chemical elements (not non-element '_' atoms), I just use a boolean switch in the call to FF.apply to select some new functionality and pass slightly modified SMARTS strings from the forcefield directly to rdkit to use its SMARTS substructure matching. The experimental code I have so far seems to work and can type all of the test OPLSAA molecules. The benefit is immediate access to almost the entire SMARTS grammar. For example, this:

<Type name="opls_145" class="CA" element="C" mass="12.01100" def="[C;X3;r6]1[C;X3;r6][C;X3;r6][C;X3;r6][C;X3;r6][C;X3;r6]1" overrides="opls_141,opls_142" doi="10.1021/ja9621760"/>

can become:

<Type name="opls_145" class="CA" element="C" mass="12.01100" def="[c]1ccccc1" overrides="opls_141,opls_142" doi="10.1021/ja9621760"/>

This works fine to type benzene. Definitions based on other atom types also work. For example, the definition for the aromatic H on carbon:

<Type name="opls_146" class="HA" element="H" mass="1.00800" def="[H][c;%opls_145]" overrides="opls_144" desc="benzene H" doi="10.1021/ja9621760"/>

works because "recursive SMARTS" is used: the above def is internally converted to [#1][c;$([c]1ccccc1)].
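The conversion described above can be sketched with a small string rewrite. This is an illustration only, not the experimental code from this comment: `to_recursive_smarts` and `TYPE_REF` are hypothetical names, and the sketch assumes that type references look like `%name` and that a bare `[H]` is the only explicit-hydrogen form that needs rewriting.

```python
import re

# Matches a foyer-style reference to another atom type, e.g. "%opls_145".
TYPE_REF = re.compile(r"%([A-Za-z0-9_]+)")


def to_recursive_smarts(definition, type_defs, _seen=()):
    """Expand %type references into recursive SMARTS and rewrite [H] as [#1].

    `type_defs` maps atom-type names to their SMARTS definitions.
    """
    def expand(match):
        name = match.group(1)
        if name in _seen:
            raise ValueError(f"circular reference to atom type {name!r}")
        if name not in type_defs:
            raise KeyError(f"no definition for atom type {name!r}")
        # Expand nested references first, then wrap the result in $(...).
        inner = to_recursive_smarts(type_defs[name], type_defs, _seen + (name,))
        return f"$({inner})"

    converted = TYPE_REF.sub(expand, definition)
    # Assumption: only the bare bracket form [H] needs rewriting.
    return converted.replace("[H]", "[#1]")


defs = {"opls_145": "[c]1ccccc1"}
print(to_recursive_smarts("[H][c;%opls_145]", defs))
# prints [#1][c;$([c]1ccccc1)]
```

Definitions that reference each other in a loop would recurse forever in a naive rewrite, so the sketch tracks the names it has already expanded and raises instead.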

The current defs in oplsaa.xml continue to work when passed through the new system. As evidenced in the above aromatic H definition ([H] --> [#1]), I had to make some ad hoc modifications to get the non-SMARTS-standard (or at least non-rdkit-SMARTS-standard) use of explicit H's to play nicely with the rdkit implementation.

The current code is experimental, non-optimized, kludgy, without proper Exceptions or much validation, etc., but it seems to work just fine. Apart from handling the boolean that turns this feature on/off, the only code changes are in atomtyper.py, and those are mostly additions. Of course, one must turn off FF validation if new definitions are to be used. The SMARTSGraph is replaced with a simple object that just holds the smarts_string, typemap, etc. and has a simple find_matches method that builds an rdkit molecule and calls rdkit substructure matching. I have an idea about how this might possibly be extended to non-element atoms, but that is neither implemented nor tested.
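A minimal sketch of what such a replacement object might look like follows. It is not the experimental code itself: the class name `RDKitSMARTSMatcher` and its fields are hypothetical, and for brevity it builds the RDKit molecule from a SMILES string, whereas foyer would build it from the topology being typed. The RDKit calls used (`Chem.MolFromSmiles`, `Chem.AddHs`, `Chem.MolFromSmarts`, `GetSubstructMatches`) are standard RDKit API.

```python
from dataclasses import dataclass


@dataclass
class RDKitSMARTSMatcher:
    """Hypothetical stand-in for SMARTSGraph that delegates matching to RDKit."""

    smarts_string: str
    name: str = ""
    overrides: frozenset = frozenset()

    def find_matches(self, smiles):
        """Return sorted indices of atoms that match the first SMARTS atom."""
        # Imported lazily so code that never enables this feature
        # does not need RDKit installed.
        from rdkit import Chem

        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError(f"could not parse SMILES {smiles!r}")
        # Make hydrogens explicit so patterns such as [#1][c] can match them.
        mol = Chem.AddHs(mol)

        pattern = Chem.MolFromSmarts(self.smarts_string)
        if pattern is None:
            raise ValueError(f"could not parse SMARTS {self.smarts_string!r}")

        # uniquify=False keeps symmetry-equivalent matches, so every atom that
        # can play the role of the first pattern atom is reported.
        matches = mol.GetSubstructMatches(pattern, uniquify=False)
        return sorted({match[0] for match in matches})
```

With RDKit installed, `RDKitSMARTSMatcher("[#1][c;$([c]1ccccc1)]", name="opls_146").find_matches("c1ccccc1")` should return the indices of the six benzene hydrogens. The `uniquify=False` argument matters for ring patterns: with the default, a symmetric match such as `[c]1ccccc1` on benzene is reported once, so only one carbon would be seen as the first atom.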

It seems that something like this approach could be a valuable expansion of foyer capabilities without the work involved in an expanded GRAMMAR. Let me know if this is of interest to the developers.