openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
319 stars 93 forks source link

[Request] Easier selection of tookit registries #966

Open lilyminium opened 3 years ago

lilyminium commented 3 years ago

Is your feature request related to a problem? Please describe.

The toolkit allows you to select which toolkit_registry you would like to use for individual methods, but as soon as we get into more advanced functions that call these methods it is quite difficult to choose your toolkit. To give an example case, I wanted to create_openmm_system. Using the OpenEye toolkit to match smarts was hanging, whereas using RDKit worked. However, the function call goes create_force -> find_matches -> _find_matches -> chemical_environment_matches so my only options for modifying the toolkit used by chemical_environment_matches were:

  1. Modify the GLOBAL_TOOLKIT_REGISTRY (except this is used by other functions that I did not want to change, e.g. partial charge calculation)
  2. Go into the code and change that part of it -- not transferrable and not very accessible to novice programmers
  3. (what I ended up doing) replacing chemical_environment_matches as below, also not very accessible to novice programmers.
chemical_environment_matches = Topology.chemical_environment_matches

RDKIT_TOP_REGISTRY = ToolkitRegistry(
    toolkit_precedence=[
        RDKitToolkitWrapper,
        OpenEyeToolkitWrapper,
        AmberToolsToolkitWrapper,
        BuiltInToolkitWrapper,
    ],
    exception_if_unavailable=False,
)

def new_chemical_environment_matches(*args, **kwargs):
    toolkit_registry = kwargs.get("toolkit_registry", RDKIT_TOP_REGISTRY)
    kwargs["toolkit_registry"] = toolkit_registry
    return chemical_environment_matches(*args, **kwargs)

Topology.chemical_environment_matches = new_chemical_environment_matches

Describe the solution you'd like A clear and concise description of what you want to happen.

Pass keyword arguments such as toolkit_registry through the function call chain.

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context or screenshots about the feature request here.

mattwthompson commented 3 years ago

the function call goes create_force -> find_matches -> _find_matches -> chemical_environment_matches

It looks like this chain was broken pretty early, since **kwargs stops being passed through once find_matches is called? I don't know what the best pattern is, but the current design of passing a toolkit registry around does kinda fall apart if somebody wanted to use different toolkits for different steps (i.e. here you want to use OpenEye for charge assignment but RDKit for SMARTS matching?).

Pass keyword arguments such as toolkit_registry through the function call chain.

In this case, you wouldn't be able to pick which toolkit is used for what - is that okay?

Related #493 (maybe #530 and/or #158)

lilyminium commented 3 years ago

@mattwthompson Thanks for the links to earlier issues.

In this case, you wouldn't be able to pick which toolkit is used for what - is that okay?

I realize that the keyword argument wouldn't have helped me, but I also think my scenario is (hopefully) rare -- I have no idea why OpenEye doesn't want to match my SMARTS :(. At least passing keyword arguments solves some number of cases with minimal effort.

Real mix and matching sounds difficult to implement.

mattwthompson commented 3 years ago

Yes, I could have done a better job of separating out two ideas here -

  1. Passing a toolkit registry to create_openmm_system does not guarantee that that registry is used in the internal calls, down all the way to chemical_environment_matches. This seems possible to implement, but might run into weird behavior with all the complexity of these function chains?
  2. There's no way to mix-and-match which toolkits are used in different parts of the function chains. This does seem much harder to implement with the current design.
lilyminium commented 3 years ago

Yeah, sounds about right.

I guess if you really wanted to have fun you could make separate default registries for each method. CHEMICAL_ENVIRONMENT_MATCHES_TOOLKIT_REGISTRY, ASSIGN_PARTIAL_CHARGES_TOOLKIT_REGISTRY, ... 🎊

SimonBoothroyd commented 3 years ago

+1 for this request. I mostly just use a custom context manager when dealing with these cases now, but it would be good to have a better solution built into the toolkit directly.

I guess if you really wanted to have fun you could make separate default registries for each method. CHEMICAL_ENVIRONMENT_MATCHES_TOOLKIT_REGISTRY, ASSIGN_PARTIAL_CHARGES_TOOLKIT_REGISTRY, ... 🎊

This is along the lines of something I've been wanting to suggest for a while. You could essentially have a set of base registry mix-in classes:

class ToolkitChargeProvider(ToolkitMixin, abc.ABC):

    @classmethod
    @abc.abstractmethod
    def compute_partial_charges(cls, molecule: Molecule, ...) -> unit.Quantity
        ...

class ToolkitConformerProvider(ToolkitMixin, abc.ABC):

    @classmethod
    @abc.abstractmethod
    def generate_conformers(cls, molecule: Molecule, ...) -> List[unit.Quantity]
        ...

...

and then the individual toolkits wrapper would then just inherit from the mix-ins which it could support:

class OpenEyeToolkitWrapper(ToolkitChargeProvider, ToolkitConformerProvider)

    @classmethod
    def compute_partial_charges(cls, molecule: Molecule, ...) -> unit.Quantity
        ...

    @classmethod
    def generate_conformers(cls, molecule: Molecule, ...) -> List[unit.Quantity]
        ...

in this way registries can both be queried for which features they can provide (i.e. isinstance(oe_tk_registry, ToolkitConformerProvider) = True, isinstance(at_tk_registry, ToolkitConformerProvider) = False), and it would also allow you to specify what TK implementation you want to use by default for specific feature sets like you mention @lilyminium

SimonBoothroyd commented 3 years ago

+1 for also adding the ability to set the defaults either by a config file or environmental variables.

adalke commented 3 years ago

I'll propose something a bit more radical than @SimonBoothroyd 's context manager.

I think there are two ideas in the OpenFF toolkit - one is "a wrapper to a chemistry toolkit" and the other is "an aggregated set of functions to call".

I'll call the latter a "toolbox".

If you've ever used Python's "decimal" package, you'll see that it has a way to set the (per-thread) global context, like the number of digits to use. This makes it much easier to use than passing the context around to each mathematical operation.

I propose there should be a similar "current_toolbox", similar to the GLOBAL_TOOLKIT_REGISTRY in that all functions can use that toolbox to get the current configuration.

By default this contain the existing set of toolkits.

If you want to change the toolkit, then you create it, and use it in a context manager, like this:

# Override "from_smiles"
def my_from_smiles(smiles):
    print(f"Making acetate instead of {smiles!r}") 
    from openff.toolkit.tests import create_molecules
    return create_molecules.create_acetate()

my_toolbox = get_toolbox(
    from_smiles = my_from_smiles,
    )

with my_toolbox:
    # If functions aren't found in my_toolbox then
    # it will search the context stack
    mol = current_toolbox.from_smiles("C#N")
    print("mol", mol)

The context manager keeps track of the toolkit context stack. If an operation isn't found in the current toolkit it checks the older toolkits in the stack.

In some cases, it can be important to not check the stack. In that case, the toolkit can be configured to not inherit dynamically from the stack:

# Ensure we only use RDKit for this method.
must_use_rdkit = get_toolbox(rdkit_toolkit_wrapper, inherit_=False)

def return_3x_methane():
    tb = current_toolbox
    with must_use_rdkit:
        mol = tb.from_smiles("C")
        smi = tb.to_smiles(mol)
        return ".".join([smi]*3)

In other cases (like for debugging) it can be useful to interpose a call in the call chain:

# Interpose the call, and forward to its super()
def debug_call(*args, **kwargs):
    current_call = get_current_call()
    next_call = current_call.get_super()
    name = current_call.name
    print(f"DEBUG: about to call {name}() in {next_call.__module__}")
    try:
        result = current_call.get_super()(*args, **kwargs)
    except:
        print("DEBUG: exception", name, *args, **kwargs)
        raise
    else:
        print("DEBUG: returned", name, result)
    return result

my_toolbox = get_toolbox(
    from_smiles = my_from_smiles,
    return_3x_methane = return_3x_methane,
    to_smiles = debug_call,
    )

I put together an example implementation, for a more concrete evaluation, at https://gist.github.com/adalke/5563a54ba95edab06cc96cbe56a592ca .

The default uses OpenEyeToolkitWrapper then RDKitToolkit wrapper. I show how that can be changed with multiple contexts, like this:

with my_toolbox:
    # If functions aren't found in my_toolbox then
    # it will search the context stack

    mol = current_toolbox.from_smiles("C#N")
    print("mol", mol)
    print("to_smiles:", current_toolbox.to_smiles(mol))
    print("3x_methane:", current_toolbox.return_3x_methane())

with get_toolbox(rdkit_toolkit_wrapper):
    with my_toolbox:
        # If functions aren't found in my_toolbox then
        # it will search the context stack

        mol = current_toolbox.from_smiles("C#N")
        print("mol", mol)
        print("to_smiles:", current_toolbox.to_smiles(mol))
        print("3x_methane:", current_toolbox.return_3x_methane())

The output is:

Making acetate instead of 'C#N'
mol Molecule with name '' and SMILES '[H]C([H])([H])C(=O)[O-]'
DEBUG: about to call to_smiles() in openff.toolkit.utils.openeye_wrapper
DEBUG: returned to_smiles [H]C([H])([H])C(=O)[O-]
to_smiles: [H]C([H])([H])C(=O)[O-]
3x_methane: [H][C]([H])([H])[H].[H][C]([H])([H])[H].[H][C]([H])([H])[H]
Making acetate instead of 'C#N'
mol Molecule with name '' and SMILES '[H]C([H])([H])C(=O)[O-]'
DEBUG: about to call to_smiles() in openff.toolkit.utils.rdkit_wrapper
DEBUG: returned to_smiles [H][C]([H])([H])[C](=[O])[O-]
to_smiles: [H][C]([H])([H])[C](=[O])[O-]
3x_methane: [H][C]([H])([H])[H].[H][C]([H])([H])[H].[H][C]([H])([H])[H]

which shows that to_smiles() in the first case used openeye_wrapper and in the second used rdkit_wrapper.

adalke commented 3 years ago

My hope, under this sort of system, that that @lilyminium's issue could be handled like:

with get_toolbox(
   chemical_environment_matches = rdkit_toolkit.chemical_environment_matches
   ):
     system = current_toolbox.create_openmm_system( ... )

A full implementation would need to change things like (from molecule.py):

    def strip_atom_stereochemistry(
        self, smarts, toolkit_registry=GLOBAL_TOOLKIT_REGISTRY
    ):
     ...
        chem_env = AtomChemicalEnvironment(smarts)
        matches = self.chemical_environment_matches(
            chem_env, toolkit_registry=toolkit_registry
        )

to something more like:

    def strip_atom_stereochemistry(self, smarts):
      ...
        chem_env = AtomChemicalEnvironment(smarts)
        matches = current_toolbox.chemical_environment_matches(chem_env)

I can think of ways to handle backwards compatibility, for example, the initial/top-stack of the "current_toolbox" could use the existing registry system, and the registries could implement an appropriate context manager, where GLOBAL_TOOLKIT_REGISTRY doesn't change anything.

With this approach I think there would be little need to support registering/deregistering toolkits, since that functionality can be done with a new "toolbox" context.

davidlmobley commented 2 years ago

Seems nice to have, probably not essential; if there's a low barrier way to achieve this without much developer time I'm all for it, or if someone wants to volunteer... but it's not aligned with our major priorities I think. I can see why it's appealing though.

mattwthompson commented 2 years ago

I don't think anybody has bit on implementing this; as it stands Interchange will likely follow the pattern of using GLOBAL_TOOLKIT_REGISTRY most of the time but allowing a custom registry to be passed in. This has similar tradeoffs as exist now, i.e. you can turn off all calls to RDKit but you can't conveniently mix-and-match which toolkits are used for specific functionality.

mattwthompson commented 2 years ago

There is now a (private!) context manager that might be useful for some of this behavior, landing in v0.11.0:

from openff.toolkit.utils.toolkit_registry import _toolkit_registry_manager

RDKIT_TOP_REGISTRY = ToolkitRegistry(
    toolkit_precedence=[
        RDKitToolkitWrapper,
        OpenEyeToolkitWrapper,
        AmberToolsToolkitWrapper,
        BuiltInToolkitWrapper,
    ],
    exception_if_unavailable=False,
)

with _toolkit_registry_manager(RDKIT_TOP_REGISTRY):
    # will now behave with RDKIT_TOP_REGISTRY taking the place of GLOBAL_TOOLKITREGISTRY
    do_stuff(...)

If this is useful we may consider hardening it and adding it to the public API. But until then - no guarantees about stability, etc.