openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
301 stars 88 forks source link

LibraryCharge handler slow SMARTS matching #971

Open SimonBoothroyd opened 3 years ago

SimonBoothroyd commented 3 years ago

Describe the bug

I've come across a couple of cases where the library change handler SMARTS matching code hangs when attempting to match a full molecule SMILES.

This is likely because there are tens (to hundereds) of thousands of ways the SMILES pattern can be matched, all of which are enumerated and considered even though only one of the matches will eventually be used.

To Reproduce Steps to reproduce the behavior. A minimal reproducing set of python commands is ideal.

import time

import numpy
from simtk import unit

from openff.toolkit.topology import Molecule
from openff.toolkit.typing.engines.smirnoff import LibraryChargeHandler
from openff.toolkit.utils import (
    GLOBAL_TOOLKIT_REGISTRY,
    OpenEyeToolkitWrapper,
    RDKitToolkitWrapper,
)

molecule: Molecule = Molecule.from_mapped_smiles(
    "[C:1]([N:2]1[C:3]([H:23])=[C:4]([H:24])[C:5]([H:25])=[C:6]1[C:7](=[S:8])"
    "[N:9]=[P:10]([N:11]([C:12]([H:26])([H:27])[H:28])[C:13]([H:29])([H:30])[H:31])"
    "([N:14]([C:15]([H:32])([H:33])[H:34])[C:16]([H:35])([H:36])[H:37])[N:17]"
    "([C:18]([H:38])([H:39])[H:40])[C:19]([H:41])([H:42])[H:43])([H:20])([H:21])"
    "[H:22]"
)

library_charge_handler = LibraryChargeHandler(version=0.3)
library_charge_handler.add_parameter(
    {
        "smirks": molecule.to_smiles(mapped=True),
        "charge": numpy.ones(molecule.n_atoms) * unit.elementary_charge
    }
)

for toolkit_wrapper in [OpenEyeToolkitWrapper, RDKitToolkitWrapper]:

    GLOBAL_TOOLKIT_REGISTRY._toolkits = [toolkit_wrapper()]

    start_time = time.perf_counter()

    matches = library_charge_handler.find_matches(molecule.to_topology())

    end_time = time.perf_counter()

    print(toolkit_wrapper.__name__, end_time - start_time, len(matches))

Output The full error message (may be large, that's fine. Better to paste too much than too little.)

Computing environment (please complete the following information):

Additional context Add any other context about the problem here.

lilyminium commented 3 years ago

Related: #967

@SimonBoothroyd also suggested (I think verbally?) matching all heavy atoms first, then adding hydrogens.

mattwthompson commented 3 years ago

This is a blocker for some large-scale testing I - and probably others - would like to do; library charge assignment being significant slower than AM1BCC (https://github.com/openforcefield/openff-toolkit/pull/954#issuecomment-857068752) largely defeats the purpose of pre-computing charges, which I would like to do on molecule datasets that I might want end up running tests on 10s or 100s of times.

Just using charge_from_molecules is in a sense an acceptable workaround, but wherever possible I don't want to assign parameters using input data that is not in a force field.

lilyminium commented 3 years ago

Setting unique=True is very helpful (https://github.com/openforcefield/openff-toolkit/tree/fix-967). Some timings on my computer with a truncated molecule (the original blew up my memory) below. To sum up,

In [1]: import time

        import numpy
        from simtk import unit

        from openff.toolkit.topology import Molecule
        from openff.toolkit.typing.engines.smirnoff import LibraryChargeHandler
        from openff.toolkit.utils import (
            GLOBAL_TOOLKIT_REGISTRY,
            OpenEyeToolkitWrapper,
            RDKitToolkitWrapper,
        )

        SMILES = "[N:1]=[P:2]([N:3][C:4]([H:9])([H:10])[H:11])([N:5][C:6]([H:12])([H:13])[H:14])[N:7][C:8]([H:15])([H:16])[H:17]"

        molecule = Molecule.from_mapped_smiles(SMILES)

        print(molecule.n_atoms)

        library_charge_handler = LibraryChargeHandler(version=0.3)
        library_charge_handler.add_parameter(
            {
                "smirks": molecule.to_smiles(mapped=True),
                "charge": numpy.ones(molecule.n_atoms) * unit.elementary_charge
            }
        )

        def find_matches(toolkit_wrapper, unique=None, match_heavy_first=None):
            GLOBAL_TOOLKIT_REGISTRY._toolkits = [toolkit_wrapper()]
            matches = library_charge_handler.find_matches(molecule.to_topology(),
                                                          unique=unique,
                                                          match_heavy_first=match_heavy_first)
            return len(matches)

17

In [2]: for unique in [True, False]:
        for toolkit_wrapper in [OpenEyeToolkitWrapper, RDKitToolkitWrapper]:
            n_matches = find_matches(toolkit_wrapper, unique, False)
            print(f"unique={str(unique):>5}, toolkit={toolkit_wrapper.__name__:>21}, match_heavy_first=False | {n_matches}")
        n_matches = find_matches(toolkit_wrapper, unique, True)
        print(f"unique={str(unique):>5}, toolkit={toolkit_wrapper.__name__:>21}, match_heavy_first= True | {n_matches}")

unique= True, toolkit=OpenEyeToolkitWrapper, match_heavy_first=False | 1 unique= True, toolkit= RDKitToolkitWrapper, match_heavy_first=False | 1 unique= True, toolkit= RDKitToolkitWrapper, match_heavy_first= True | 1 unique=False, toolkit=OpenEyeToolkitWrapper, match_heavy_first=False | 1296 unique=False, toolkit= RDKitToolkitWrapper, match_heavy_first=False | 1296 unique=False, toolkit= RDKitToolkitWrapper, match_heavy_first= True | 1296

In [3]: %timeit -r 3 -n 100 find_matches(OpenEyeToolkitWrapper, unique=False, match_heavy_first=False)

85.6 ms ± 294 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)

In [4]: %timeit -r 3 -n 100 find_matches(RDKitToolkitWrapper, unique=False, match_heavy_first=False)

25.3 ms ± 73.8 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)

In [5]: %timeit -r 3 -n 100 find_matches(OpenEyeToolkitWrapper, unique=True, match_heavy_first=False)

1.47 ms ± 30.6 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)

In [6]: %timeit -r 3 -n 100 find_matches(RDKitToolkitWrapper, unique=True, match_heavy_first=False)

4.3 ms ± 9.01 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)

In [7]: %timeit -r 3 -n 100 find_matches(RDKitToolkitWrapper, unique=False, match_heavy_first=True)

25.6 ms ± 69.7 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)

In [8]: %timeit -r 3 -n 100 find_matches(RDKitToolkitWrapper, unique=True, match_heavy_first=True)

1.71 ms ± 22.3 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)

mattwthompson commented 2 years ago

Thinking out loud, and I'm probably wrong - the original goal here is to make LibraryChargeHandler.find_matches more performant when passed a parameter generated from LibraryChargeType.from_molecule() more performant. Is there any way to modify the parameter to make the SMARTS matching faster? I would have thought using mapped SMILES would have helped, but that's not the case.

Revamping the entirety of .find_matches is a worthwhile objective but encompasses a much larger scope than the original problem. If there's a way we could get the library charges to (re-)match quicker without modifying the behavior of the other code paths, that would allow this to be pushed through much quicker.

lilyminium commented 2 years ago

Is there any way to modify the parameter to make the SMARTS matching faster?

If OpenFF treated isotopes, you could use that -- I think that's actually how I pin atom matches on RDKit atm

(Of course, this is very precarious if you expose it to real users instead of doing it as an interim hack.)

j-wags commented 2 years ago

I've made some test cases for different SMARTS matching performance optimizations. This requires conda install -c conda-forge snakeviz. To run the "high symmetry" librarycharge example, . run_high_symmetry.sh and then visualize the output .prof file using eg snakeviz symm_21_08_09_14_59_56.prof.

2021_08_09_matching_profiling.tar.gz

mattwthompson commented 2 years ago

On the most recent commit of #1036 (8af8f5ecb1020aed76a20ea67267ac2a493072bd ), I can get the first example to run with OpenEye in about 13 seconds, but it hangs with RDKit

mattwthompson commented 2 years ago

Summary of #1036 is that unique was added and turned on by default for LibraryChargeHandler.find_matches (only) and no match_heavy_first was added to the public API or publicly-accessible behavior.

It would still be valuable to have more speedups, especially as I want to rely on this more heavily (#806), but it doesn't seem easy to make progress on.