Open SimonBoothroyd opened 3 years ago
Related: #967
@SimonBoothroyd also suggested (I think verbally?) matching all heavy atoms first, then adding hydrogens.
This is a blocker for some large-scale testing I - and probably others - would like to do; library charge assignment being significant slower than AM1BCC (https://github.com/openforcefield/openff-toolkit/pull/954#issuecomment-857068752) largely defeats the purpose of pre-computing charges, which I would like to do on molecule datasets that I might want end up running tests on 10s or 100s of times.
Just using charge_from_molecules
is in a sense an acceptable workaround, but wherever possible I don't want to assign parameters using input data that is not in a force field.
Setting unique=True
is very helpful (https://github.com/openforcefield/openff-toolkit/tree/fix-967). Some timings on my computer with a truncated molecule (the original blew up my memory) below. To sum up,
unique=False
, surprisingly, but cut the unique=True
time further to 40% (abs: 1.7 ms)In [1]: import time
import numpy
from simtk import unit
from openff.toolkit.topology import Molecule
from openff.toolkit.typing.engines.smirnoff import LibraryChargeHandler
from openff.toolkit.utils import (
GLOBAL_TOOLKIT_REGISTRY,
OpenEyeToolkitWrapper,
RDKitToolkitWrapper,
)
SMILES = "[N:1]=[P:2]([N:3][C:4]([H:9])([H:10])[H:11])([N:5][C:6]([H:12])([H:13])[H:14])[N:7][C:8]([H:15])([H:16])[H:17]"
molecule = Molecule.from_mapped_smiles(SMILES)
print(molecule.n_atoms)
library_charge_handler = LibraryChargeHandler(version=0.3)
library_charge_handler.add_parameter(
{
"smirks": molecule.to_smiles(mapped=True),
"charge": numpy.ones(molecule.n_atoms) * unit.elementary_charge
}
)
def find_matches(toolkit_wrapper, unique=None, match_heavy_first=None):
GLOBAL_TOOLKIT_REGISTRY._toolkits = [toolkit_wrapper()]
matches = library_charge_handler.find_matches(molecule.to_topology(),
unique=unique,
match_heavy_first=match_heavy_first)
return len(matches)
17
In [2]: for unique in [True, False]:
for toolkit_wrapper in [OpenEyeToolkitWrapper, RDKitToolkitWrapper]:
n_matches = find_matches(toolkit_wrapper, unique, False)
print(f"unique={str(unique):>5}, toolkit={toolkit_wrapper.__name__:>21}, match_heavy_first=False | {n_matches}")
n_matches = find_matches(toolkit_wrapper, unique, True)
print(f"unique={str(unique):>5}, toolkit={toolkit_wrapper.__name__:>21}, match_heavy_first= True | {n_matches}")
unique= True, toolkit=OpenEyeToolkitWrapper, match_heavy_first=False | 1 unique= True, toolkit= RDKitToolkitWrapper, match_heavy_first=False | 1 unique= True, toolkit= RDKitToolkitWrapper, match_heavy_first= True | 1 unique=False, toolkit=OpenEyeToolkitWrapper, match_heavy_first=False | 1296 unique=False, toolkit= RDKitToolkitWrapper, match_heavy_first=False | 1296 unique=False, toolkit= RDKitToolkitWrapper, match_heavy_first= True | 1296
In [3]: %timeit -r 3 -n 100 find_matches(OpenEyeToolkitWrapper, unique=False, match_heavy_first=False)
85.6 ms ± 294 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)
In [4]: %timeit -r 3 -n 100 find_matches(RDKitToolkitWrapper, unique=False, match_heavy_first=False)
25.3 ms ± 73.8 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)
In [5]: %timeit -r 3 -n 100 find_matches(OpenEyeToolkitWrapper, unique=True, match_heavy_first=False)
1.47 ms ± 30.6 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)
In [6]: %timeit -r 3 -n 100 find_matches(RDKitToolkitWrapper, unique=True, match_heavy_first=False)
4.3 ms ± 9.01 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)
In [7]: %timeit -r 3 -n 100 find_matches(RDKitToolkitWrapper, unique=False, match_heavy_first=True)
25.6 ms ± 69.7 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)
In [8]: %timeit -r 3 -n 100 find_matches(RDKitToolkitWrapper, unique=True, match_heavy_first=True)
1.71 ms ± 22.3 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)
Thinking out loud, and I'm probably wrong - the original goal here is to make LibraryChargeHandler.find_matches
more performant when passed a parameter generated from LibraryChargeType.from_molecule()
more performant. Is there any way to modify the parameter to make the SMARTS matching faster? I would have thought using mapped SMILES would have helped, but that's not the case.
Revamping the entirety of .find_matches
is a worthwhile objective but encompasses a much larger scope than the original problem. If there's a way we could get the library charges to (re-)match quicker without modifying the behavior of the other code paths, that would allow this to be pushed through much quicker.
Is there any way to modify the parameter to make the SMARTS matching faster?
If OpenFF treated isotopes, you could use that -- I think that's actually how I pin atom matches on RDKit atm
(Of course, this is very precarious if you expose it to real users instead of doing it as an interim hack.)
I've made some test cases for different SMARTS matching performance optimizations. This requires conda install -c conda-forge snakeviz
. To run the "high symmetry" librarycharge example, . run_high_symmetry.sh
and then visualize the output .prof
file using eg snakeviz symm_21_08_09_14_59_56.prof
.
On the most recent commit of #1036 (8af8f5ecb1020aed76a20ea67267ac2a493072bd ), I can get the first example to run with OpenEye in about 13 seconds, but it hangs with RDKit
Summary of #1036 is that unique
was added and turned on by default for LibraryChargeHandler.find_matches
(only) and no match_heavy_first
was added to the public API or publicly-accessible behavior.
It would still be valuable to have more speedups, especially as I want to rely on this more heavily (#806), but it doesn't seem easy to make progress on.
Describe the bug
I've come across a couple of cases where the library change handler SMARTS matching code hangs when attempting to match a full molecule SMILES.
This is likely because there are tens (to hundereds) of thousands of ways the SMILES pattern can be matched, all of which are enumerated and considered even though only one of the matches will eventually be used.
To Reproduce Steps to reproduce the behavior. A minimal reproducing set of python commands is ideal.
Output The full error message (may be large, that's fine. Better to paste too much than too little.)
Computing environment (please complete the following information):
conda list
Additional context Add any other context about the problem here.