Template cache searches are too slow

openmm / openmmforcefields

CHARMM and AMBER forcefields for OpenMM (with small molecule support)

http://openmm.org

Other

254 stars 80 forks source link

Template cache searches are too slow #302

Open mark-mackey-cresset opened 1 year ago

mark-mackey-cresset commented 1 year ago

Currently, when a molecule is passed to a template generator it's looked for in the cache (if one is provided). The algorithm for doing this involves creating a Molecule object for each SMILES string in the cache, and then comparing its topology to the molecule being parameterised. In our benchmarks, this means that only around 10 or so cache entries are checked per second, as Molecule creation is expensive. If the cache has been in use for a while, it may well have several thousand entries in it, which means that checking the cache can take a lot more time than just doing the parameterisation (even including the AM1-BCC compute time).

Given that this is just an identity check, I can't see any reason why at least the initial check on the cache isn't just a SMILES string comparison, which would speed things up by several orders of magnitude.

jchodera commented 12 months ago

This is a great idea! Apologies it's taken us so long to address.

@mikemhenry : Are you able to take a stab at this?

mark-mackey-cresset commented 12 months ago

The only possible fly in the ointment is going to be choosing whose unique SMILES string to use. I imagine most users are either using RDkit or OE exclusively, but if anyone's switching you could end up with cache keys that are a mixture of SMILES from the two, which is unlikely to be useful :). I think you'll have to store which toolkit a given key was generated with, and for entries that don't have a key for the current toolkit fall back to the current slow compare-the-Molecule-object code. Alternatively use InCHI for the keys, but that's might require brining in an extra dependency?

jchodera commented 12 months ago

Great point.

We should use the OpenFF Molecule.to_smiles() to generate the SMILES string for lookup, which will produce a canonical isomeric SMILES representation that we can cache internally and reproducibly---at least as long as they are still using the same toolkit backend.

Even if someone ends up using the same cache with different toolkit backends, at worst, you'll get a separate copy of each molecule for each toolkit backend, both both should be correct, which will only slow things down a bit.

mikemhenry commented 11 months ago

Can do, I think using Molecule.to_smiles() does make the most sense for the key. I've ran into issue juggling different chem informatic toolkits so I think sticking to OpenFF makes the most sense.

mikemhenry commented 11 months ago

Just wanted to add that this hasn't fallen off my radar, I am working on a benchmark script so I can measure the speed-up

mattwthompson commented 11 months ago

You might want to consider mapped SMILES or even CMILES as the hash, especially if you end up using @functools.lru_cache. We recently ran into a critical-yet-esoteric bug that mapped SMILES would have safeguarded against.

jchodera commented 11 months ago

Great suggestion, @mattwthompson !