openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
311 stars 91 forks source link

`Topology.add_molecule` is slow when adding many molecules to a topology #1916

Closed mattwthompson closed 2 months ago

mattwthompson commented 2 months ago

Is your feature request related to a problem? Please describe.

Topology.add_molecule is slow when called many (hundreds or thousands) of times:

In [3]: import time
   ...:
   ...: import numpy
   ...: from openff.toolkit import Molecule, Topology
   ...:
   ...:
   ...: water = Molecule.from_smiles("O")
   ...:
   ...: file = open("data.csv", "a")
   ...:
   ...: file.write("#n,add_molecule,from_molecules")
   ...:
   ...: for n_solvent in numpy.linspace(100, 5_000, 50, dtype=int):
   ...:
   ...:     start = time.time()
   ...:
   ...:     topology = Topology()
   ...:
   ...:     for index in range(n_solvent):
   ...:         topology.add_molecule(water)
   ...:
   ...:     path1 = time.time() - start
   ...:
   ...:     del topology
   ...:
   ...:     start = time.time()
   ...:
   ...:     Topology.from_molecules(n_solvent * [water])
   ...:
   ...:     path2 = time.time() - start
   ...:
   ...:     file.write(f"\n{n_solvent},{path1},{path2}")
   ...:
   ...: file.close()

Out:

n | add_molecule | from_molecules

-- | -- | -- 100 | 0.005265951157 | 0.003153800964 200 | 0.01143503189 | 0.006229877472 300 | 0.02010607719 | 0.009879112244 400 | 0.03147602081 | 0.01276278496 500 | 0.04498314857 | 0.01589989662 600 | 0.09369587898 | 0.02003026009 700 | 0.07955312729 | 0.02353787422 800 | 0.09967899323 | 0.05754804611 900 | 0.1233439445 | 0.03064894676 1000 | 0.1784200668 | 0.03322601318 1100 | 0.1743271351 | 0.03602099419 1200 | 0.2516520023 | 0.03938388824 1300 | 0.2655379772 | 0.04273581505 1400 | 0.2654650211 | 0.08011960983 1500 | 0.3049080372 | 0.07958674431 1600 | 0.3679380417 | 0.09016990662 1700 | 0.4580187798 | 0.09267616272 1800 | 0.4605190754 | 0.09686183929 1900 | 0.4915962219 | 0.09718298912 2000 | 0.5896348953 | 0.06572294235 2100 | 0.6378009319 | 0.06884002686 2200 | 0.6888310909 | 0.1076860428 2300 | 0.7243771553 | 0.1178209782 2400 | 0.8062388897 | 0.1126852036 2500 | 0.8307061195 | 0.1238048077 2600 | 0.9382829666 | 0.1206080914 2700 | 0.9951388836 | 0.1316249371 2800 | 1.053414106 | 0.1318500042 2900 | 1.251863241 | 0.133026123 3000 | 1.702547073 | 0.1354470253 3100 | 1.217139959 | 0.1501860619 3200 | 1.318310738 | 0.140696764 3300 | 1.408802986 | 0.1381850243 3400 | 1.475514174 | 0.1486041546 3500 | 1.566795111 | 0.1464569569 3600 | 1.677216053 | 0.1541819572 3700 | 1.740601778 | 0.1566593647 3800 | 1.842238903 | 0.1587378979 3900 | 1.924557209 | 0.1610488892 4000 | 2.058802128 | 0.1687648296 4100 | 2.113446951 | 0.1700348854 4200 | 2.231441736 | 0.1775681973 4300 | 2.388463259 | 0.174751997 4400 | 2.510241032 | 0.1843349934 4500 | 2.543238163 | 0.2115340233 4600 | 2.653229713 | 0.2233736515 4700 | 2.79717803 | 0.1896550655 4800 | 2.962532282 | 0.1973302364 4900 | 3.24529314 | 0.2079119682 5000 | 3.197937727 | 0.2045538425

image

I'm also running into significant memory issues when adding more than ~5,000 waters, but unable to clearly isolate and report them.

The key difference is that Topology.from_molecules adds all molecules at once and then updates the cache whereas Topology.add_molecule adds the molecule (fast) updates the cache (not as fast) on each call. This behavior difference is not documented (and I believe this docstring is incorrect.) In the above reproduction, there isn't any reason to use .add_molecule, but in many use cases involving macromolecules, we need to add individual molecules to an existing topology, sometimes many. The scaling here appears to be super-linear and raises concerns about systems of more than a few thousand solvents.

Describe the solution you'd like

An feature like Topology.add_molecules which takes in list[Molecule | _SimpleMoleucle], adds them all at once, and only updates the cache once, just prior to returning. I'd prefer a new method or a polymorphic behavior of the existing Topology.add_molecule.

An alternative is to expose an argument to Topology.add_molecule that skips cache management (._add_molecule_keep_cache) and pushes the responsibility back on the user, who should call topology._invalidate_cached_properties later. I think this is dangerous.

Describe alternatives you've considered

Hook into private attributes and methods.

Additional context

We've had to deal with cache management before and it's been messy, i.e.:

https://github.com/openforcefield/openff-toolkit/blob/1feb83d484b5f3f885007f76f8ff34d93fb42d91/openff/toolkit/topology/molecule.py#L3006

j-wags commented 2 months ago

Good catch and great writeup. I think I'm slightly in favor of making add_molecule polymorphic so it can accept an iterable of Molecule | _SimpleMolecule. Happy to take a PR for this!