The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
I'm also running into significant memory issues when adding more than ~5,000 waters, but unable to clearly isolate and report them.
The key difference is that Topology.from_molecules adds all molecules at once and then updates the cache whereas Topology.add_molecule adds the molecule (fast) updates the cache (not as fast) on each call. This behavior difference is not documented (and I believe this docstring is incorrect.) In the above reproduction, there isn't any reason to use .add_molecule, but in many use cases involving macromolecules, we need to add individual molecules to an existing topology, sometimes many. The scaling here appears to be super-linear and raises concerns about systems of more than a few thousand solvents.
Describe the solution you'd like
An feature like Topology.add_molecules which takes in list[Molecule | _SimpleMoleucle], adds them all at once, and only updates the cache once, just prior to returning. I'd prefer a new method or a polymorphic behavior of the existing Topology.add_molecule.
An alternative is to expose an argument to Topology.add_molecule that skips cache management (._add_molecule_keep_cache) and pushes the responsibility back on the user, who should call topology._invalidate_cached_properties later. I think this is dangerous.
Describe alternatives you've considered
Hook into private attributes and methods.
Additional context
We've had to deal with cache management before and it's been messy, i.e.:
Good catch and great writeup. I think I'm slightly in favor of making add_molecule polymorphic so it can accept an iterable of Molecule | _SimpleMolecule. Happy to take a PR for this!
Is your feature request related to a problem? Please describe.
Topology.add_molecule
is slow when called many (hundreds or thousands) of times:Out:
n | add_molecule | from_molecules
-- | -- | -- 100 | 0.005265951157 | 0.003153800964 200 | 0.01143503189 | 0.006229877472 300 | 0.02010607719 | 0.009879112244 400 | 0.03147602081 | 0.01276278496 500 | 0.04498314857 | 0.01589989662 600 | 0.09369587898 | 0.02003026009 700 | 0.07955312729 | 0.02353787422 800 | 0.09967899323 | 0.05754804611 900 | 0.1233439445 | 0.03064894676 1000 | 0.1784200668 | 0.03322601318 1100 | 0.1743271351 | 0.03602099419 1200 | 0.2516520023 | 0.03938388824 1300 | 0.2655379772 | 0.04273581505 1400 | 0.2654650211 | 0.08011960983 1500 | 0.3049080372 | 0.07958674431 1600 | 0.3679380417 | 0.09016990662 1700 | 0.4580187798 | 0.09267616272 1800 | 0.4605190754 | 0.09686183929 1900 | 0.4915962219 | 0.09718298912 2000 | 0.5896348953 | 0.06572294235 2100 | 0.6378009319 | 0.06884002686 2200 | 0.6888310909 | 0.1076860428 2300 | 0.7243771553 | 0.1178209782 2400 | 0.8062388897 | 0.1126852036 2500 | 0.8307061195 | 0.1238048077 2600 | 0.9382829666 | 0.1206080914 2700 | 0.9951388836 | 0.1316249371 2800 | 1.053414106 | 0.1318500042 2900 | 1.251863241 | 0.133026123 3000 | 1.702547073 | 0.1354470253 3100 | 1.217139959 | 0.1501860619 3200 | 1.318310738 | 0.140696764 3300 | 1.408802986 | 0.1381850243 3400 | 1.475514174 | 0.1486041546 3500 | 1.566795111 | 0.1464569569 3600 | 1.677216053 | 0.1541819572 3700 | 1.740601778 | 0.1566593647 3800 | 1.842238903 | 0.1587378979 3900 | 1.924557209 | 0.1610488892 4000 | 2.058802128 | 0.1687648296 4100 | 2.113446951 | 0.1700348854 4200 | 2.231441736 | 0.1775681973 4300 | 2.388463259 | 0.174751997 4400 | 2.510241032 | 0.1843349934 4500 | 2.543238163 | 0.2115340233 4600 | 2.653229713 | 0.2233736515 4700 | 2.79717803 | 0.1896550655 4800 | 2.962532282 | 0.1973302364 4900 | 3.24529314 | 0.2079119682 5000 | 3.197937727 | 0.2045538425
I'm also running into significant memory issues when adding more than ~5,000 waters, but unable to clearly isolate and report them.
The key difference is that
Topology.from_molecules
adds all molecules at once and then updates the cache whereasTopology.add_molecule
adds the molecule (fast) updates the cache (not as fast) on each call. This behavior difference is not documented (and I believe this docstring is incorrect.) In the above reproduction, there isn't any reason to use.add_molecule
, but in many use cases involving macromolecules, we need to add individual molecules to an existing topology, sometimes many. The scaling here appears to be super-linear and raises concerns about systems of more than a few thousand solvents.Describe the solution you'd like
An feature like
Topology.add_molecules
which takes inlist[Molecule | _SimpleMoleucle]
, adds them all at once, and only updates the cache once, just prior to returning. I'd prefer a new method or a polymorphic behavior of the existingTopology.add_molecule
.An alternative is to expose an argument to
Topology.add_molecule
that skips cache management (._add_molecule_keep_cache
) and pushes the responsibility back on the user, who should calltopology._invalidate_cached_properties
later. I think this is dangerous.Describe alternatives you've considered
Hook into private attributes and methods.
Additional context
We've had to deal with cache management before and it's been messy, i.e.:
https://github.com/openforcefield/openff-toolkit/blob/1feb83d484b5f3f885007f76f8ff34d93fb42d91/openff/toolkit/topology/molecule.py#L3006