openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License

`create_openmm_system` very slow for large (MW > 500) small molecules #395

Closed pyeguy closed 5 years ago

pyeguy commented 5 years ago

When creating an OpenMM system via ForceField.create_openmm_system, the time required seems to increase drastically with MW. I'm currently still waiting for a MW ~900 molecule to finish after 30+ min.

To Reproduce

import openforcefield as off
from rdkit import Chem
from rdkit.Chem import AllChem
from simtk import openmm, unit
from simtk.openmm import app
from openforcefield.topology import Topology
from openforcefield.topology import Molecule
from openforcefield.typing.engines.smirnoff import ForceField
# loaded from smirnoff99Frosst package
ff = ForceField('smirnoff99Frosst-1.0.9.offxml')

# smiles for venetoclax
rdmol = Chem.MolFromSmiles("CC1(CCC(=C(C1)CN2CCN(CC2)C3=CC=C(C=C3)C(=O)NS(=O)(=O)C4=CC(=C(C=C4)N[C@H](CCN5CCOCC5)CSC6=CC=CC=C6)S(=O)(=O)C(F)(F)F)C7=CC=C(C=C7)Cl)C")

ofmol = Molecule.from_rdkit(rdmol)
topology = ofmol.to_topology()
org_system = ff.create_openmm_system(topology)

Output

In AmberToolsToolkitwrapper.computer_partial_charges_am1bcc: Molecule '' has more than one conformer, but this function will only generate charges for the first one.

This warning gets thrown after ~10 seconds, but the parameterization is still running...

Computing environment (please complete the following information):

pyeguy commented 5 years ago

Finished w/ Wall time: 50min 57s

j-wags commented 5 years ago

Thanks for the detailed issue report, @pyeguy. This is unfortunately expected behavior. The AM1 semiempirical quantum calculations are computationally expensive, and they scale poorly with molecule size. On the backend, our open-source stack uses sqm from the AmberTools suite. Depending on your situation, you may be able to get an academic license for the OpenEye toolkits, which offer a higher-performance semiempirical quantum chemistry package.

Also, if you already have a desired set of partial charges calculated for your atoms, you can skip the charge generation step using the charge_from_molecules kwarg to create_openmm_system.
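A minimal sketch of that workflow, assuming the openforcefield release used in this thread (simtk units, the smirnoff99Frosst-1.0.9.offxml file from the report). The helper names `check_charges` and `system_with_precomputed_charges` are hypothetical and only for illustration; `charge_from_molecules` is the kwarg mentioned above. The charges are assumed to be in units of elementary charge and ordered like the atoms of the molecule.

```python
def check_charges(charges, n_atoms, net_charge=0.0, tol=1e-3):
    """Return True if there is one charge per atom and they sum to net_charge."""
    return len(charges) == n_atoms and abs(sum(charges) - net_charge) <= tol


def system_with_precomputed_charges(molecule, charges, net_charge=0.0,
                                    offxml="smirnoff99Frosst-1.0.9.offxml"):
    """Build an OpenMM system that reuses `charges` instead of running AM1-BCC.

    `molecule` is an openforcefield Molecule; `charges` is a sequence of floats
    (elementary charge units), assumed to be ordered like molecule.atoms.
    """
    # Imported lazily so check_charges() can be used without the toolkit installed.
    import numpy as np
    from simtk import unit
    from openforcefield.typing.engines.smirnoff import ForceField

    if not check_charges(charges, molecule.n_atoms, net_charge):
        raise ValueError("charges do not match the atom count or net charge")

    # Attach the charges to the molecule as a unit-bearing array.
    molecule.partial_charges = unit.Quantity(
        np.asarray(charges, dtype=float), unit.elementary_charge
    )

    forcefield = ForceField(offxml)
    # Molecules listed here keep their own charges, so sqm is not invoked for them.
    return forcefield.create_openmm_system(
        molecule.to_topology(), charge_from_molecules=[molecule]
    )
```

The sanity check is there because a mismatched atom count or a net charge that is off will otherwise only show up later, in the simulation.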

davidlmobley commented 5 years ago

100% agree with Jeff here. Until we have a general fragmentation scheme that can break larger molecules up into pieces and parameterize them consistently before stitching them back together, or an alternative charging scheme (ML-based, perhaps) which is adequate for larger molecules, we are stuck in this world. We're still running a QM calculation on the whole molecule so it's going to be slow.

jchodera commented 5 years ago

We are working on several strategies to accelerate this, but it will likely be a few months before we can replace toolkit AM1-BCC charges with something significantly faster.

pyeguy commented 5 years ago

Thanks for the quick replies and all the good work here!

I think @davidlmobley's approach would probably work great for my application, where I have a series of highly related molecules I would like to parameterize and then simulate.

In the meantime I can use the Gasteiger approximations from RDKit, but I assume those are rather dreadful...
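For reference, a rough sketch of how Gasteiger charges can be pulled out of RDKit. The function names `gasteiger_charges` and `find_invalid_charges` are hypothetical. The RDKit calls assume a standard RDKit install, and hydrogens are added explicitly so every atom gets a charge. RDKit can return NaN for atoms it has no Gasteiger parameters for, so the second helper flags those before the charges are used anywhere else.

```python
import math


def find_invalid_charges(charges):
    """Return the indices of charges that are NaN or infinite."""
    return [i for i, q in enumerate(charges) if not math.isfinite(q)]


def gasteiger_charges(smiles):
    """Compute Gasteiger partial charges for a SMILES string with RDKit."""
    # Imported lazily so find_invalid_charges() works without RDKit installed.
    from rdkit import Chem
    from rdkit.Chem import rdPartialCharges

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("RDKit could not parse the SMILES string")

    # Add explicit hydrogens so they receive their own charges.
    mol = Chem.AddHs(mol)
    rdPartialCharges.ComputeGasteigerCharges(mol)

    # RDKit stores the result as a per-atom property.
    return [atom.GetDoubleProp("_GasteigerCharge") for atom in mol.GetAtoms()]
```

If these charges are handed to another toolkit, the atom ordering has to match that toolkit's molecule; that is an assumption to verify rather than something this sketch guarantees.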

I'm sure this is a lot of work as well, but would switching to a more parallel open-source QM stack help with performance, e.g. CP2K interfaced via pycp2k?

jaimergp commented 5 years ago

Once I had to parametrize a ~400 atom ligand, so sqm obviously choked on that for hours and hours, trying to minimize the structure. In the end, I minimized the ligand using Gaussian with a semiempirical method, and then supplied those coordinates to the Antechamber stack.

Maybe you can use ase to minimize the structure with some other program before passing it to the sqm via openforcefield?

j-wags commented 5 years ago

@pyeguy Thanks for putting CP2K on our radar. Right now we need to stick to dependencies that are conda-installable to keep our deployment fast and easy, but I'll keep an eye on that to see if it becomes easier to install. Based on some work last week, I've found that conda does offer access to gfortran and gcc, but IIRC they were version 4.5.X on Mac, which falls short of CP2K's requirements.

Again, thanks for the feedback. If CP2K gets into a conda package, I'd love to try out their implementation!