openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
311 stars 91 forks source link

Antechamber/sqm partial charge calculation failed on molecule molecule #567

Closed tarungog closed 2 years ago

tarungog commented 4 years ago

Hello,

I am getting an error when trying to calculate energy on a molecule via the openff and openmm packages.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
import py3Dmol
from simtk.openmm import openmm

from openforcefield.topology import Molecule

# Define the keyword arguments to feed to ForceField
from simtk import unit
from simtk.openmm import app

from rdkit import Chem
molOrig = Chem.MolFromSmiles("[H]OC([H])([H])/C([H])=C(\\[H])c1c([H])c([H])c(Oc2c([H])c(C([H])(Oc3c(OC([H])([H])[H])c([H])c(/C([H])=C(\\[H])C([H])([H])O[H])c([H])c3OC([H])([H])[H])C([H])(Oc3c([H])c([H])c(/C([H])=C(\\[H])C([H])([H])O[H])c([H])c3OC([H])([H])[H])C([H])([H])O[H])c([H])c(OC([H])([H])[H])c2OC([H])(C([H])([H])O[H])C([H])(O[H])c2c([H])c(Oc3c(OC([H])([H])[H])c([H])c(/C([H])=C(\\[H])C([H])([H])O[H])c([H])c3OC([H])([H])[H])c(OC([H])(C([H])([H])O[H])C([H])(O[H])c3c([H])c([H])c([O])c(OC([H])([H])[H])c3[H])c(OC([H])([H])[H])c2[H])c(OC([H])([H])[H])c1[H]")
molOrig = Chem.AddHs(molOrig)
# AllChem.EmbedMolecule(molOrig)
# Chem.rdmolops.AssignAtomChiralTagsFromStructure(molOrig)
from openforcefield.topology import Molecule
ofmol = Molecule.from_rdkit(molOrig, allow_undefined_stereo=True)
#generate_unique_atom_names(ofmol)
ofmol.generate_conformers(n_conformers=10)
pos = ofmol.conformers[0]
ofmol.name = 'molecule'

from simtk.openmm import app
from simtk import unit
forcefield_kwargs = { 'constraints' : app.HBonds, 'rigidWater' : True, 'removeCMMotion' : False, 'hydrogenMass' : 4*unit.amu }
# Initialize a SystemGenerator using GAFF
from openmmforcefields.generators import SystemGenerator
system_generator = SystemGenerator(small_molecule_forcefield='gaff-2.11', molecules=[ofmol], forcefield_kwargs=forcefield_kwargs, cache='db.json')
# Create an OpenMM System from an Open Force Field toolkit Topology object
system = system_generator.create_system(ofmol.to_topology().to_openmm())

image

Anybody have any idea what's wrong?

j-wags commented 4 years ago

Hi @tarungog,

Thanks for the reproducing example -- I can generate the same error on my computer.

When I run this, antechamber prints out a particular message to the terminal:

Info: Bond types are assigned for valence state (1) with penalty (1).
Info: Total number of electrons: 713; net charge: 0
Info: The number of electrons is odd (713).
      Please check the total charge (-nc flag) and spin multiplicity (-m flag).

My first guess based on this is that some atoms in the SMILES should have a formal charge. My next guess is that this is just a particularly large molecule, and antechamber doesn't like it.

The following would be good next steps:

tarungog commented 4 years ago

@j-wags I just checked and there is no formal charge on this molecule.

Tried a much smaller molecule and it still failed:

Exception: Antechamber/sqm partial charge calculation failed on molecule molecule (SMILES [H][O][C]([H])([H])[C]([H])=[C]([H])[C]1=[C]([H])[C]([O][C]([H])([H])[H])=[C]2[O][C]([H])([C]3=[C]([H])[C]([H])=[C]([OH])[C]([O][C]([H])([H])[H])=[C]3[H])[C]([H])([C]([H])([H])[O][H])[C]2=[C]1[H])

Here is the original smiles string of the smaller molecule if you want to try and reproduce it:

molOrig = Chem.MolFromSmiles("COc1cc(C2Oc3c(OC)cc(C=CCO)cc3C2CO)ccc1[O]")
j-wags commented 4 years ago

Hi @tarungog,

Interesting. There's something extremely subtle happening here, and it will take some time to put my finger on it. The root cause is probably related to #566, so I'd like to move discussion of long-term solutions to there.

Short-term, I think the following workaround may work for you:

ofmol2 = Molecule.from_smiles("COc1cc(C2Oc3c(OC)cc(C=CCO)cc3C2CO)ccc1[O]", 
                               allow_undefined_stereo=True)
ofmol2 = Molecule.from_smiles(ofmol2.to_smiles(), allow_undefined_stereo=True)
ofmol2.generate_conformers(n_conformers=10)
ofmol2.compute_partial_charges_am1bcc()
ofmol2.to_file('test.sdf', file_format='sdf')

The SMILES round trip forces all the hydrogens to become "really" explicit (I'm not sure that AddHs made them explicit in exactly the way we want, but I'll continue looking into this in #566)

It'll be very important to check out the resulting SDF and ensure that it yielded the protomer that you intended it to -- Different hydrogen addition methods disagree about the protomeric state of the oxygen on the aromatic ring.

Please let me know if this resolves the issue for you.

tarungog commented 4 years ago

Does this mean anything to you?

NameError: name 'topologyobjects' is not defined

when running create_system

j-wags commented 4 years ago

Hmm, I haven't seen that before. The Open Force Field Toolkit has Topology and TopologyMolecule classes, but I've never seen topologyobjects. Googling makes me think it's a ParmEd thing: https://parmed.github.io/ParmEd/html/topobj/parmed.topologyobjects.Atom.html

Can I assume that the above workaround resolved the original issue? I'd like to close the issue if so.

tarungog commented 4 years ago

@j-wags hard to say if the original question has been 'solved' until I figure out this parmed bug. I'd prefer to leave it open for the time being. I'll get back to this soon!

tarungog commented 4 years ago

@j-wags two questions before we can close this topic. Your workaround appears to be performing ok for now (thank you!!!), but it makes a couple of things difficult for my workflow.

(1) Rather than generating conformers on the openff side, my pipeline will generate conformers in RDKit that need to be minimized. With the smiles "gateway", the original RDKit conformers are lost. Can you give me a workaround for this workaround, without sacrificing too much performance?

(2) The first time this line is run on a new molecule of reasonable size (~100 atoms), it takes upwards of 2 minutes walltime and I'm sure it'll grow even longer for my molecules of interest (500-1000 atoms). Why is that? Will I have to simply preload and cache all molecules I'm interested in?

%time system = system_generator.create_system(ofmol.to_topology().to_openmm())
#CPU times: user 772 ms, sys: 47.1 ms, total: 820 ms
#Wall time: 2min 8s
davidlmobley commented 4 years ago

@tarungog I'll speak just to the speed issue and leave the others for Jeff.

Partial charges come from a semiempirical QM calculation (AM1-BCC). While fast, like all QM it doesn't scale particularly well with size, so as you get to larger systems charge assignment takes particularly long. I don't think you'll be able to use it for larger systems (500-1000 atoms).

That means you have three main options as you pass the size of molecules which can be handled with AM1-BCC, I think: 1) Preassign your own charges and bring them in rather than having the toolkit assign them; 2) Define library charges for the building blocks of your large molecules so the toolkit does not have to assign charges or 3) Wait for us to get to support for polymers/biopolymers (months to a year) where we'd have a procedure which identifies building blocks, separates them, caps them, charges them individually and then splices them back together

(Or you could do your own version of (3) and then bring in the charges via option (1).)

j-wags commented 4 years ago

Hmm, so double-checking the input, it seems that there's a particular oxygen in the input SMILES that can either be considered to have an implicit hydrogen, or to have a negative charge. It's this one at the end:

COc1cc(C2Oc3c(OC)cc(C=CCO)cc3C2CO)ccc1 [O]

The square brackets around it have a special meaning. Per the SMILES specification (sec 3.2.1):

Within brackets, any attached hydrogens and formal charges must always be specified.

So, we put RDKit in a pickle by running AddHs on this. The input SMILES specifies "this is an oxygen with one single bond to it, no hydrogens, and no formal charge". Thus, RDKit fails to turn this into a chemically sensible molecule. Removing the "no hydrogens" or "no charge" constraint yields a molecule that runs in OpenFF without issue:

Screen Shot 2020-04-10 at 10 02 14 AM

I can fix this either by specifying that the O should have an H added to fill its valence requirement, by removing the brackets:

COc1cc(C2Oc3c(OC)cc(C=CCO)cc3C2CO)ccc1 O

or specifying that the O really has no hydrogens, and instead has a negative formal charge:

COc1cc(C2Oc3c(OC)cc(C=CCO)cc3C2CO)ccc1 [O-1]

tarungog commented 4 years ago

Thank you both for responses!

@davidlmobley Can we get cruder partial charges using a different method? For example RDKit's native method (Gasteiger) seems to work pretty rapidly. Here is an example: https://sourceforge.net/p/rdkit/mailman/message/34306459/

How would I go about transferring these Gasteiger charges to the GAFF OpenMM side? Is GAFF validated to use such charges or no?

@j-wags Thank you for this discovery! We were using someone else's module to produce these biopolymers, and attracting the Smiles string via the RDKit Chem.MolToSmiles. We didn't realize the SMILES output was doing this strange thing. We have fixed this, but I'd still like your input on my questions above regarding (2). Since we don't need the smiles gateway workaround, I think (1) is more or less answered.

davidlmobley commented 4 years ago

@tarungog :

You are certainly welcome to provide whatever charges you like via the molecules you load (e.g. place them in a mol2 or -- coming soon if it's not already ready, an sdf file -- and load them into the toolkit). However, Gasteiger charges have a number of well-documented problems that you can find in the MD literature and are generally considered unsuitable for simulations in our field, so I really don't recommend those. If you insist on using them, though, you can consult our docs on Molecules and just make sure the molecules in the system you're parameterizing have charges when you go to parameterize the system, and you ask the parameter assignment routine to use the provided charges.

In terms of whether GAFF and OpenFF are validated to use such charges, the answer is "no" (see above); Gasteiger charges are considered poor for these purposes. If you want to be able to use reasonable quality charges for larger molecules you have a more painful road ahead of you.

tarungog commented 4 years ago

@davidlmobley What is the nature of that painful road? What would it take to get reasonable charges that don't take forever? If it's really that bad, I may just revert back to using RDKit's MMFF94. I don't know too much personally about force fields (I'm a machine learning guy), so I'm happy with any input you may have. I came here to find faster force field calculations at a similar level of accuracy. Molecules of size 400 atoms or so are taking around 1 second to minimize in RDKit, which puts a serious damper on my ML training. If GAFF with Gasteiger charges is problematic but no more problematic than RDKit's built in MMFF, then I'm fine with using it. Is that the case?

davidlmobley commented 4 years ago

@tarungog We are not here focused on faster FFs at a similar level of accuracy than MMFF94, but (a) more accurate FFs even if slower, and (b) infrastructure to continue improving FFs.

That said, RDKit's built-in MMFF may not truly be MMFF94 (especially when it pertains to electrostatics) and I am not sure anyone knows how accurate it is in general.

In terms of what the "more painful road" would be if you want to use reasonable charges for larger molecules -- basically someone needs to implement a procedure which involves fragmenting the larger molecules, capping them, then charging the individual component sub-moleucles, removing the caps, redistributing the charges, and linking them back together. This is doable and on the roadmap, but not something we are doing yet. It is not that hard, though, but would take a reasonable amount of developer time and is not urgent for us as it seems to be for you.