Open ChayaSt opened 4 years ago
Can we just run this on all of OpenFF Substituted Phenyl Set 1
?
Yes - All the molecules in the linked JSONs are in the OpenF Substituted Phenyl Set 1
Still some issues in running this through QCEngine:
import qcfractal.interface as ptl
import qcengine as qcng
client = ptl.FractalClient()
ds = client.get_collection("torsiondrivedataset", "OpenFF Substituted Phenyl Set 1")
ds.query("default")
good = []
for idx, row in ds.df["default"].items():
mol = client.query_molecules(id=row.initial_molecule[0])[0]
inp = {
"molecule": mol,
"driver": "energy",
"model": {"method": "openff-1.0.0", "basis": "smirnoff"}
}
ret = qcng.compute(inp, "openmm")
if ret.success:
good.append(idx)
print(good)
print(len(good))
print(ds.df.shape)
78/154 succeed at the moment.
I didn't run the whole set, but many of the errors seemed to be RDKit's "a neutral nitrogen has four bonds" complaint. I see that QCEngine screens to remove molecules with net charge, but in this case, zwitterions like a nitro group (with N+
and O-
) are causing the trouble. The relevant code is here: https://github.com/MolSSI/QCEngine/blob/c8ae155267a69af0be2a63034ea9224bbebd9d34/qcengine/programs/rdkit.py#L33-L74
The next release of OFFTK (mid-March, I think) will also include @jthorton's OFFMol.from_qcschema
and OFFMol.from_tagged_smiles
functions, either of which should be able to solve this as long as there's a (S/C)MILES around.
Another option (and I haven't thought this fully through) is that knowing net charge, elements, connectivity (with bond orders), and explicit Hs, MAY let us guess suitable formal charges for a graph mol. I did this in a notebook to see if it would work in the RDKit Harness and it may be dangerous, but it works. Here's a code snippet showing how the RDKit Harness code could be modified to work for the example above.
from rdkit import Chem
from qcengine.units import ureg
import qcportal as ptl
import qcengine as qcng
client = ptl.FractalClient()
ds = client.get_collection("torsiondrivedataset", "OpenFF Substituted Phenyl Set 1")
ds.query("default")
good = []
items = [i for i in ds.df["default"].items()]
idx, row = items[7]
mol = client.query_molecules(id=row.initial_molecule[0])[0]
### This part is in RDKitHarness._process_molecule_rdkit
# Handle errors
if abs(jmol.molecular_charge) > 1.0e-6:
raise InputError("RDKit does not currently support charged molecules.")
if not jmol.connectivity: # Check for empty list
raise InputError("RDKit requires molecules to have a connectivity graph.")
# Build out the base molecule
base_mol = Chem.Mol()
rw_mol = Chem.RWMol(base_mol)
for sym in jmol.symbols:
rw_mol.AddAtom(Chem.Atom(sym.title()))
# Add in connectivity
bond_types = {1: Chem.BondType.SINGLE, 2: Chem.BondType.DOUBLE, 3: Chem.BondType.TRIPLE}
for atom1, atom2, bo in jmol.connectivity:
rw_mol.AddBond(atom1, atom2, bond_types[bo])
# Predict formal charges on atoms (needed to construct valid graph molecule)
# NOTE: This is only sufficient if
# 1) all hydrogens are explicit
# 2) bond orders are known
# 3) The net charge on the molecule is known (possible valences of P and S can introduce ambiguity)
neutral_valences = {'H': (1,), 'C': (4,), 'O': (2,),
'N': (3,), 'P': (4, 6), 'S': (2, 6),
'F': (1,), 'Cl': (1,), 'Br': (1,),
'I': (1,)}
net_charge = 0
for atom in rw_mol.GetAtoms():
# Get the sum of the bond orders involving this atom
valence_sum = sum([bond.GetBondType() for bond in atom.GetBonds()])
element_symbol = qcelemental.periodictable.E[atom.GetAtomicNum()]
if element_symbol not in neutral_valences:
raise InputError(f"RDKit harness does not currently support element {element_symbol}")
possible_charges = [valence_sum - neutral_valence for neutral_valence in neutral_valences[element_symbol]]
# Sort to put the lowest-magnitude charge first
possible_charges.sort(key=lambda x: abs(x))
atom_charge = possible_charges[0]
atom.SetFormalCharge(atom_charge)
net_charge += atom_charge
if net_charge != jmol.molecular_charge:
raise InputError(f"RDKit harness is unable to deduce formal charges on molecule {jmol}.")
mol = rw_mol.GetMol()
# Write out the conformer
natom = len(jmol.symbols)
conf = Chem.Conformer(natom)
bohr2ang = ureg.conversion_factor("bohr", "angstrom")
for line in range(natom):
conf.SetAtomPosition(
line,
(
bohr2ang * jmol.geometry[line, 0],
bohr2ang * jmol.geometry[line, 1],
bohr2ang * jmol.geometry[line, 2],
),
)
mol.AddConformer(conf)
Chem.rdmolops.SanitizeMol(mol)
print(Chem.MolToSmiles(mol))
[H]c1c([H])c(C(=O)OC([H])([H])C([H])([H])[H])c([H])c(N+[O-])c1[H]
Any thoughts on waiting for the OFFTK release or changing the current scheme to reflect above?
Pining @jchodera as well.
Talked to @jthorton a bit about this today. Here's the our take on this:
@jthorton pointed out that we wouldn't need to make the QCEngine call take both the initial and final molecules if we could sneak CMILES into the extras
field of final molecules (as extras['initial_canonical_isomeric_explicit_hydrogen_mapped_smiles']
)
@dgasmith and I chatted about this recently. I still have to catch up on the details, but is it possible this is just a case where we need to harmonize the bond order scheme used by QCElemental with the bond order scheme we use within openforcefield.topology.Molecule
to ensure we can interconvert with RDKit/OpenEye?
We should also be using the openforcefield.topology.Molecule
object for all of the instantiation of toolkit molecules, simplifying the above conversion examples to something like
from openforcefield.topology import Molecule
molecule = Molecule.from_qcarchive(qcarchive_json)
rdmol = molecule.to_rdkit()
is it possible this is just a case where we need to harmonize the bond order scheme used by QCElemental with the bond order scheme we use within openforcefield.topology.Molecule to ensure we can interconvert with RDKit/OpenEye?
No, the issue is that graph molecules need atomic formal charges, which it isn't trivial to unambiguously deduce from the QCElemental molecule's information (connectivity + bond orders + elements + total charge). This is because since some atoms can have multiple accepted valence states, for example sulfur can have two bonds (thiol) or six bonds (sulfonamide).
If we hammer on the code for deducing the formal charge a bit, we could come up with a pretty good solution. But, like all code, it would make assumptions and have bugs, so we will want to keep a versioning scheme for it handy.
Thanks for clarifying!
I guess the important questions we need to ask are:
If we do need a simple heuristic, iterating over combinations of valid formal charge combinations in a manner that places negative charge on more electronegative molecules should work just fine.
Do we actually need to set the formal charges when creating a toolkit molecule?
As far as I know, RDKit needs the formal charge. Otherwise it will throw an error that the valance is wrong.
As far as I know, RDKit needs the formal charge. Otherwise it will throw an error that the valance is wrong.
@ChayaSt : What happens if we have one or more sites that have multiple possible valences, and we provide valid (but suboptimal) formal charges? Does it have any consequence for generating conformers or partial charges?
Do we actually need to set the formal charges when creating a toolkit molecule?
Yes. We have some SMARTS that get matched to specific formal charges https://github.com/openforcefield/openforcefields/blob/master/openforcefields/offxml/openff-1.0.0.offxml#L24
If so, if we use a simple heuristic to assign formal charges based on valence and net charge, would the cheminformatics toolkits correct the formal charges if they are incorrect?
Not sure. This issue originally came up because RDKit was being strict about nitrogens with too many bonds, so we'd need to look into to sanitization options that aren't part of our normal workflow. However there is probably a way to make it guess... Chem.MolFromPDB
must do something for this problem.
Even if the toolkits don't correct the formal charges, are there any consequences for any of the operations we carry out (such as assigning AM1-BCC charges or generating conformers) if the formal charges are incorrect but the valence and net charge are correct?
It shouldn't matter for QM -- the methods I know of don't WANT to know the formal charges. Just the total charge on the molecule. Conf gen might need to know -- AFAIK they'll use MMFF94 which probably needs knowledge of charges. And, per above, our SMARTS matching requires knowledge of formal charges.
@mattwthompson This may be largely solved by #472, though it would be good to test that solution. Further extensions may try to detect proton migrations/connectivity rearrangements by comparing the connectivity derived from geometry or Wiberg Bond Orders to that described in the CMILES.
@j-wags @bennybp, what is the status of the openmm torsion scans? I tried checking the status and I'm getting an error:
import qcfractal.interface as ptl
client = ptl.FractalClient()
dataset = client.get_collection('TorsionDriveDataset', 'OpenFF Substituted Phenyl Set 1')
dataset.status('openmm')
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/src/molssi/QCFractal/qcfractal/interface/collections/collection.py in get_specification(self, name)
372 try:
--> 373 return self.data.specs[name.lower()].copy()
374 except KeyError:
KeyError: 'openmm'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-1-73c09af38155> in <module>
2 client = ptl.FractalClient()
3 dataset = client.get_collection('TorsionDriveDataset', 'OpenFF Substituted Phenyl Set 1')
----> 4 dataset.status('openmm')
~/src/molssi/QCFractal/qcfractal/interface/collections/collection.py in status(self, specs, collapse, status, detail)
580 list_specs = []
581 for spec in specs:
--> 582 list_specs.append(self.query(spec))
583
584 def get_status(item):
~/src/molssi/QCFractal/qcfractal/interface/collections/collection.py in query(self, specification, force)
513 """
514 # Try to get the specification, will throw if not found.
--> 515 spec = self.get_specification(specification)
516
517 if not force and (spec.name in self.df):
~/src/molssi/QCFractal/qcfractal/interface/collections/collection.py in get_specification(self, name)
373 return self.data.specs[name.lower()].copy()
374 except KeyError:
--> 375 raise KeyError(f"Specification '{name}' not found.")
376
377 def list_specifications(self, description=True) -> Union[List[str], pd.DataFrame]:
KeyError: "Specification 'openmm' not found."
@ChayaSt The name of the specification (first argument to status) is default
dataset.status('default')
gives me 151 complete, 3 errors
(see dataset.list_specifications()
)
@bennybp, that is for the B3lyp torsion scans. I'm trying to access the MM torsion drive.
On second thought, this might have not been fully deployed in QCEngine.
I want to run torsion scans with Parsley and OpenMM on the OpenFF Substituted Phenyl Set 1. The JSON for the inputs are here:
https://github.com/openforcefield/qca-dataset-submission/blob/master/2019-07-25-phenyl-set/phenyl_set_input.tar.jz and https://github.com/openforcefield/qca-dataset-submission/blob/master/2019-07-25-phenyl-set/biphenyls_set_input.tar.gz
The second JSON file is higher priority than the first so those should be run first.
Thanks!