openforcefield / qca-dataset-submission

Data generation and submission scripts for the QCArchive ecosystem.
Other
29 stars 6 forks source link

Run OpenMM Parsley torsion scans #79

Open ChayaSt opened 4 years ago

ChayaSt commented 4 years ago

I want to run torsion scans with Parsley and OpenMM on the OpenFF Substituted Phenyl Set 1. The JSON for the inputs are here:

https://github.com/openforcefield/qca-dataset-submission/blob/master/2019-07-25-phenyl-set/phenyl_set_input.tar.jz and https://github.com/openforcefield/qca-dataset-submission/blob/master/2019-07-25-phenyl-set/biphenyls_set_input.tar.gz

The second JSON file is higher priority than the first so those should be run first.

Thanks!

dgasmith commented 4 years ago

Can we just run this on all of OpenFF Substituted Phenyl Set 1?

ChayaSt commented 4 years ago

Yes - All the molecules in the linked JSONs are in the OpenF Substituted Phenyl Set 1

dgasmith commented 4 years ago

Still some issues in running this through QCEngine:

import qcfractal.interface as ptl
import qcengine as qcng

client = ptl.FractalClient()

ds = client.get_collection("torsiondrivedataset", "OpenFF Substituted Phenyl Set 1")
ds.query("default")

good = []
for idx, row in ds.df["default"].items():

    mol = client.query_molecules(id=row.initial_molecule[0])[0]

    inp = {
    "molecule": mol,
    "driver": "energy",
    "model": {"method": "openff-1.0.0", "basis": "smirnoff"}
    }

    ret = qcng.compute(inp, "openmm")
    if ret.success:
        good.append(idx)

print(good)
print(len(good))
print(ds.df.shape)

78/154 succeed at the moment.

j-wags commented 4 years ago

I didn't run the whole set, but many of the errors seemed to be RDKit's "a neutral nitrogen has four bonds" complaint. I see that QCEngine screens to remove molecules with net charge, but in this case, zwitterions like a nitro group (with N+ and O-) are causing the trouble. The relevant code is here: https://github.com/MolSSI/QCEngine/blob/c8ae155267a69af0be2a63034ea9224bbebd9d34/qcengine/programs/rdkit.py#L33-L74

The next release of OFFTK (mid-March, I think) will also include @jthorton's OFFMol.from_qcschema and OFFMol.from_tagged_smiles functions, either of which should be able to solve this as long as there's a (S/C)MILES around.

Another option (and I haven't thought this fully through) is that knowing net charge, elements, connectivity (with bond orders), and explicit Hs, MAY let us guess suitable formal charges for a graph mol. I did this in a notebook to see if it would work in the RDKit Harness and it may be dangerous, but it works. Here's a code snippet showing how the RDKit Harness code could be modified to work for the example above.

from rdkit import Chem
from qcengine.units import ureg
import qcportal as ptl
import qcengine as qcng

client = ptl.FractalClient()

ds = client.get_collection("torsiondrivedataset", "OpenFF Substituted Phenyl Set 1")
ds.query("default")

good = []
items = [i for i in ds.df["default"].items()]
idx, row = items[7]
mol = client.query_molecules(id=row.initial_molecule[0])[0]

### This part is in RDKitHarness._process_molecule_rdkit

# Handle errors
if abs(jmol.molecular_charge) > 1.0e-6:
    raise InputError("RDKit does not currently support charged molecules.")

if not jmol.connectivity:  # Check for empty list
    raise InputError("RDKit requires molecules to have a connectivity graph.")

# Build out the base molecule
base_mol = Chem.Mol()
rw_mol = Chem.RWMol(base_mol)
for sym in jmol.symbols:
    rw_mol.AddAtom(Chem.Atom(sym.title()))

# Add in connectivity
bond_types = {1: Chem.BondType.SINGLE, 2: Chem.BondType.DOUBLE, 3: Chem.BondType.TRIPLE}
for atom1, atom2, bo in jmol.connectivity:
    rw_mol.AddBond(atom1, atom2, bond_types[bo])

# Predict formal charges on atoms (needed to construct valid graph molecule)
# NOTE: This is only sufficient if 
# 1) all hydrogens are explicit
# 2) bond orders are known
# 3) The net charge on the molecule is known (possible valences of P and S can introduce ambiguity)
neutral_valences = {'H': (1,), 'C': (4,), 'O': (2,), 
                    'N': (3,), 'P': (4, 6), 'S': (2, 6), 
                    'F': (1,), 'Cl': (1,), 'Br': (1,), 
                    'I': (1,)}
net_charge = 0
for atom in rw_mol.GetAtoms():
    # Get the sum of the bond orders involving this atom
    valence_sum = sum([bond.GetBondType() for bond in atom.GetBonds()])
    element_symbol = qcelemental.periodictable.E[atom.GetAtomicNum()]
    if element_symbol not in neutral_valences:
        raise InputError(f"RDKit harness does not currently support element {element_symbol}")
    possible_charges = [valence_sum - neutral_valence for neutral_valence in neutral_valences[element_symbol]]
    # Sort to put the lowest-magnitude charge first
    possible_charges.sort(key=lambda x: abs(x))
    atom_charge = possible_charges[0]
    atom.SetFormalCharge(atom_charge)
    net_charge += atom_charge
if net_charge != jmol.molecular_charge:
    raise InputError(f"RDKit harness is unable to deduce formal charges on molecule {jmol}.")

mol = rw_mol.GetMol()

# Write out the conformer
natom = len(jmol.symbols)
conf = Chem.Conformer(natom)
bohr2ang = ureg.conversion_factor("bohr", "angstrom")
for line in range(natom):
    conf.SetAtomPosition(
        line,
        (
            bohr2ang * jmol.geometry[line, 0],
            bohr2ang * jmol.geometry[line, 1],
            bohr2ang * jmol.geometry[line, 2],
        ),
    )

mol.AddConformer(conf)

Chem.rdmolops.SanitizeMol(mol)

print(Chem.MolToSmiles(mol))

[H]c1c([H])c(C(=O)OC([H])([H])C([H])([H])[H])c([H])c(N+[O-])c1[H]

dgasmith commented 4 years ago

Any thoughts on waiting for the OFFTK release or changing the current scheme to reflect above?

Pining @jchodera as well.

j-wags commented 4 years ago

Talked to @jthorton a bit about this today. Here's the our take on this:

@jthorton pointed out that we wouldn't need to make the QCEngine call take both the initial and final molecules if we could sneak CMILES into the extras field of final molecules (as extras['initial_canonical_isomeric_explicit_hydrogen_mapped_smiles'])

jchodera commented 4 years ago

@dgasmith and I chatted about this recently. I still have to catch up on the details, but is it possible this is just a case where we need to harmonize the bond order scheme used by QCElemental with the bond order scheme we use within openforcefield.topology.Molecule to ensure we can interconvert with RDKit/OpenEye?

We should also be using the openforcefield.topology.Molecule object for all of the instantiation of toolkit molecules, simplifying the above conversion examples to something like

from openforcefield.topology import Molecule
molecule = Molecule.from_qcarchive(qcarchive_json)
rdmol = molecule.to_rdkit()
j-wags commented 4 years ago

is it possible this is just a case where we need to harmonize the bond order scheme used by QCElemental with the bond order scheme we use within openforcefield.topology.Molecule to ensure we can interconvert with RDKit/OpenEye?

No, the issue is that graph molecules need atomic formal charges, which it isn't trivial to unambiguously deduce from the QCElemental molecule's information (connectivity + bond orders + elements + total charge). This is because since some atoms can have multiple accepted valence states, for example sulfur can have two bonds (thiol) or six bonds (sulfonamide).

If we hammer on the code for deducing the formal charge a bit, we could come up with a pretty good solution. But, like all code, it would make assumptions and have bugs, so we will want to keep a versioning scheme for it handy.

jchodera commented 4 years ago

Thanks for clarifying!

I guess the important questions we need to ask are:

  1. Do we actually need to set the formal charges when creating a toolkit molecule?
  2. If so, if we use a simple heuristic to assign formal charges based on valence and net charge, would the cheminformatics toolkits correct the formal charges if they are incorrect?
  3. Even if the toolkits don't correct the formal charges, are there any consequences for any of the operations we carry out (such as assigning AM1-BCC charges or generating conformers) if the formal charges are incorrect but the valence and net charge are correct?

If we do need a simple heuristic, iterating over combinations of valid formal charge combinations in a manner that places negative charge on more electronegative molecules should work just fine.

jchodera commented 4 years ago

This paper could be useful in the future: [PDF]

ChayaSt commented 4 years ago

Do we actually need to set the formal charges when creating a toolkit molecule?

As far as I know, RDKit needs the formal charge. Otherwise it will throw an error that the valance is wrong.

jchodera commented 4 years ago

As far as I know, RDKit needs the formal charge. Otherwise it will throw an error that the valance is wrong.

@ChayaSt : What happens if we have one or more sites that have multiple possible valences, and we provide valid (but suboptimal) formal charges? Does it have any consequence for generating conformers or partial charges?

j-wags commented 4 years ago

Do we actually need to set the formal charges when creating a toolkit molecule?

Yes. We have some SMARTS that get matched to specific formal charges https://github.com/openforcefield/openforcefields/blob/master/openforcefields/offxml/openff-1.0.0.offxml#L24

If so, if we use a simple heuristic to assign formal charges based on valence and net charge, would the cheminformatics toolkits correct the formal charges if they are incorrect?

Not sure. This issue originally came up because RDKit was being strict about nitrogens with too many bonds, so we'd need to look into to sanitization options that aren't part of our normal workflow. However there is probably a way to make it guess... Chem.MolFromPDB must do something for this problem.

Even if the toolkits don't correct the formal charges, are there any consequences for any of the operations we carry out (such as assigning AM1-BCC charges or generating conformers) if the formal charges are incorrect but the valence and net charge are correct?

It shouldn't matter for QM -- the methods I know of don't WANT to know the formal charges. Just the total charge on the molecule. Conf gen might need to know -- AFAIK they'll use MMFF94 which probably needs knowledge of charges. And, per above, our SMARTS matching requires knowledge of formal charges.

j-wags commented 4 years ago

@mattwthompson This may be largely solved by #472, though it would be good to test that solution. Further extensions may try to detect proton migrations/connectivity rearrangements by comparing the connectivity derived from geometry or Wiberg Bond Orders to that described in the CMILES.

ChayaSt commented 4 years ago

@j-wags @bennybp, what is the status of the openmm torsion scans? I tried checking the status and I'm getting an error:

import qcfractal.interface as ptl
client = ptl.FractalClient()
dataset = client.get_collection('TorsionDriveDataset', 'OpenFF Substituted Phenyl Set 1')
dataset.status('openmm')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/src/molssi/QCFractal/qcfractal/interface/collections/collection.py in get_specification(self, name)
    372         try:
--> 373             return self.data.specs[name.lower()].copy()
    374         except KeyError:

KeyError: 'openmm'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-1-73c09af38155> in <module>
      2 client = ptl.FractalClient()
      3 dataset = client.get_collection('TorsionDriveDataset', 'OpenFF Substituted Phenyl Set 1')
----> 4 dataset.status('openmm')

~/src/molssi/QCFractal/qcfractal/interface/collections/collection.py in status(self, specs, collapse, status, detail)
    580                 list_specs = []
    581                 for spec in specs:
--> 582                     list_specs.append(self.query(spec))
    583 
    584             def get_status(item):

~/src/molssi/QCFractal/qcfractal/interface/collections/collection.py in query(self, specification, force)
    513         """
    514         # Try to get the specification, will throw if not found.
--> 515         spec = self.get_specification(specification)
    516 
    517         if not force and (spec.name in self.df):

~/src/molssi/QCFractal/qcfractal/interface/collections/collection.py in get_specification(self, name)
    373             return self.data.specs[name.lower()].copy()
    374         except KeyError:
--> 375             raise KeyError(f"Specification '{name}' not found.")
    376 
    377     def list_specifications(self, description=True) -> Union[List[str], pd.DataFrame]:

KeyError: "Specification 'openmm' not found."
bennybp commented 4 years ago

@ChayaSt The name of the specification (first argument to status) is default

dataset.status('default') gives me 151 complete, 3 errors

(see dataset.list_specifications())

ChayaSt commented 4 years ago

@bennybp, that is for the B3lyp torsion scans. I'm trying to access the MM torsion drive.

On second thought, this might have not been fully deployed in QCEngine.