Remove dummy atoms - Githubissues

lilyminium commented 3 years ago

Describe the bug A clear and concise description of what the bug is.

~Checking molecule equality for molecules with dummy atoms raises an error because the equality check first checks the hill_formula, which requires elements. A dummy atom (atomic number 0) does not have an element and therefore raises a KeyError.~

The toolkit looks like it supports dummy atoms, but it doesn't. It should validate the input lest this come as a surprise to users. Please ignore the remainder of the issue, which was made under the assumption that dummy atoms are supported.

To Reproduce Steps to reproduce the behavior. A minimal reproducing set of python commands is ideal.

If the problem involves a specific molecule or file, please upload that as well.

from openff.toolkit.topology import Molecule
mol = Molecule.from_smiles("[*]H")
mol == mol

Output The full error message (may be large, that's fine. Better to paste too much than too little.)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-107-3735b8ff570b> in <module>
      1 mol = Molecule.from_smiles("[*]H")
----> 2 mol == mol

openff/toolkit/topology/molecule.py in __eq__(self, other)
   2226         # updated to use the new isomorphic checking method, with full matching
   2227         # TODO the doc string did not match the previous function what matching should this method do?
-> 2228         return Molecule.are_isomorphic(self, other, return_atom_map=False)[0]
   2229 
   2230     def to_smiles(

openff/toolkit/topology/molecule.py in are_isomorphic(mol1, mol2, return_atom_map, aromatic_matching, formal_charge_matching, bond_order_matching, atom_stereochemistry_matching, bond_stereochemistry_matching, strip_pyrimidal_n_atom_stereo, toolkit_registry)
   2576 
   2577         # Do a quick hill formula check first
-> 2578         if Molecule.to_hill_formula(mol1) != Molecule.to_hill_formula(mol2):
   2579             return False, None
   2580 

openff/toolkit/topology/molecule.py in to_hill_formula(molecule)
   3890         from collections import Counter
   3891 
-> 3892         atom_symbol_counts = Counter(
   3893             Element.getByAtomicNumber(atom_num).symbol for atom_num in atom_nums
   3894         )

~/anaconda3/envs/polymetrizer-viz/lib/python3.8/collections/__init__.py in __init__(self, iterable, **kwds)
    550         '''
    551         super(Counter, self).__init__()
--> 552         self.update(iterable, **kwds)
    553 
    554     def __missing__(self, key):

~/anaconda3/envs/polymetrizer-viz/lib/python3.8/collections/__init__.py in update(self, iterable, **kwds)
    635                     super(Counter, self).update(iterable) # fast path when counter is empty
    636             else:
--> 637                 _count_elements(self, iterable)
    638         if kwds:
    639             self.update(kwds)

openff/toolkit/topology/molecule.py in <genexpr>(.0)
   3891 
   3892         atom_symbol_counts = Counter(
-> 3893             Element.getByAtomicNumber(atom_num).symbol for atom_num in atom_nums
   3894         )
   3895 

simtk/openmm/app/element.py in getByAtomicNumber(atomic_number)
    105     @staticmethod
    106     def getByAtomicNumber(atomic_number):
--> 107         return Element._elements_by_atomic_number[atomic_number]
    108 
    109     @staticmethod

KeyError: 0

Computing environment (please complete the following information):

Operating system
Output of running conda list

Additional context Add any other context about the problem here.

Working molecule:

mol = Molecule.from_smiles("[C]H")
mol == mol

In general the KeyError leads to some unexpected behaviour. Could the toolkit return an element of None for all invalid atomic numbers?

j-wags commented 3 years ago

Hm, in general I don't think we want to support dummy atoms -- Anything that becomes an OFFMol should be a "real molecule" that can be simulated. Though maybe I'm missing a use case here?

CC @jeff231li, who had asked about dummy atoms before (though I don't think he wanted them in a molecule's graph, just as a placeholder in the topology)

lilyminium commented 3 years ago

In that case I guess you'd want to validate inputs from SMILES, RDKit etc. because at the moment it does look like dummy atoms are supported -- I predict that people like me will continue to raise similar issues until this is made clear. My use case is that I was writing code inspired by @SimonBoothroyd's constructure to build molecules and using the OpenFF Molecule to hold information as a way to straddle OpenEye and RDKit :)

lilyminium commented 3 years ago

To put it another way, as a user I would expect any Molecule I successfully created to be able to use the methods associated with Molecule, particularly something as fundamental as __eq__. I don't mind if I can't successfully create a Molecule with dummy atoms, but since I can, I am surprised at the behaviour reported here.

Although I hope I can continue to make Molecules with dummy atoms so I don't have to re-write my code 😅

adalke commented 3 years ago

Still getting used to using GitHub. Left a comment in the PR at https://github.com/openforcefield/openff-toolkit/pull/978 rather than realizing the discussion was here.

In short, adding dummy atom support (wildcard "*" atom in SMILES, "R" atom in SDF) requires changing simtk to have an Element with atomic number 0, or requires patching around its lack of support.

adalke commented 3 years ago

My use case is that I was writing code inspired by @SimonBoothroyd's constructure to build molecules and using the OpenFF Molecule to hold information as a way to straddle OpenEye and RDKit :)

I took a look at that package. I think there are a couple of approaches you could take.

One is to switch to using some other atom as the intermediate, like "Np" instead of "*". It's unlikely you'll find neptunium in your data set!

Another might be to do the enumeration completely in SMILES space, rather than going through intermediate wildcards and a toolkit-based reaction. Here's a decidedly incomplete sketch, which depends on the existing layout with ([R1]), ([R2]) ... in the scaffold SMILES and [R] as the first three characters of the R-group SMILES extension:

import re
from dataclasses import dataclass
from typing import Dict, List
import itertools

@dataclass
class Scaffold:
    smiles: str
    r_groups: Dict[int, List[str]]

SCAFFOLDS = {
    # cation
    # anion
    # carbonyl compound
    "aldehyde": Scaffold(
        smiles="C([R1])(=O)", r_groups={1: ["hydrogen", "alkyl", "aryl"]}
    ),
    "ketone": Scaffold(
        smiles="C([R1])(=O)([R2])",
        r_groups={1: ["alkyl", "aryl"], 2: ["alkyl", "aryl"]},
    ),
}

_functional_groups = {
    "hydrogen": {
        "hydrogen": "[R][H]",
        },

    # Alkanes
    "alkyl": {
        "methyl": "[R]C",
        "ethyl": "[R]CC",
        "isopropyl": "[R]C(C)(C)",
        "tert-butyl": "[R]C(C)(C)(C)",
        },
    # Aryl
    "aryl": {
        "phenyl": "[R]c1ccccc1",
        "pyridyl": "[R]c1ccncc1",
        "benzyl": "[R]Cc1ccccc1",
        },
    }
FUNCTIONAL_GROUPS = {group_name: list(r_groups.values()) for (group_name, r_groups) in _functional_groups.items()}

SUBSTITUENTS = {}
for group_name, group_substituents in _functional_groups.items():
    SUBSTITUENTS.update(group_substituents)

_Rn_atom_pat = re.compile(r"\(\[R([1-9])+]\)")

def enumerate_combinations(scaffold, substituents):
    smiles = scaffold.smiles
    parts = []

    # Remove the leading '[R]' and surround by "()"s for substitution in ([R1]), etc. matches
    trimmed_substituents = {}
    for r_group_idx, r_groups in substituents.items():
        trimmed_r_groups = []
        for r_group in r_groups:
            if not r_group.startswith("[R]"):
                raise ValueError(f"R-groups must start with '[R]': {r_group}")
            trimmed_r_groups.append("(" + r_group[3:] + ")")
        trimmed_substituents[r_group_idx] = trimmed_r_groups

    # Find the ([R1]) etc. matches and the intermediate strings
    pos = 0
    for match in _Rn_atom_pat.finditer(smiles):
        start = match.start()
        end = match.end()
        if start > pos:
            parts.append([smiles[pos:start]]) # single element substring

        r_group = int(match.group(1))
        parts.append(trimmed_substituents[r_group])

        pos = end
    if pos < len(scaffold.smiles): # Final part of the SMILES
        parts.append( [scaffold.smiles[pos:]] )

    for terms in itertools.product(*parts): # Enumerate the component parts
        yield "".join(terms)

print(list(enumerate_combinations(
    SCAFFOLDS["ketone"], {
        1: ["[R]C", "[R]c1ccccc1"],
        2: FUNCTIONAL_GROUPS["alkyl"],
        })))

This generates:

['C(C)(=O)(C)', 'C(C)(=O)(CC)', 'C(C)(=O)(C(C)(C))', 'C(C)(=O)(C(C)(C)(C))', 'C(c1ccccc1)(=O)(C)', 'C(c1ccccc1)(=O)(CC)', 'C(c1ccccc1)(=O)(C(C)(C))', 'C(c1ccccc1)(=O)(C(C)(C)(C))']

Note that I don't touch the classification or validation in the existing code! But if you only need this chemistry-toolkit-agnostic enumeration, then perhaps something like the above, once debugged, might work in an OpenFF environment?

lilyminium commented 3 years ago

Thanks for taking a look @adalke! The code I'm referring to is actually the much less mature polymetrizer package that doesn't use substituents in the way Simon does, where each substituent has one R-group; rather, the input is multiple molecules with potentially multiple R-groups, which are each treated as Monomers or "residues" in a putative OpenFF residue-forcefield framework. polymetrizer enumerates combinations of them (Oligomers) and aggregates parametrized systems to build a residue-based force field. Right now I'm using it to check if AM1BCC charges assembled this way, from parts of a larger molecule, fall within a tolerable range of charges calculated for the entire larger molecule.

You're probably right that the whole thing could be done in SMARTS/SMILES space, though. I'm kicking myself for not thinking of that earlier... and it would be faster.

adalke commented 3 years ago

No need to kick yourself. I know how long it took me to realize how much cheminformatics could be done as syntax manipulation in SMILES space rather than molecular graph manipulation in toolkit space, and I continue to come up with new ways to work directly on SMILES strings.

One that I developed a few years ago was to re-write attachment points, noted using a wildcard like *C(P)CCO into bond closure notation like C%90(P)CCO so components could be joined like: C%90(P)CCO.CC%90. I call this method "welding", and used it in my matched molecular pair program mmpdb to join multiple fragments.

(Note that it's a bit trickier to handle stereochemisty correctly as moving the "*" form from the left to a closure on the right require swapping "@" and "@@" chirality, and that technique only works for tetrahedral stereochemistry .... which is all that RDKit supports. :) )

Many years ago there was a paper on using the the same idea - bond-closures + dot-disconnect - to turn combinatorial enumeration into a SMILES string combinatorial enumeration followed by canonicalization. See http://www.dalkescientific.com/writings/diary/archive/2004/12/12/library_generation_with_smiles.html where I describe the approach. (The paper was published earlier but I didn't know about it.)

openforcefield / openff-toolkit

Remove dummy atoms #977