Open lilyminium opened 3 years ago
Hm, in general I don't think we want to support dummy atoms -- Anything that becomes an OFFMol should be a "real molecule" that can be simulated. Though maybe I'm missing a use case here?
CC @jeff231li, who had asked about dummy atoms before (though I don't think he wanted them in a molecule's graph, just as a placeholder in the topology)
In that case I guess you'd want to validate inputs from SMILES, RDKit etc. because at the moment it does look like dummy atoms are supported -- I predict that people like me will continue to raise similar issues until this is made clear. My use case is that I was writing code inspired by @SimonBoothroyd's constructure
to build molecules and using the OpenFF Molecule to hold information as a way to straddle OpenEye and RDKit :)
To put it another way, as a user I would expect any Molecule I successfully created to be able to use the methods associated with Molecule, particularly something as fundamental as __eq__
. I don't mind if I can't successfully create a Molecule with dummy atoms, but since I can, I am surprised at the behaviour reported here.
Although I hope I can continue to make Molecules with dummy atoms so I don't have to re-write my code 😅
Still getting used to using GitHub. Left a comment in the PR at https://github.com/openforcefield/openff-toolkit/pull/978 rather than realizing the discussion was here.
In short, adding dummy atom support (wildcard "*" atom in SMILES, "R" atom in SDF) requires changing simtk to have an Element with atomic number 0, or requires patching around its lack of support.
My use case is that I was writing code inspired by @SimonBoothroyd's
constructure
to build molecules and using the OpenFF Molecule to hold information as a way to straddle OpenEye and RDKit :)
I took a look at that package. I think there are a couple of approaches you could take.
One is to switch to using some other atom as the intermediate, like "Np" instead of "*". It's unlikely you'll find neptunium in your data set!
Another might be to do the enumeration completely in SMILES space, rather than going through intermediate wildcards and a toolkit-based reaction. Here's a decidedly incomplete sketch, which depends on the existing layout with ([R1]), ([R2]) ... in the scaffold SMILES and [R] as the first three characters of the R-group SMILES extension:
import re
from dataclasses import dataclass
from typing import Dict, List
import itertools
@dataclass
class Scaffold:
smiles: str
r_groups: Dict[int, List[str]]
SCAFFOLDS = {
# cation
# anion
# carbonyl compound
"aldehyde": Scaffold(
smiles="C([R1])(=O)", r_groups={1: ["hydrogen", "alkyl", "aryl"]}
),
"ketone": Scaffold(
smiles="C([R1])(=O)([R2])",
r_groups={1: ["alkyl", "aryl"], 2: ["alkyl", "aryl"]},
),
}
_functional_groups = {
"hydrogen": {
"hydrogen": "[R][H]",
},
# Alkanes
"alkyl": {
"methyl": "[R]C",
"ethyl": "[R]CC",
"isopropyl": "[R]C(C)(C)",
"tert-butyl": "[R]C(C)(C)(C)",
},
# Aryl
"aryl": {
"phenyl": "[R]c1ccccc1",
"pyridyl": "[R]c1ccncc1",
"benzyl": "[R]Cc1ccccc1",
},
}
FUNCTIONAL_GROUPS = {group_name: list(r_groups.values()) for (group_name, r_groups) in _functional_groups.items()}
SUBSTITUENTS = {}
for group_name, group_substituents in _functional_groups.items():
SUBSTITUENTS.update(group_substituents)
_Rn_atom_pat = re.compile(r"\(\[R([1-9])+]\)")
def enumerate_combinations(scaffold, substituents):
smiles = scaffold.smiles
parts = []
# Remove the leading '[R]' and surround by "()"s for substitution in ([R1]), etc. matches
trimmed_substituents = {}
for r_group_idx, r_groups in substituents.items():
trimmed_r_groups = []
for r_group in r_groups:
if not r_group.startswith("[R]"):
raise ValueError(f"R-groups must start with '[R]': {r_group}")
trimmed_r_groups.append("(" + r_group[3:] + ")")
trimmed_substituents[r_group_idx] = trimmed_r_groups
# Find the ([R1]) etc. matches and the intermediate strings
pos = 0
for match in _Rn_atom_pat.finditer(smiles):
start = match.start()
end = match.end()
if start > pos:
parts.append([smiles[pos:start]]) # single element substring
r_group = int(match.group(1))
parts.append(trimmed_substituents[r_group])
pos = end
if pos < len(scaffold.smiles): # Final part of the SMILES
parts.append( [scaffold.smiles[pos:]] )
for terms in itertools.product(*parts): # Enumerate the component parts
yield "".join(terms)
print(list(enumerate_combinations(
SCAFFOLDS["ketone"], {
1: ["[R]C", "[R]c1ccccc1"],
2: FUNCTIONAL_GROUPS["alkyl"],
})))
This generates:
['C(C)(=O)(C)', 'C(C)(=O)(CC)', 'C(C)(=O)(C(C)(C))', 'C(C)(=O)(C(C)(C)(C))', 'C(c1ccccc1)(=O)(C)', 'C(c1ccccc1)(=O)(CC)', 'C(c1ccccc1)(=O)(C(C)(C))', 'C(c1ccccc1)(=O)(C(C)(C)(C))']
Note that I don't touch the classification or validation in the existing code! But if you only need this chemistry-toolkit-agnostic enumeration, then perhaps something like the above, once debugged, might work in an OpenFF environment?
Thanks for taking a look @adalke! The code I'm referring to is actually the much less mature polymetrizer package that doesn't use substituents in the way Simon does, where each substituent has one R-group; rather, the input is multiple molecules with potentially multiple R-groups, which are each treated as Monomers or "residues" in a putative OpenFF residue-forcefield framework. polymetrizer
enumerates combinations of them (Oligomers) and aggregates parametrized systems to build a residue-based force field. Right now I'm using it to check if AM1BCC charges assembled this way, from parts of a larger molecule, fall within a tolerable range of charges calculated for the entire larger molecule.
You're probably right that the whole thing could be done in SMARTS/SMILES space, though. I'm kicking myself for not thinking of that earlier... and it would be faster.
No need to kick yourself. I know how long it took me to realize how much cheminformatics could be done as syntax manipulation in SMILES space rather than molecular graph manipulation in toolkit space, and I continue to come up with new ways to work directly on SMILES strings.
One that I developed a few years ago was to re-write attachment points, noted using a wildcard like *C(P)CCO
into bond closure notation like C%90(P)CCO
so components could be joined like: C%90(P)CCO.CC%90
. I call this method "welding", and used it in my matched molecular pair program mmpdb to join multiple fragments.
(Note that it's a bit trickier to handle stereochemisty correctly as moving the "*" form from the left to a closure on the right require swapping "@" and "@@" chirality, and that technique only works for tetrahedral stereochemistry .... which is all that RDKit supports. :) )
Many years ago there was a paper on using the the same idea - bond-closures + dot-disconnect - to turn combinatorial enumeration into a SMILES string combinatorial enumeration followed by canonicalization. See http://www.dalkescientific.com/writings/diary/archive/2004/12/12/library_generation_with_smiles.html where I describe the approach. (The paper was published earlier but I didn't know about it.)
Describe the bug A clear and concise description of what the bug is.
~Checking molecule equality for molecules with dummy atoms raises an error because the equality check first checks the hill_formula, which requires elements. A dummy atom (atomic number 0) does not have an element and therefore raises a KeyError.~
The toolkit looks like it supports dummy atoms, but it doesn't. It should validate the input lest this come as a surprise to users. Please ignore the remainder of the issue, which was made under the assumption that dummy atoms are supported.
To Reproduce Steps to reproduce the behavior. A minimal reproducing set of python commands is ideal.
If the problem involves a specific molecule or file, please upload that as well.
Output The full error message (may be large, that's fine. Better to paste too much than too little.)
Computing environment (please complete the following information):
conda list
Additional context Add any other context about the problem here.
Working molecule:
In general the KeyError leads to some unexpected behaviour. Could the toolkit return an element of
None
for all invalid atomic numbers?