openforcefield / openff-evaluator

A physical property evaluation toolkit from the Open Forcefield Consortium.
https://docs.openforcefield.org/projects/evaluator
MIT License
55 stars 18 forks source link

FilterBySmirks does not work with isotopes #502

Closed jthorton closed 1 year ago

jthorton commented 1 year ago

Filtering molecules with isotopes from thermoML with evaluator does not work as the openff-toolkit molecules used during the filtering don't track if the atom is an isotope or not.

Example

from openff.evaluator.datasets.curation.components.filtering import FilterBySmirks
from rdkit import Chem
mol_smiles = "[2H]OCCCCC"
smirks = "[2H]"
FilterBySmirks._find_smirks_matches(mol_smiles, smirks)

[]

rdkit_mol = Chem.MolFromSmiles(mol_smiles)
rdkit_mol.GetSubstructMatch(Chem.MolFromSmarts(smirks))

(0,)

Maybe it would be better to use rdkit for this function or provide a specific filter which can handle this using a simple search in the string like mol_smiles.find("[2H]")?

conda list openff

openff-evaluator          0.3.11+7.g48535e0           dev_0    <develop>
openff-forcefields        2.0.0              pyh6c4a22f_0    conda-forge
openff-recharge           0.1.1              pyhd8ed1ab_0    conda-forge
openff-toolkit-base       0.10.6             pyhd8ed1ab_0    conda-forge
openff-units              0.1.5              pyh6c4a22f_0    conda-forge
openff-utilities          0.1.3              pyh6c4a22f_0    conda-forge
mattwthompson commented 1 year ago

Wow - nice catch. This causes tons of problems in the toolkit, actually. It's easy to find separate cases of the toolkit ignoring the isotopes and choking on them:

>>> [atom.mass.m for atom in Molecule.from_smiles("[2H]O[2H]").atoms]
[1.007947, 15.99943, 1.007947]
>>> Molecule.from_smiles("[13C]")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mattthompson/mambaforge/envs/smirnoff-plugins-test/lib/python3.9/site-packages/openff/toolkit/topology/molecule.py", line 1807, in from_smiles
    molecule = toolkit_registry.call(
  File "/Users/mattthompson/mambaforge/envs/smirnoff-plugins-test/lib/python3.9/site-packages/openff/toolkit/utils/toolkit_registry.py", line 356, in call
    raise e
  File "/Users/mattthompson/mambaforge/envs/smirnoff-plugins-test/lib/python3.9/site-packages/openff/toolkit/utils/toolkit_registry.py", line 352, in call
    return method(*args, **kwargs)
  File "/Users/mattthompson/mambaforge/envs/smirnoff-plugins-test/lib/python3.9/site-packages/openff/toolkit/utils/rdkit_wrapper.py", line 1035, in from_smiles
    molecule = self.from_rdkit(
  File "/Users/mattthompson/mambaforge/envs/smirnoff-plugins-test/lib/python3.9/site-packages/openff/toolkit/utils/rdkit_wrapper.py", line 1756, in from_rdkit
    raise RadicalsNotSupportedError(
openff.toolkit.utils.exceptions.RadicalsNotSupportedError: The OpenFF Toolkit does not currently support parsing molecules with S- and P-block radicals. Found 4 radical electrons on molecule [13C].

Prior to having thought about this much, I wonder if rolling a custom solution here would be easier than getting isotope support into the toolkit.

Erroring out if an isotope is passed through would be an improvement in the sense that unsupported behavior (even unintentionally unsupported) is handled more gracefully, but I figure that won't actually be an improvement in getting research done.

mattwthompson commented 1 year ago

https://github.com/openforcefield/openff-toolkit/issues/974

jthorton commented 1 year ago

I wonder if rolling a custom solution here would be easier than getting isotope support into the toolkit.

I agree that a custom solution here might be easier, I wonder if all smirks matching might be better done using rdkit directly to save time as the current workflow as I understand it goes smiles -> parse with rdkit/openeye -> convert to off-Mol -> substructure search with rdkit/openeye. When filtering 70k records in thermoML this could save some time and give the correct behaviour.

mattwthompson commented 1 year ago

I was also a bit worried that Evaluator didn't depend on RDKit but I checked and it's explicitly listed; I guess the -base trick is just used to avoid AmberTools.

mattwthompson commented 1 year ago

Isotope filtering

mattwthompson commented 1 year ago

Mostly resolved with #503 / release v0.4.3. There might be performance optimizations left on the table - didn't have the time to look deep enough into that.