OpenEye Toolkit 2023.1.0 fails to assign am1bccelf10 charges that work in 2022.1.1

openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io

http://openforcefield.org

MIT License

318 stars 92 forks source link

OpenEye Toolkit 2023.1.0 fails to assign am1bccelf10 charges that work in 2022.1.1 #1736

Open ntBre opened 1 year ago

ntBre commented 1 year ago

Describe the bug The newer version of openeye-toolkits fails to assign partial charges where the previous version did so successfully. The core of the attached Python script is

from openff.toolkit.utils.toolkits import OpenEyeToolkitWrapper
oe = OpenEyeToolkitWrapper()
oe.assign_partial_charges(mol, partial_charge_method="am1bccelf10")

where mol is an openff.toolkit.Molecule.

To Reproduce Create two conda environments:

mamba create -n openeye-2022.1.1 -c openeye 'openeye-toolkits=2022.1.1' openff-qcsubmit
mamba create -n openeye-2023.1.0 -c openeye 'openeye-toolkits=2023.1.0' openff-qcsubmit

Run the included example.py script with both environments

mamba run -n openeye-2022.1.1 --no-capture-output python example.py
mamba run -n openeye-2023.1.0 --no-capture-output python example.py

Output The first one will print 52 succeeded, 0 failed, while the second will print 0 succeeded, 52 failed.

Additional context For an even smaller test set, the 2 SMILES below fail to charge in the new version but charge in the old version:

[H:11][c:1]1[c:2]([c:4]([n:9][c:5]([c:3]1[H:13])[N@:10]([C:7]([H:18])([H:19])[H:20])[C:8]([H:21])([H:22])[C:6]([H:15])([H:16])[H:17])[H:14])[H:12]
[H:16][c:1]1[c:3]([c:7]([c:11]([c:8]([c:4]1[H:19])[H:23])[C:14]([H:29])([H:30])[N@:15]([c:12]2[c:9]([c:5]([c:2]([c:6]([c:10]2[H:25])[H:21])[H:17])[H:20])[H:24])[C:13]([H:26])([H:27])[H:28])[H:22])[H:18]

And these two charge successfully in both cases:

[H:20][c:1]1[c:2]([c:8]([c:14]([c:9]([c:3]1[H:22])[H:28])[c:15]2[c:10]([c:4]([c:5]([c:11]([c:16]2[c:17]3[c:12]([c:6]([c:7]([c:13]([c:18]3[C:19]([H:33])([H:34])[H:35])[H:32])[H:26])[H:25])[H:31])[H:30])[H:24])[H:23])[H:29])[H:27])[H:21]
[H:17][c:1]1[c:2]([c:5]([c:9]([c:8]([c:4]1[H:20])[C:10]2=[C:6]([C:3](=[C:7]([C:11](=[O:15])[N:14]2[C:12]([H:24])([H:25])[H:26])[H:23])[H:19])[H:22])[O:16][C:13]([H:27])([H:28])[H:29])[H:21])[H:18]

repro.zip

mattwthompson commented 1 year ago

It looks like the error stems from conformer generation, not charge assignment per se. I wonder if any of these settings could be fiddled with? The toolkit could do a better job reporting what went wrong (or maybe Omega doesn't provide detail)?


In [6]: mol = Molecule.from_mapped_smiles("[H:16][c:1]1[c:3]([c:7]([c:11]([c:8]([c:4]1[H:19])[H:23])[C:14]([H:29])([H:30])[N@:1
   ...: 5]([c:12]2[c:9]([c:5]([c:2]([c:6]([c:10]2[H:25])[H:21])[H:17])[H:20])[H:24])[C:13]([H:26])([H:27])[H:28])[H:22])[H:18]"
   ...: )

In [7]: from openff.toolkit.utils.toolkits import OpenEyeToolkitWrapper
   ...: oe = OpenEyeToolkitWrapper()
   ...: oe.assign_partial_charges(mol, partial_charge_method="am1bccelf10")
Warning: : Failed to build structure from CT
Warning: : Failed to build structure from CT
---------------------------------------------------------------------------
ConformerGenerationError                  Traceback (most recent call last)
Cell In[7], line 3
      1 from openff.toolkit.utils.toolkits import OpenEyeToolkitWrapper
      2 oe = OpenEyeToolkitWrapper()
----> 3 oe.assign_partial_charges(mol, partial_charge_method="am1bccelf10")

File ~/mambaforge/envs/openff-interchange-env/lib/python3.11/site-packages/openff/toolkit/utils/openeye_wrapper.py:2420, in OpenEyeToolkitWrapper.assign_partial_charges(self, molecule, partial_charge_method, use_conformers, strict_n_conformers, normalize_partial_charges, _cls)
   2418         mol_copy._conformers = None
   2419     else:
-> 2420         self.generate_conformers(
   2421             mol_copy,
   2422             n_conformers=charge_method["rec_confs"],
   2423             rms_cutoff=0.25 * unit.angstrom,
   2424             make_carboxylic_acids_cis=True,
   2425         )
   2426         # TODO: What's a "best practice" RMS cutoff to use here?
   2427 else:
   2428     mol_copy._conformers = None

File ~/mambaforge/envs/openff-interchange-env/lib/python3.11/site-packages/openff/toolkit/utils/openeye_wrapper.py:2182, in OpenEyeToolkitWrapper.generate_conformers(self, molecule, n_conformers, rms_cutoff, clear_existing, make_carboxylic_acids_cis)
   2180     new_status = omega(oemol)
   2181     if new_status is False:
-> 2182         raise ConformerGenerationError(
   2183             "OpenEye Omega conformer generation failed"
   2184         )
   2186 molecule2 = self.from_openeye(
   2187     oemol, allow_undefined_stereo=True, _cls=molecule.__class__
   2188 )
   2190 if clear_existing:

ConformerGenerationError: OpenEye Omega conformer generation failed

ntBre commented 1 year ago

Ah yes, I should have mentioned that in the issue! The root problem is indeed the Omega conformer generation.

I also noticed that despite there being 52 records, there are only 13 unique molecules. So a smaller, complete test set is here:

[H:19][C:4]1([C:1]2=[C:2]([N:14]=[C:3]([N:15]2[C:11]([H:33])([H:34])[H:35])[N:17]3[C:8]([C:6]([C:5]([C:7]([C:9]3([H:29])[H:30])([H:25])[H:26])([H:21])[H:22])([H:23])[H:24])([H:27])[H:28])[N@:16]([C:10]([N@:18]1[C:13]([H:39])([H:40])[H:41])([H:31])[H:32])[C:12]([H:36])([H:37])[H:38])[H:20]
[H:19][C:4]1([C:1]2=[C:2]([N:13]=[C:3]([N:14]2[C:10]([H:31])([H:32])[H:33])[N:17]3[C:5]([C:7]([N:18]([C:8]([C:6]3([H:23])[H:24])([H:27])[H:28])[C:12]([H:37])([H:38])[H:39])([H:25])[H:26])([H:21])[H:22])[N@:16]([C:9]([N:15]1[H:40])([H:29])[H:30])[C:11]([H:34])([H:35])[H:36])[H:20]
[H:18][c:1]1[c:2]([n:14][c:4]([c:3]([n:13]1)[N@:15]2[C:5]([C@@:7]([C@:8]([C:6]2([H:22])[H:23])([C:10]([H:28])([H:29])[H:30])[O:17][H:37])([H:24])[C:9]([H:25])([H:26])[H:27])([H:20])[H:21])[N:16]([C:11]([H:31])([H:32])[H:33])[C:12]([H:34])([H:35])[H:36])[H:19]
[H:26][c:2]1[c:4]([c:3]([c:6]([n:22][c:5]1[C:19]([H:47])([H:48])[H:49])[N@@:23]2[C:15]([C@@:17]3([C:7](=[O:25])[N:24]([C:13]([C:8]([C:11]3([H:33])[H:34])([H:27])[H:28])([H:37])[H:38])[C:20]([H:50])([H:51])[C:16]4([C:9]([C:10]4([H:31])[H:32])([H:29])[H:30])[H:43])[C:12]([C:14]2([H:39])[H:40])([H:35])[H:36])([H:41])[H:42])[C:1]#[N:21])[C:18]([H:44])([H:45])[H:46]
[H:21][c:1]1[n:14][c:7]([c:2]2[c:8]([n:15]1)[O:20][N:16]=[C:4]2[C:9]([H:22])([H:23])[H:24])[N@:18]([C:12]([H:31])([H:32])[H:33])[C:13]([H:34])([H:35])[C:3]3=[C:6]([O:19][N:17]=[C:5]3[C:10]([H:25])([H:26])[H:27])[C:11]([H:28])([H:29])[H:30]
[H:23][c:1]1[c:2]([c:4]([n:16][c:5]([c:3]1[H:25])[N@:17]2[C:12]([C@:13]3([C:6](=[O:19])[N:18]([C:10]([C:7]([C:8]3([H:28])[H:29])([H:26])[H:27])([H:32])[H:33])[C:14]([H:38])([H:39])[H:40])[C:9]([C:11]2([H:34])[H:35])([H:30])[H:31])([H:36])[H:37])[C:15]([F:20])([F:21])[F:22])[H:24]
[H:26][c:2]1[c:3]([c:6]([n:22][c:5]([c:4]1[C:18]([H:44])([H:45])[H:46])[C:19]([H:47])([H:48])[H:49])[N@@:23]2[C:15]([C@@:17]3([C:7](=[O:25])[N:24]([C:13]([C:8]([C:11]3([H:33])[H:34])([H:27])[H:28])([H:37])[H:38])[C:20]([H:50])([H:51])[C:16]4([C:9]([C:10]4([H:31])[H:32])([H:29])[H:30])[H:43])[C:12]([C:14]2([H:39])[H:40])([H:35])[H:36])([H:41])[H:42])[C:1]#[N:21]
[H:25][c:1]1[c:2]([c:8]([c:12]([c:9]([c:3]1[H:27])[H:33])[N@@:22]([C:20]([H:44])([H:45])[H:46])[C:21]([H:47])([H:48])[c:11]2[c:6]([c:4]([c:10]([c:5]([c:7]2[H:31])[H:29])[C@:18]3([C@@:19]([C:17]([C:15]([C:14]([C:16]3([H:38])[H:39])([H:34])[H:35])([H:36])[H:37])([H:40])[H:41])([H:43])[C:13](=[O:23])[O:24][H:49])[H:42])[H:28])[H:30])[H:32])[H:26]
[H:25][c:1]1[c:2]([c:8]([c:12]([c:9]([c:3]1[H:27])[H:33])[N@@:22]([C:20]([H:44])([H:45])[H:46])[C:21]([H:47])([H:48])[c:11]2[c:6]([c:4]([c:10]([c:5]([c:7]2[H:31])[H:29])[C@:18]3([C@@:19]([C:17]([C:15]([C:14]([C:16]3([H:38])[H:39])([H:34])[H:35])([H:36])[H:37])([H:40])[H:41])([H:43])[C:13](=[O:24])[O-:23])[H:42])[H:28])[H:30])[H:32])[H:26]
[H:20][c:1]1[c:3]([n:16][c:7]([c:5]2[c:6]1[O:19][C:4](=[C:2]2[H:21])[H:23])[N@@:18]([C:12]([H:32])([H:33])[H:34])[C:15]([H:39])([H:40])[C:13]([H:35])([H:36])[C:14]([H:37])([H:38])[N:17]3[C:10]([C:8]([C:9]([C:11]3([H:30])[H:31])([H:26])[H:27])([H:24])[H:25])([H:28])[H:29])[H:22]
[H:20][c:1]1[c:3]([n:16][c:7]([c:5]2[c:6]1[O:19][C:4](=[C:2]2[H:21])[H:23])[N@@:17]([C:12]([H:32])([H:33])[H:34])[C:14]([H:37])([H:38])[C:13]([H:35])([H:36])[C:15]([H:39])([H:40])[N+:18]3([C:10]([C:8]([C:9]([C:11]3([H:30])[H:31])([H:26])[H:27])([H:24])[H:25])([H:28])[H:29])[H:41])[H:22]
[H:11][c:1]1[c:2]([c:4]([n:9][c:5]([c:3]1[H:13])[N@:10]([C:7]([H:18])([H:19])[H:20])[C:8]([H:21])([H:22])[C:6]([H:15])([H:16])[H:17])[H:14])[H:12]
[H:16][c:1]1[c:3]([c:7]([c:11]([c:8]([c:4]1[H:19])[H:23])[C:14]([H:29])([H:30])[N@:15]([c:12]2[c:9]([c:5]([c:2]([c:6]([c:10]2[H:25])[H:21])[H:17])[H:20])[H:24])[C:13]([H:26])([H:27])[H:28])[H:22])[H:18]

ntBre commented 1 year ago

I tried adjusting some of these settings to no avail. I tried slapping some zeros on things (SetEnergyWindow(1500.0), SetRMSThreshold(100.0)), setting both of these options to zero, and even commenting out all of the omega.* calls hoping to get the default values. It appears Omega just returns a bool saying whether or not it succeeded, so there's not much more that can be done in the toolkit.

mattwthompson commented 1 year ago

I don't actually know what the warning means, or how severe it is in OpenEye's jargon

Warning: : Failed to build structure from CT

except that CT is the connection table

j-wags commented 1 year ago

~CT probably means "carbon, tetrahedral" or something like that. I see it a lot in atom-typing-land.~ (Whoops, I spoke too quickly. Disregard this part but the rest of the comment stands)

Glancing at this quickly I don't see anything hugely obvious. My remaining possibilities are roughly 50% this being a representation/molecule sanitization issue with us that OE has started policing (maybe we're doing something bad in to_openeye), and 50% this being a regression in OE.

If we can make a reproducing case with an un-mapped version of one of these SMILES using pure OpenEye code, then it'll both be clear that it's a regression on their side, and we'll have a repro example we can immediately send to them. I don't have time to do this today but if either of you do, this is the route I'd recommend for debugging!

j-wags commented 1 year ago

Oh, actually, I take back the CT thing. It probably does mean connection table in this context.

ntBre commented 1 year ago

Here are the un-mapped SMILES. I'll try to reproduce with just OpenEye code.

[H]C1(C2=C(N=C(N2C([H])([H])[H])N3C(C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])[N@](C([N@]1C([H])([H])[H])([H])[H])C([H])([H])[H])[H]
[H]C1(C2=C(N=C(N2C([H])([H])[H])N3C(C(N(C(C3([H])[H])([H])[H])C([H])([H])[H])([H])[H])([H])[H])[N@](C(N1[H])([H])[H])C([H])([H])[H])[H]
[H]c1c(nc(c(n1)[N@]2C([C@@]([C@](C2([H])[H])(C([H])([H])[H])O[H])([H])C([H])([H])[H])([H])[H])N(C([H])([H])[H])C([H])([H])[H])[H]
[H]c1c(c(c(nc1C([H])([H])[H])[N@@]2C([C@@]3(C(=O)N(C(C(C3([H])[H])([H])[H])([H])[H])C([H])([H])C4(C(C4([H])[H])([H])[H])[H])C(C2([H])[H])([H])[H])([H])[H])C#N)C([H])([H])[H]
[H]c1nc(c2c(n1)ON=C2C([H])([H])[H])[N@](C([H])([H])[H])C([H])([H])C3=C(ON=C3C([H])([H])[H])C([H])([H])[H]
[H]c1c(c(nc(c1[H])[N@]2C([C@]3(C(=O)N(C(C(C3([H])[H])([H])[H])([H])[H])C([H])([H])[H])C(C2([H])[H])([H])[H])([H])[H])C(F)(F)F)[H]
[H]c1c(c(nc(c1C([H])([H])[H])C([H])([H])[H])[N@@]2C([C@@]3(C(=O)N(C(C(C3([H])[H])([H])[H])([H])[H])C([H])([H])C4(C(C4([H])[H])([H])[H])[H])C(C2([H])[H])([H])[H])([H])[H])C#N
[H]c1c(c(c(c(c1[H])[H])[N@@](C([H])([H])[H])C([H])([H])c2c(c(c(c(c2[H])[H])[C@]3([C@@](C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])([H])C(=O)O[H])[H])[H])[H])[H])[H]
[H]c1c(c(c(c(c1[H])[H])[N@@](C([H])([H])[H])C([H])([H])c2c(c(c(c(c2[H])[H])[C@]3([C@@](C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])([H])C(=O)[O-])[H])[H])[H])[H])[H]
[H]c1c(nc(c2c1OC(=C2[H])[H])[N@@](C([H])([H])[H])C([H])([H])C([H])([H])C([H])([H])N3C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])[H]
[H]c1c(nc(c2c1OC(=C2[H])[H])[N@@](C([H])([H])[H])C([H])([H])C([H])([H])C([H])([H])[N+]3(C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])[H])[H]
[H]c1c(c(nc(c1[H])[N@](C([H])([H])[H])C([H])([H])C([H])([H])[H])[H])[H]
[H]c1c(c(c(c(c1[H])[H])C([H])([H])[N@](c2c(c(c(c(c2[H])[H])[H])[H])[H])C([H])([H])[H])[H])[H]

ntBre commented 1 year ago

This fails with 2023.1.0 and works with 2022.1.1:

from openeye import oechem, oeomega

smiles = "[H]C1(C2=C(N=C(N2C([H])([H])[H])N3C(C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])[N@](C([N@]1C([H])([H])[H])([H])[H])C([H])([H])[H])[H]"

# steps taken from
# https://docs.eyesopen.com/toolkits/python/oechemtk/molctordtor.html#construction-from-smiles
mol = oechem.OEMol()
assert oechem.OESmilesToMol(mol, smiles)

omega = oeomega.OEOmega()
assert omega(mol)

ntBre commented 1 year ago

And checking all of them with the SMILES pasted into smiles.dat:

from openeye import oechem, oeomega

win = 0
with open('smiles.dat') as inp:
    for smiles in inp:
        mol = oechem.OEMol()
        assert oechem.OESmilesToMol(mol, smiles[:-1])

        omega = oeomega.OEOmega()
        win += omega(mol)

print(win)

prints 13 for 2022.1.1 and 0 for 2023.1.0.

mattwthompson commented 1 year ago

It looks like each molecule contains a chiral nitrogen bound to an aromatic ring. Most of these are bound to the carbon between nitrogens in pyrimidine, but this one doesn't have that

C1(=C(C(=C(C(=C1[H])[H])N(C([H])([H])[H])C([H])([H])C2=C(C(=C(C(=C2[H])[H])[C@H]3[C@H](C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])C(=O)O[H])[H])[H])[H])[H])[H]

I'd expect much more than 13 structures (of a few thousand?) to have a nitrogen near a ring, at least with my experience in these datasets I get the impression that aromatic rings are ubiquitous and chiral nitrogens are common

ntBre commented 1 year ago

I'm still running the benchmark to see if this is the root cause of the differences I've observed between the original Sage 2.1.0 fit and my attempt to reproduce it. I expected there to be more than 13 molecules too, but those are the only records I found in the opt-set-for-fitting-2.1.0.json in the Sage-2.1.0 repo. There are 5580 records therein, but only 1701 when converting the list of record.cmiles entries to a set. So it's something like 13/1701 I guess.

This must not account for all of the difference I've seen because I have 23 fewer opt-geo batches than the Sage 2.1.0 repo, which should contain ~600 structures total. But this conformer generation error definitely prevented me from running ForceBalance on the Sage 2.1.0 inputs with my environment.

j-wags commented 1 year ago

chiral nitrogens are common

I wonder if this is the issue. "Which nitrogens are chiral?" is an open question and different cheminformatics toolkits don't always agree. In particular, some recent omega releasenotes say:

Conformer generation with OEOmega.Build or OEMolBuilder.Build now fails correctly when specified stereo signatures on a molecule cannot be honored. The existing behavior, to carry on even if stereo signatures could not be honored, can be obtained by setting IgnoreStereo to true.

So maybe it's that Omega is getting strict here - for example N@ and N@@ must now be honored in the output, but Omega's internal FF/geometry generator for some reason always makes those planar, so it can never match either stereochemistry.

The releasenotes mention that we can set the (ominously-named) IgnoreStereo to True to recover the old behavior. Could you give that a shot?

ntBre commented 1 year ago

I'm not sure if I'm using the Builder correctly, but this does fail to run on 2022.1.1 for AttributeError: 'OEMolBuilderOptions' object has no attribute 'SetIgnoreStereo', so that change must have occurred between the two.

from openeye import oechem, oeomega

win = 0
with open("smiles.dat") as inp:
    for smiles in inp:
        mol = oechem.OEMol()
        assert oechem.OESmilesToMol(mol, smiles[:-1])

        builder = oeomega.OEMolBuilder()
        options = oeomega.OEMolBuilderOptions()
        options.SetIgnoreStereo(True)
        builder.SetOptions(options)

        assert builder.Build(mol) == oeomega.OEOmegaReturnCode_Success

        omega = oeomega.OEOmega()
        assert omega.Build(mol) == oeomega.OEOmegaReturnCode_FailedCTBuild
        win += omega(mol)

print(win)

Edit: This fails for every conformer in the 2023 version too. Sorry I forgot to mention that originally! I brought up the 2022 performance to show that I was using new code (SetIgnoreStereo) but forgot to say that it doesn't change the 2023 result (still fails to generate conformers, as seen in the return code assertion).

I have also now tried the latest release

mamba create -n openeye-2023.1.1 -c openeye 'openeye-toolkits=2023.1.1' openff-qcsubmit

with the same output:

Warning: : Failed to build structure from CT
Warning: : Failed to build structure from CT
Warning: : Failed to build structure from CT
Warning: : Failed to build structure from CT
Warning: : Failed to build structure from CT
Warning: : Failed to build structure from CT
Warning: : Failed to build structure from CT
Warning: : Failed to build structure from CT
Warning: : Failed to build structure from CT
Warning: : Failed to build structure from CT
Warning: : Failed to build structure from CT
Warning: : Failed to build structure from CT
Warning: : Failed to build structure from CT
0

davidlmobley commented 1 year ago

I wonder if this is an OpenEye support issue; if the OpenEye toolkit previously could generate conformers for these but now it cannot, is that on their end?

ntBre commented 1 year ago

I updated my last comment with some additional details and just sent an email to OpenEye support!

ntBre commented 9 months ago

Should I close this? Based on the response from OpenEye support, it sounds like a known/expected change from their perspective. And unless we want to propagate the OEMolBuilderOptions or OEOmegaOptions in their code examples, it's not really a problem with our toolkit.

I'm looking through some of my opened issues today and thought I would close some if they are no longer relevant. I'll leave it if it's useful to have around, though.

mattwthompson commented 9 months ago

I think so; if we're not likely to do anything about it and if it's too esoteric to document here, there is no reason to keep this open

j-wags commented 2 months ago

@amcisaac just asked whether we should be using OE 2022 for our work since it can charge more molecules. This is a complex topic, and while previous discussion talked about the CAUSE of the issue (and concluded that it's intentional from OE's end), it didn't resolve what we should DO about it (like, does this have a big enough impact on FF fitting/application that we should implement a workaround)

I think the discussion above this can be recapped as

The OpenEyeToolkitWrapper using OE 2023 and newer can't assign charges to some molecules because of changes in Omega.
These changes are due to newer versions of Omega being stricter about expecting stereochemistry defined on some nitrogens
OE sees these changes as an improvement and doesn't plan to roll them back
@ntBre found that the workaround OE suggested for recovering the old behavior (SetIgnoreStereo(True)) doesn't work (?)
The issue was closed because this is an intentional change in OE

This may have been discussed elsewhere, but I wasn't able to easily find it, so I'd like to confirm on this issue:

Does SetIgnoreStereo(True) really not work to recover the old behavior? (I think @ntBre's original comment indicates this doesn't work)
Does this change affect FF fitting significantly?
If so, are the changes to FF fitting clearly good/bad?
Should we have our infrastructure provide the old behavior?

Kinda a messy science/infra crossover question here. I'd be interested in thoughts from @lilyminium @amcisaac @ntBre on the science impact. Could be as simple as "doesn't really affect fitting, not a big deal", but it'd just be good to have that written down somewhere.

lilyminium commented 2 months ago

Thanks @j-wags for the summary. My current thoughts are below where I think I've talked myself into upgrading, but very keen to hear @amcisaac and @ntBre's takes on it too.

Generally I'm in favour of keeping up-to-date with packages
OpenEye Omega should fail when stereochemistry cannot be honoured and the output conformer might not match the input, so IMO that is an improvement.
Only generating planar conformers is difficult, but a quick check of RDKit indicates their conformer generation algorithm also generates planar geometries for the molecules @ntBre highlighted, so not an easy fallback
We have plenty of chiral Ns bound to aromatic rings that do pass, but a quick non-systematic skim indicates they don't have stereochemistry specified in the smiles

Does SetIgnoreStereo(True) really not work to recover the old behavior? (I think @ntBre's original comment indicates this doesn't work)

When OpenEye support got back to us they also suggested enumerating nitrogens -- does that recover the old behaviour in this case as well?

Does this affect FF fitting significantly? If so, are the changes to FF fitting clearly good/bad?

Assigning charges from potentially inverted geometries with OpenEye 2022, depending on how the AM1 optimization goes, is suboptimal from a quality standpoint. From a consistency standpoint however I assume that's currently what happens when we don't specify stereochemistry with the other molecules, including with OE 2023
Eliminating all molecules with this group also means possibly removing a significant group of chemistry
Also, IIRC the aromatic ring would typically lower the barrier to inversion
For this general group I think I've talked myself into a position that using charges from planar / N with inverted stereochemistry shouldn't be too big a difference to our current workflow, but very open to other takes. One experiment we could do is to calculate single-conf AM1BCC charges from input geometries with stereochemistries to see how much charges differ, but with our current fitting workflow we are working off omega geometries.
Whether we require these particular molecules in FF training data (and therefore need a workaround) depends how unique they are, or we can use other similar data where the SMILES just hasn't specified stereochemistry -- @ntBre or @amcisaac might you have a feel for this already?

Should we have our infrastructure provide the old behavior?

Could one way to do this be to fall back on RDKit conformer generation when OpenEye refuses?

ntBre commented 2 months ago

I don't have very strong feelings about this, but I would lean toward just updating. OpenEye support confirmed that it was intended behavior, and we can't really pin to 2022 forever. This affects a small number of molecules (that sound pretty weird anyway), so I don't think the net effect on fitting is substantial if even noticeable. I think I opened this when I was still debugging why I couldn't reproduce the Sage 2.1 fit and thought this might be a contributing factor, but that turned out to be a different issue.

j-wags commented 2 months ago

Could one way to do this be to fall back on RDKit conformer generation when OpenEye refuses?

I think this would be a tough solution since the current design philosophy is that the OpenFF Toolkit can run with RDKit OR OpenEye, so adding some dependence (even optional) between them will increase complexity and diminish reproducibility.

amcisaac commented 2 months ago

In my opinion I think we have to upgrade, as others have said we can't pin to a 2022 version forever, and I suspect people may not know to pin to it or may forget why it is pinned and try to unpin it to resolve other conda incompatibilities (like me 😅 which is why I re-opened it ).

I don't have a good sense about whether these molecules are unique or how it would affect training. If we are worried that the upgrade would be eliminating a large chunk of chemistry, I would be happy to assess how similar these molecules are to others in our training set. If it would be eliminating some unique molecules that we do want to keep, we could always generate new data for those functional groups that don't have the chiral N problem (unless the chiral N is inherent to those chemistries, I suppose...).

j-wags commented 2 months ago

Some good news - I've tested @ntBre's 13 unique molecules in the latest version of OpenEye and they appear to work.

OpenEye Toolkits 2022.1: All 13 pass
OpenEye Toolkits 2023.1: All fail
OpenEye Toolkits 2024.1: All 13 pass
OpenEye Toolkit 2024.1, called through OFFTK's OpenEyeToolkitWrapper: All 13 pass

So I think we just shouldn't use OpenEye Toolkits 2023 for fitting work, and that this issue can be closed :-)

lilyminium commented 2 months ago

Great, totally happy to upgrade -- thanks @j-wags for looking into it!

amcisaac commented 1 month ago

I'm having (maybe) the same issue for the SMILES [H][C@]12[C@](C1([H])C(=O)N3C(C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])([C@]4(C([C@@]2(C(C4([H])[H])([H])[H])[H])([H])[H])[H])[H] with openeye-toolkits 2024.1.1 that didn't have any errors with 2022.1. I don't see a chiral nitrogen in this SMILES string, so it's possible it's a different conformer generation issue. This is from the industry benchmark dataset. Just giving a heads up that anyone doing benchmarking using the Sage 2.2.0 dataset may want to keep pinning to 2022 Open Eye. In the mean time, I'll do another filter and see how many other molecules are affected there (this is just the one that caused my run to crash).

ValueError: No registered toolkits can provide the capability "assign_partial_charges" for args "()" and kwargs "{'molecule': Molecule with name '' and SMILES '[H][C@]12[C@](C1([H])C(=O)N3C(C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])([C@]4(C([C@@]2(C(C4([H])[H])([H])[H])[H])([H])[H])[H])[H]', 'partial_charge_method': 'am1bccelf10', 'use_conformers': None, 'strict_n_conformers': False, 'normalize_partial_charges': True, '_cls': <class 'openff.toolkit.topology.molecule.Molecule'>}"
Available toolkits are: [ToolkitWrapper around OpenEye Toolkit version 2024.1.1, ToolkitWrapper around The RDKit version 2024.03.5, ToolkitWrapper around AmberTools version 23.6, ToolkitWrapper around Built-in Toolkit version None]
 ToolkitWrapper around OpenEye Toolkit version 2024.1.1 <class 'openff.toolkit.utils.exceptions.ConformerGenerationError'> : OpenEye Omega conformer generation failed
 ToolkitWrapper around The RDKit version 2024.03.5 <class 'openff.toolkit.utils.exceptions.ChargeMethodUnavailableError'> : partial_charge_method 'am1bccelf10' is not available from RDKitToolkitWrapper. Available charge methods are {'mmff94': {}, 'gasteiger': {}}
 ToolkitWrapper around AmberTools version 23.6 <class 'openff.toolkit.utils.exceptions.ChargeMethodUnavailableError'> : partial_charge_method 'am1bccelf10' is not available from AmberToolsToolkitWrapper. Available charge methods are {'am1bcc': {'antechamber_keyword': 'bcc', 'min_confs': 1, 'max_confs': 1, 'rec_confs': 1}, 'am1-mulliken': {'antechamber_keyword': 'mul', 'min_confs': 1, 'max_confs': 1, 'rec_confs': 1}, 'gasteiger': {'antechamber_keyword': 'gas', 'min_confs': 0, 'max_confs': 0, 'rec_confs': 0}}
 ToolkitWrapper around Built-in Toolkit version None <class 'openff.toolkit.utils.exceptions.ChargeMethodUnavailableError'> : Partial charge method "am1bccelf10"" is not supported by the Built-in toolkit. Available charge methods are {'zeros': {'rec_confs': 0, 'min_confs': 0, 'max_confs': 0}, 'formal_charge': {'rec_confs': 0, 'min_confs': 0, 'max_confs': 0}}

amcisaac commented 1 month ago

After re-filtering the benchmark dataset, it looks like just this one molecule was affected, which has 6 conformers. The mapped SMILES that is output by our charge filtering script is [C:1]1([H:17])([H:18])[C:2]([H:19])([H:20])[C:6]([H:27])([H:28])[N:15]([C:14]([C@:13]2([H:37])[C@@:11]3([H:35])[C@@:9]4([H:33])[C:4]([H:23])([H:24])[C:5]([H:25])([H:26])[C@:10]([H:34])([C:8]4([H:31])[H:32])[C@:12]32[H:36])=[O:16])[C:7]([H:29])([H:30])[C:3]1([H:21])[H:22] which appears to generate an identical visualization as the above SMILES string. My intuition would be that it's not a big deal to upgrade based on this, as it's just one molecule out of ~10,000 and doesn't appear to have any particularly unique chemistry, but we do need to re-filter our benchmark dataset.