openforcefield / yammbs

Internal tool for benchmarking force fields
MIT License
3 stars 1 forks source link

Inconsistent atom order between `Molecule.from_mapped_smiles` and `Molecule.from_inchi` affecting TFD results #45

Closed amcisaac closed 6 months ago

amcisaac commented 7 months ago

While comparing the YAMMBS benchmark of Sage 2.1 to our old benchmarks from the Sage 2.1 release, I noticed the TFD's are very off, and notably some of the YAMMBS TFD results are greater than 1, which I believe is supposed to be impossible.

Screen Shot 2024-05-06 at 5 37 15 PM

Screen Shot 2024-05-06 at 5 37 25 PM

The worst offender is QCAID 43421364. Here, the old Sage 2.1 benchmarks give 0.085 for the TFD, whereas YAMMBS gives 2.07.

I believe this is coming from the molecule creation in line 721, in store.get_tfd here

When I create a molecule from the InChI key (as in the YAMMBS code), I get a different order of atoms than when I create it from the mapped smiles, although both were generated by/retrieved from the SQLITE by YAMMBS:

from openff.toolkit import Molecule

inchi_molecule = Molecule.from_inchi(inchi_key, allow_undefined_stereo=True)
smiles_molecule = Molecule.from_mapped_smiles(smiles,allow_undefined_stereo=True)
inchi_molecule.atoms == smiles_molecule.atoms # outputs False

For all of the molecules in my sqlite database, the atom order between the InChI key and mapped smiles are inconsistent.

If I use the smiles_molecule to calculate TFD, I get ~0.085, in agreement with the old benchmarking code:

from yammbs.analysis import get_tfd

smiles_tfd = get_tfd(smiles_molecule, qm_conf, mm_conf) # 0.085
inchi_tfd = get_tfd(inchi_molecule,qm_conf,mm_conf) # 2.07

If I add the QM geometry and visualize it, it appears that the smiles_molecule has the correct atom order/connectivity, whereas the inchi_molecule is not correct:

inchi_molecule.add_conformer(qm_conf)
inchi_molecule.visualize()

Screen Shot 2024-05-06 at 5 24 52 PM

smiles_molecule.add_conformer(qm_conf)
smiles_molecule.visualize()

Screen Shot 2024-05-06 at 5 24 46 PM

This suggests to me that the unphysical value of TFD is coming from the inconsistent atom ordering, and I'd propose using the mapped SMILES to create the molecule instead of the InChI key:

for smiles in store.get_smiles():
    molecule = Molecule.from_mapped_smiles(smiles, allow_undefined_stereo=True)
    molecule_id = store.get_molecule_id_by_smiles(smiles)

I'd be happy to make a PR to make this change if it is this simple, but it seems like something that might have cascading effects or indicate a problem somewhere else.

This shouldn't affect the ddE as it looks like that never creates a Molecule. However, both the RMSD and ICRMSD define their Molecule objects the same way using the InChI key. I wouldn't expect it to affect the RMSD, as it looks like the Molecule is mostly ignored and the RMSD is calculated from the geometry arrays directly. I do think it should affect the ICRMSDs, as those appear to use the Molecule object directly. When I tested it out, I got slightly different ICRMSD values if I used a molecule created from SMILES vs InChI key, but the effect was much smaller (e.g. 0.013 A vs 0.014 A error in bond length for this sample molecule).

Structures and other info required to reproduce: InChI key (generated by YAMMBS from my Sage 2.1 SQLITE file) inchi_key = 'InChI=1/C20H20ClFN2O4/c1-28-17-4-2-13(3-5-17)12-24-7-6-20(27,19(24)26)18(25)23-11-14-8-15(21)10-16(22)9-14/h2-5,8-10,27H,6-7,11-12H2,1H3,(H,23,25)/t20-/m0/s1/f/h23H'

Mapped SMILES (generated by YAMMBS from my Sage 2.1 SQLITE file) smiles = '[H:33][c:5]1[c:4]([c:3]([c:28]([c:27]([c:6]1[C:7]([H:34])([H:35])[N:8]2[C:25](=[O:26])[C@:11]([C:10]([C:9]2([H:36])[H:37])([H:38])[H:39])([C:13](=[O:14])[N:15]([H:41])[C:16]([H:42])([H:43])[c:17]3[c:18]([c:19]([c:21]([c:22]([c:24]3[H:46])[Cl:23])[H:45])[F:20])[H:44])[O:12][H:40])[H:47])[H:48])[O:2][C:1]([H:29])([H:30])[H:31])[H:32]'

Conformers:


import numpy as np
from openff.units import unit

qm_conf = unit.Quantity(np.array([[-3.76495674e+00,  4.28662539e+00, -1.67210996e+00],
       [-2.53719841e+00,  4.65648759e+00, -1.05213185e+00],
       [-1.75871491e+00,  3.66004938e+00, -5.29789456e-01],
       [-2.07746549e+00,  2.29739661e+00, -5.68842398e-01],
       [-1.19662590e+00,  1.36769418e+00,  2.76352361e-03],
       [-1.00870533e-02,  1.76976772e+00,  6.15193994e-01],
       [ 9.53828963e-01,  7.69277568e-01,  1.21523896e+00],
       [ 2.21997088e+00,  7.26101883e-01,  4.92242316e-01],
       [ 2.34101491e+00,  1.83325563e-01, -8.63606932e-01],
       [ 3.73013021e+00,  6.71852810e-01, -1.32689876e+00],
       [ 4.50221927e+00,  8.99734273e-01, -1.82715390e-02],
       [ 5.39817584e+00,  2.00153048e+00, -4.97209655e-02],
       [ 5.19198131e+00, -3.94094972e-01,  4.61032925e-01],
       [ 4.58792533e+00, -1.46829707e+00,  4.56562731e-01],
       [ 6.46067015e+00, -2.23891873e-01,  9.07932764e-01],
       [ 7.17981161e+00, -1.25981032e+00,  1.62296890e+00],
       [ 7.27033975e+00, -1.00803604e+00,  3.12087121e+00],
       [ 8.32659145e+00, -1.57989347e+00,  3.84190473e+00],
       [ 8.38200486e+00, -1.38746329e+00,  5.21687311e+00],
       [ 9.40676713e+00, -1.94075352e+00,  5.91329983e+00],
       [ 7.43561822e+00, -6.43610821e-01,  5.91348776e+00],
       [ 6.39678847e+00, -8.09063113e-02,  5.17034372e+00],
       [ 5.18041585e+00,  8.66111263e-01,  6.00558129e+00],
       [ 6.29907795e+00, -2.50443652e-01,  3.78789155e+00],
       [ 3.37623578e+00,  1.19457794e+00,  9.92350818e-01],
       [ 3.57439178e+00,  1.79854214e+00,  2.05066959e+00],
       [ 2.95677361e-01,  3.14263845e+00,  6.45368179e-01],
       [-5.62621022e-01,  4.07787971e+00,  8.21143256e-02],
       [-4.22218317e+00,  5.21822557e+00, -2.00793009e+00],
       [-4.43632527e+00,  3.78742387e+00, -9.62494048e-01],
       [-3.59535878e+00,  3.63300754e+00, -2.53681394e+00],
       [-2.99191283e+00,  1.94659816e+00, -1.03355083e+00],
       [-1.45178578e+00,  3.10440581e-01, -3.40149112e-02],
       [ 1.19848409e+00,  1.03046427e+00,  2.24802283e+00],
       [ 5.17290520e-01, -2.35592323e-01,  1.21349370e+00],
       [ 2.28754151e+00, -9.08764607e-01, -8.33749396e-01],
       [ 1.53131982e+00,  5.71023146e-01, -1.48905229e+00],
       [ 4.21496664e+00, -4.40696781e-02, -1.99181582e+00],
       [ 3.64866066e+00,  1.63608424e+00, -1.83471498e+00],
       [ 5.20005945e+00,  2.52400289e+00,  7.53620679e-01],
       [ 6.83309949e+00,  7.17548436e-01,  8.57133234e-01],
       [ 6.64401401e+00, -2.19523842e+00,  1.43827236e+00],
       [ 8.18606105e+00, -1.36663368e+00,  1.20454764e+00],
       [ 9.09880134e+00, -2.16689696e+00,  3.35242944e+00],
       [ 7.51564839e+00, -5.10026677e-01,  6.98659793e+00],
       [ 5.47676862e+00,  2.08418056e-01,  3.24900210e+00],
       [ 1.21610478e+00,  3.47397977e+00,  1.12069094e+00],
       [-3.35722031e-01,  5.13981423e+00,  1.06205102e-01]]),unit.angstrom)

mm_conf = unit.Quantity(np.array([[-2.93941468,  4.94194425, -2.2750501 ],
       [-1.98040127,  5.10584468, -1.21860862],
       [-1.46887625,  3.94297119, -0.6421446 ],
       [-1.81079546,  2.63056528, -1.04601861],
       [-1.24014551,  1.51765026, -0.40385674],
       [-0.32889163,  1.69992556,  0.6520173 ],
       [ 0.31403414,  0.50625809,  1.32568923],
       [ 1.68935965,  0.31929348,  0.81382202],
       [ 1.99137823, -0.24736546, -0.50929058],
       [ 3.3806262 ,  0.34749953, -0.83300547],
       [ 4.03345827,  0.66538033,  0.53739156],
       [ 4.82371158,  1.88866114,  0.45577577],
       [ 4.84831115, -0.54168448,  1.05971962],
       [ 4.30232221, -1.55796385,  1.46258335],
       [ 6.22498326, -0.52620046,  0.8919976 ],
       [ 7.11968402, -1.52624042,  1.51632954],
       [ 7.45940874, -1.08779598,  2.92398745],
       [ 8.62253491, -0.33339471,  3.1738528 ],
       [ 8.91346295,  0.10497112,  4.47542091],
       [10.032931  ,  0.82305736,  4.7101999 ],
       [ 8.04029792, -0.19786659,  5.53076437],
       [ 6.87172195, -0.93558213,  5.2857365 ],
       [ 5.79418523, -1.29484226,  6.57927334],
       [ 6.5789169 , -1.37558819,  3.98504298],
       [ 2.81374317,  0.80921233,  1.44903573],
       [ 2.82071897,  1.4837843 ,  2.46958637],
       [ 0.02106704,  3.00204506,  1.04976369],
       [-0.55023492,  4.11130006,  0.40524586],
       [-3.26453669,  5.92087131, -2.62779054],
       [-3.82099067,  4.40486336, -1.92198737],
       [-2.50079462,  4.41144824, -3.12140978],
       [-2.5082229 ,  2.44433491, -1.8489525 ],
       [-1.50633839,  0.51951719, -0.7276208 ],
       [ 0.32586202,  0.65055362,  2.40826422],
       [-0.27266272, -0.39603087,  1.14708273],
       [ 2.04236922, -1.33629068, -0.44877394],
       [ 1.25997399,  0.05142846, -1.26197947],
       [ 3.98198989, -0.33973441, -1.42967176],
       [ 3.24914257,  1.27518784, -1.39276143],
       [ 4.81066541,  2.38210334,  1.31342241],
       [ 6.58803111,  0.3827858 ,  0.60265219],
       [ 6.65613258, -2.5152683 ,  1.52718149],
       [ 8.03022748, -1.62517641,  0.92335316],
       [ 9.30431504, -0.0823512 ,  2.37261632],
       [ 8.26595698,  0.14317298,  6.53152009],
       [ 5.66609988, -1.93074893,  3.80429037],
       [ 0.73460951,  3.15280049,  1.85152844],
       [-0.28242741,  5.1128938 ,  0.71297503]]),unit.angstrom)
mattwthompson commented 6 months ago

Thanks for the thorough report -

I'd propose using the mapped SMILES to create the molecule instead of the InChI key:

Seems like a good idea to me

mattwthompson commented 6 months ago

Which dataset is the molecule from? Is there another dataset that would be useful for testing this against #46?

amcisaac commented 6 months ago

This molecule is from the Industry Benchmark dataset. I can check out your PR and see how it performs on the whole dataset, if you want. I wrote my own function with the code I suggested above (slightly different than what you implemented, but I think it should have the same effect), and it corrected the issue across the whole industry benchmark dataset. It would take a couple of hours to re-compute the TFD's with your PR and see (presumably) the same improvement

image

Alternatively I can look for other molecules in this dataset to use as a case study, or run a benchmark on a smaller and more manageable dataset to use for testing/debugging. Let me know what would be most helpful

mattwthompson commented 6 months ago

If you don't mind using that branch, I'll take you up on that. I figure the chances of it working as hoped are high, but I'd like to see the test results before tagging a release with it included

amcisaac commented 6 months ago

This was the result. Looks like it fixes it, and fully agrees with the old benchmarks. I'm still getting some differences in the ICRMSD (but it's improved compared to using the InChI to create the molecule), so there may be another difference going on there. Screen Shot 2024-05-07 at 5 36 30 PM