Closed amcisaac closed 6 months ago
Thanks for the thorough report.
I'd propose using the mapped SMILES to create the molecule instead of the InChI key:
Seems like a good idea to me
Which dataset is the molecule from? Is there another dataset that would be useful for testing this against #46?
This molecule is from the Industry Benchmark dataset. I can check out your PR and see how it performs on the whole dataset, if you want. I wrote my own function with the code I suggested above (slightly different from what you implemented, but I think it should have the same effect), and it corrected the issue across the whole Industry Benchmark dataset. It would take a couple of hours to re-compute the TFDs with your PR and see (presumably) the same improvement.
Alternatively, I can look for other molecules in this dataset to use as a case study, or run a benchmark on a smaller and more manageable dataset to use for testing/debugging. Let me know what would be most helpful.
If you don't mind using that branch, I'll take you up on that. I figure the chances of it working as hoped are high, but I'd like to see the test results before tagging a release with it included
This was the result. Looks like it fixes it, and fully agrees with the old benchmarks. I'm still getting some differences in the ICRMSD (but it's improved compared to using the InChI to create the molecule), so there may be another difference going on there.
While comparing the YAMMBS benchmark of Sage 2.1 to our old benchmarks from the Sage 2.1 release, I noticed the TFDs are far off, and notably some of the YAMMBS TFD results are greater than 1, which I believe is supposed to be impossible.
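For context on why a TFD above 1 looks unphysical: in the usual definition, each torsion deviation is wrapped into the range 0 to 180 degrees and divided by 180 before averaging, so the result cannot exceed 1. The sketch below is a minimal, unweighted version under that assumption. It is not the YAMMBS implementation, which may weight or normalize torsions differently, and `wrapped_abs_diff` and `tfd` are hypothetical names.

```python
def wrapped_abs_diff(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    d = abs(a_deg - b_deg) % 360.0
    return 360.0 - d if d > 180.0 else d


def tfd(torsions_a, torsions_b, weights=None):
    """Unweighted (or optionally weighted) torsion fingerprint deviation.

    Assumes both lists hold the SAME torsions in the SAME order, in degrees.
    """
    if len(torsions_a) != len(torsions_b):
        raise ValueError("torsion lists must have the same length")
    if weights is None:
        weights = [1.0] * len(torsions_a)
    total = sum(
        w * wrapped_abs_diff(a, b) / 180.0
        for w, a, b in zip(weights, torsions_a, torsions_b)
    )
    return total / sum(weights)


# Worst case: every torsion is off by a full 180 degrees.
print(tfd([0.0, 180.0], [180.0, 0.0]))
# Small deviations, including one that crosses the -180/180 boundary.
print(tfd([10.0, -170.0], [20.0, 170.0]))
```

Under this definition a value such as 2.07 would suggest that something upstream of the averaging is inconsistent, rather than that the conformers are simply very different.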
The worst offender is QCAID 43421364. Here, the old Sage 2.1 benchmarks give 0.085 for the TFD, whereas YAMMBS gives 2.07.
I believe this is coming from the molecule creation in line 721, in `store.get_tfd`. When I create a molecule from the InChI key (as in the YAMMBS code), I get a different order of atoms than when I create it from the mapped SMILES, although both were generated by/retrieved from the SQLite database by YAMMBS.
For all of the molecules in my SQLite database, the atom order is inconsistent between the molecule created from the InChI key and the one created from the mapped SMILES.
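To illustrate why an inconsistent atom order matters for torsion-based metrics, here is a small self-contained sketch. It computes a dihedral angle from a fixed set of coordinates, once with the torsion indices that match the coordinate order and once with two indices swapped, as would happen if the topology numbered the atoms differently from the stored geometry. The coordinates, the index swap, and the helper names are all hypothetical and are not taken from YAMMBS.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_deg(coords, i, j, k, l):
    """Dihedral angle i-j-k-l in degrees, from a list of (x, y, z) tuples."""
    b1 = _sub(coords[j], coords[i])
    b2 = _sub(coords[k], coords[j])
    b3 = _sub(coords[l], coords[k])
    n1 = _cross(b1, b2)  # normal of the plane through atoms i, j, k
    n2 = _cross(b2, b3)  # normal of the plane through atoms j, k, l
    norm_b2 = math.sqrt(_dot(b2, b2))
    m1 = _cross(n1, tuple(x / norm_b2 for x in b2))
    return math.degrees(math.atan2(_dot(m1, n2), _dot(n1, n2)))


# Hypothetical planar four-atom chain in a trans arrangement.
coords = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 1.0, 0.0)]

torsion_correct = (0, 1, 2, 3)     # indices match the coordinate order
torsion_misordered = (0, 1, 3, 2)  # hypothetical: topology swapped the last two atoms

print(abs(dihedral_deg(coords, *torsion_correct)))
print(abs(dihedral_deg(coords, *torsion_misordered)))
```

With the matching indices the torsion has a magnitude of 180 degrees (trans); with the swapped indices the same coordinates give 0 degrees, so a single misnumbered pair of atoms is enough to produce the maximum possible deviation for that torsion.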
If I use the `smiles_molecule` to calculate TFD, I get ~0.085, in agreement with the old benchmarking code.
If I add the QM geometry and visualize it, it appears that the `smiles_molecule` has the correct atom order/connectivity, whereas the `inchi_molecule` is not correct.
This suggests to me that the unphysical value of TFD is coming from the inconsistent atom ordering, and I'd propose using the mapped SMILES to create the molecule instead of the InChI key.
I'd be happy to make a PR to make this change if it is this simple, but it seems like something that might have cascading effects or indicate a problem somewhere else.
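As background for the proposal, a mapped SMILES carries an explicit index on every atom (the number after the colon), so the intended atom order does not depend on where an atom happens to appear in the string. The sketch below uses only a regular expression to pull out those indices and is meant purely as an illustration. It is a simplified parser, not how YAMMBS or any cheminformatics toolkit reads SMILES, and the function names and the methanol example are hypothetical.

```python
import re

# Bracket atoms with a map index, e.g. [C:1], [c:5], [C@:11], [Cl:23].
# Simplified: two-letter aromatic symbols such as "se" are not handled.
MAPPED_ATOM = re.compile(r"\[([A-Z][a-z]?|[a-z])[^\]:]*:(\d+)\]")


def elements_in_written_order(mapped_smiles):
    """Element symbols in the order the atoms appear in the string."""
    return [m.group(1).capitalize() for m in MAPPED_ATOM.finditer(mapped_smiles)]


def elements_in_map_order(mapped_smiles):
    """Element symbols sorted by their atom map index (1, 2, 3, ...)."""
    pairs = [
        (int(m.group(2)), m.group(1).capitalize())
        for m in MAPPED_ATOM.finditer(mapped_smiles)
    ]
    return [element for _, element in sorted(pairs)]


# Hypothetical example: methanol written with a hydrogen first.
methanol = "[H:4][C:1]([H:5])([H:6])[O:2][H:3]"

print(elements_in_written_order(methanol))
print(elements_in_map_order(methanol))
```

The written order starts with a hydrogen, while the map order starts with the carbon and oxygen. An InChI string has no such per-atom index, which is consistent with the two routes producing different atom orders.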
This shouldn't affect the ddE, as it looks like that never creates a `Molecule`. However, both the RMSD and ICRMSD define their `Molecule` objects the same way, using the InChI key. I wouldn't expect it to affect the RMSD, as it looks like the `Molecule` is mostly ignored and the RMSD is calculated from the geometry arrays directly. I do think it should affect the ICRMSDs, as those appear to use the `Molecule` object directly. When I tested it out, I got slightly different ICRMSD values if I used a molecule created from SMILES vs InChI key, but the effect was much smaller (e.g. 0.013 Å vs 0.014 Å error in bond length for this sample molecule).
Structures and other info required to reproduce:
InChI key (generated by YAMMBS from my Sage 2.1 SQLITE file)
inchi_key = 'InChI=1/C20H20ClFN2O4/c1-28-17-4-2-13(3-5-17)12-24-7-6-20(27,19(24)26)18(25)23-11-14-8-15(21)10-16(22)9-14/h2-5,8-10,27H,6-7,11-12H2,1H3,(H,23,25)/t20-/m0/s1/f/h23H'
Mapped SMILES (generated by YAMMBS from my Sage 2.1 SQLITE file)
smiles = '[H:33][c:5]1[c:4]([c:3]([c:28]([c:27]([c:6]1[C:7]([H:34])([H:35])[N:8]2[C:25](=[O:26])[C@:11]([C:10]([C:9]2([H:36])[H:37])([H:38])[H:39])([C:13](=[O:14])[N:15]([H:41])[C:16]([H:42])([H:43])[c:17]3[c:18]([c:19]([c:21]([c:22]([c:24]3[H:46])[Cl:23])[H:45])[F:20])[H:44])[O:12][H:40])[H:47])[H:48])[O:2][C:1]([H:29])([H:30])[H:31])[H:32]'
Conformers: