openforcefield / cmiles

Generate canonical molecule identifiers for quantum chemistry database
https://cmiles.readthedocs.io
MIT License
23 stars 7 forks source link

Drop implicit H that gets added for negative N and other fixes #20

Closed ChayaSt closed 5 years ago

ChayaSt commented 5 years ago

This PR fixes several issues:

Testing

The laster version of rdkit (2019.03.1) has better handling of stereoisomers and also now considers a N in a three membered ring as stereogenic. (https://github.com/rdkit/rdkit/issues/2268) This broke some tests for canonical isomeric SMILES and InChIs. The older version of rdkit dropped some stereo information for these SMILES:

'C[N@@+]12C[C@@H]([C@@H](CC1)CC2)OC(=O)C(O)(c3ccccc3)c4ccccc4',
'C[C@@]12CC[C@@H](CC1)C(C)(C)O2',
'[H][C@]12C(=O)N(C(=O)[C@@]1([H])[C@@]1(CC[C@@]2([H])CC1)NC(=O)OCC(O)=O)C1=CC=C(NC(C)=O)C=C1',
'C[C@@]12OO[C@@](CCC(O)=O)(C3=CC=CC=C13)C1=C2C=CC=C1',
'CC(N1CCN(CC1)c2ncc(cc2)C(F)(F)F)(C(=O)N[C@H]3[C@@H]4C[C@@]5(CC(C4)C[C@H]3C5)C(=O)N)C',
'CC1(C)[C@H]2CC[C@]1(C)CC2'

3 SMILES that have a N in a three membered ring now behave differently between rdkit and openeye.

  1. CC1=C(C(=O)C2=C(C1=O)[N@@]3C[C@H]4[C@@H]([C@@]3([C@@H]2COC(=O)N)OC)N4)N Mitomycin - The N here has an H bonded to it. OE does not consider it stereogenic (https://github.com/openforcefield/openforcefield/issues/146#issuecomment-490177618)
  2. C1[C@H]2N1C(=O)N=C2N Imexon - Corner case. The N is bridged with the C and therefore the chirality of the N is implicitly defined. RDKit considers it undefined but openeye does not
  3. [H][C@@]1(C)C[N@@]1C1=CC(=O)C2=C(C(CO)=C(N2C)C2=CC=CC=C2)C1=O - This might be a bug in rdkit. It seems like rdkit ignores the stereo information on the N and considers the chirality undefined.

Because of these changes and fixes to the tests, CI environments were changed to start testing against rdkit 2019 and later. The 3 SMILES that behave differently between rdkit and openey were added to a separate test that only tests it against openeye.

ChayaSt commented 5 years ago

The following fixes were added to this PR:

  1. Handle non integer bond orders in connectivity table
  2. Fixed provenance
  3. Do not remove explicit hydrogen when loading an rdkit molecule from an explicit H SMILES (This is not the default behavior in rdkit)
  4. This fix introduced spurious stereo warnings because rdkit considers the the map indices on H when checking for stereo. Removing the map indices and storing them in atom data does not help because that information is used as well.
codecov-io commented 5 years ago

Codecov Report

Merging #20 into master will decrease coverage by 0.5%. The diff coverage is 88.46%.