ncats / lychi

Layered Chemical Identifier
Apache License 2.0

Non-stereo Encoding Problem with Explicit Hydrogens #7

Open tylerperyea opened 10 years ago

tylerperyea commented 10 years ago

In certain cases, explicit hydrogens seem to cause trouble for the atom-labelling layer of the hash. In these cases, the SMILES generated by the standardizer produces a different hash than the input molfile itself.

Consider the following poorly laid-out structure (image: encodeprob; molfile below):

Generating the hash directly from this structure gives the following Std_SMILES:

[H][C@@]12CC3=C(C(O)=C(OC)C(C)=C3)[C@@]([H])(N1C)[C@@]4([H])N([C@H]2O)[C@@]5([H])COC(=O)[C@]8(CS[C@]4([H])C6=C5C7=C(OCO7)C(C)=C6OC(C)=O)NCCC9=C8C=C(OC)C(O)=C9

And this hash:

DCLRH149F-FGAV2BD6PA-FA8DSLTXL4L-FALJX635AFC5

However, when that same SMILES is fed into the standardizer, I get:

DCLRH149F-FFMPLZ16VC-FC1Y2MQMGXU-FCUZ42LBF8VB

If the explicit hydrogens are removed entirely (image: encodeprob2):

The output hash is now compatible with the SMILES.

DCLRH149F-FFMPLZ16VC-FC1Y2MQMGXU-FCU1SY5C8458
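
To see which layers disagree, the three hashes quoted above can be compared block by block. This is a minimal sketch that only assumes each hyphen-separated block is one layer; the function and variable names are hypothetical, and it does not call LyChI itself.

```python
def split_layers(lychi_hash: str) -> list[str]:
    """Split a hash string into its hyphen-separated blocks."""
    return lychi_hash.split("-")


def differing_layers(a: str, b: str) -> list[int]:
    """Return the 1-based positions of the blocks that differ between two hashes."""
    layers_a, layers_b = split_layers(a), split_layers(b)
    if len(layers_a) != len(layers_b):
        raise ValueError("hashes have a different number of blocks")
    return [
        i
        for i, (x, y) in enumerate(zip(layers_a, layers_b), start=1)
        if x != y
    ]


# The three hashes quoted in this issue, in the order they appear.
hash_direct = "DCLRH149F-FGAV2BD6PA-FA8DSLTXL4L-FALJX635AFC5"
hash_from_smiles = "DCLRH149F-FFMPLZ16VC-FC1Y2MQMGXU-FCUZ42LBF8VB"
hash_no_explicit_h = "DCLRH149F-FFMPLZ16VC-FC1Y2MQMGXU-FCU1SY5C8458"

print(differing_layers(hash_direct, hash_from_smiles))         # [2, 3, 4]
print(differing_layers(hash_from_smiles, hash_no_explicit_h))  # [4]
```

Only the first block is shared between the first two hashes. The hash without explicit hydrogens matches the SMILES-derived one in the first three blocks, while the last block still differs as quoted.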

Molfile for explicit hydrogen version:


  Ketcher 12201304332D 1   1.00000     0.00000     0

 59 67  0     1  0            999 V2000
   -2.2321   -1.8660    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -1.7321   -1.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.5981   -0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.5981    0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.4641    1.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.4641    2.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.3301    2.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.5981    2.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.5981    3.5000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -3.4641    4.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.7321    2.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.8660    2.5000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.7321    1.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.8660    0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.4740    1.2647    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -1.7321    0.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -0.9071   -0.4750    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    1.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000   -1.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -0.8660   -1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.8660   -2.5000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.8660   -1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.8561   -2.3746    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    2.4488   -3.1947    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.5544   -3.0234    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.3132   -1.2000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.5741   -1.3179    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.8632    0.2250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.0294    0.9234    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.9488    1.1197    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    0.8660    0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.8811    1.3246    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    1.7321    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5981    0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4244    1.4848    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.6097    2.0768    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7419    1.9858    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7927    2.9165    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    3.4641    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.9301    0.3000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.4641   -1.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2072   -1.6691    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    3.8005   -2.5827    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8060   -2.4781    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.5981   -1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7321   -1.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.6506   -0.2222    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    6.4172    0.1894    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.4966    1.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.8342    1.6954    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.0136    2.6792    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.2512    3.3264    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.4306    4.3102    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.3096    2.9897    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.5473    3.6370    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    3.7266    4.6207    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.1303    2.0060    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.8676    1.3838    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  1     0  0
  2  3  1  0     0  0
  3  4  1  0     0  0
  4  5  1  0     0  0
  5  6  2  0     0  0
  6  7  1  0     0  0
  6  8  1  0     0  0
  8  9  1  0     0  0
  9 10  1  0     0  0
  8 11  2  0     0  0
 11 12  1  0     0  0
 11 13  1  0     0  0
  4 13  2  0     0  0
 13 14  1  0     0  0
 14 15  1  1     0  0
 14 16  1  0     0  0
  2 16  1  0     0  0
 16 17  1  0     0  0
 14 18  1  0     0  0
 18 19  1  1     0  0
 18 20  1  0     0  0
 20 21  1  0     0  0
  2 21  1  0     0  0
 21 22  1  1     0  0
 20 23  1  0     0  0
 23 24  1  1     0  0
 23 25  1  0     0  0
 25 26  1  0     0  0
 26 27  1  0     0  0
 27 28  2  0     0  0
 29 27  1  0     0  0
 29 30  1  6     0  0
 30 31  1  0     0  0
 31 32  1  0     0  0
 18 32  1  0     0  0
 32 33  1  1     0  0
 32 34  1  0     0  0
 34 35  1  0     0  0
 35 36  1  0     0  0
 36 37  1  0     0  0
 37 38  1  0     0  0
 37 39  2  0     0  0
 35 40  2  0     0  0
 40 41  1  0     0  0
 40 42  1  0     0  0
 42 43  1  0     0  0
 43 44  1  0     0  0
 44 45  1  0     0  0
 45 46  1  0     0  0
 42 46  2  0     0  0
 46 47  1  0     0  0
 23 47  1  0     0  0
 34 47  2  0     0  0
 29 48  1  0     0  0
 48 49  1  0     0  0
 49 50  1  0     0  0
 50 51  1  0     0  0
 51 52  1  0     0  0
 52 53  2  0     0  0
 53 54  1  0     0  0
 53 55  1  0     0  0
 55 56  1  0     0  0
 56 57  1  0     0  0
 55 58  2  0     0  0
 58 59  1  0     0  0
 29 59  1  0     0  0
 51 59  2  0     0  0
M  END

caodac commented 10 years ago

This example isn't so much about explicit H's but more about their parities. To fix this we need to be able to define a canonical set of parity flags for the specified stereocenters. This is far from a trivial fix.
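
As a rough way to locate the explicit hydrogens whose wedge bonds carry the stereo information, the sketch below lists every stereo-flagged bond in a V2000 molfile that touches an H atom. It is not part of LyChI. The helper name and the tiny sample molfile are hypothetical, and it assumes the fields are whitespace-separated (true for the molfile in this issue, but not guaranteed by the fixed-column V2000 format for larger molecules).

```python
def wedged_hydrogen_bonds(molfile: str) -> list[tuple[int, int, int]]:
    """List (atom1, atom2, stereo_flag) for stereo bonds touching an explicit H.

    Assumes a V2000 molfile whose fields are separated by whitespace.
    """
    lines = molfile.splitlines()
    counts = lines[3].split()  # the counts line is the 4th line of the file
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atom_lines = lines[4:4 + n_atoms]
    bond_lines = lines[4 + n_atoms:4 + n_atoms + n_bonds]

    # Atom line: x y z symbol ...; the element symbol is the 4th field.
    symbols = [ln.split()[3] for ln in atom_lines]

    result = []
    for ln in bond_lines:
        fields = ln.split()
        a, b = int(fields[0]), int(fields[1])
        # Bond line: atom1 atom2 type stereo ...; in V2000, 1 = up, 6 = down.
        stereo = int(fields[3]) if len(fields) > 3 else 0
        if stereo != 0 and "H" in (symbols[a - 1], symbols[b - 1]):
            result.append((a, b, stereo))
    return result


# Hypothetical three-atom sample, only to exercise the function.
sample = "\n".join([
    "",
    "  sketch",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0",
    "    0.0000    1.0000    0.0000 H   0  0",
    "    1.0000    0.0000    0.0000 O   0  0",
    "  1  2  1  1",
    "  1  3  1  0",
    "M  END",
])

print(wedged_hydrogen_bonds(sample))  # [(1, 2, 1)]
```

Reading the bond block of the molfile in this issue by hand, the wedge bonds to explicit hydrogens are 2-1, 14-15, 18-19, 23-24 and 32-33, all with stereo flag 1.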

tylerperyea commented 10 years ago

It can't just be parity though, because this messes up the atom label layer, not just the stereo layer. If it only had a problem with stereo, that'd be less concerning.

caodac commented 10 years ago

This particular example should now work properly from the recent commit 9b38dbdd97da53e3a4a5e3ba458f7d6784df07e2.