ncats / lychi

Layered Chemical Identifier
Apache License 2.0

Meaningless Stereo is sometimes honored #4

Open tylerperyea opened 10 years ago

tylerperyea commented 10 years ago

Some meaningless stereo annotations (wedge/dash bonds) produce a different hash than the same structure drawn without them.

Example 1

Compare:

C[C@H]1OC(C)O[C@@H](C)O1

[image: stereononsense]

WDF2GBCFX-X5KQLPPFPK-XK7RRPGCCM2-XK25W2RXGM3Z

vs

CC1OC(C)OC(C)O1

[image: stereononsense2]

WDF2GBCFX-X5KQLPPFPK-XK7RRPGCCM2-XK23BSZ142DG

In this example, there are 3 stereocenters that could be annotated. However, of the 8 absolute permutations, only 2 are actually unique:

[image: stereononsenseexplained]

Of all 8 possibilities, only 2 are distinct; the others are duplicates obtained by rotating or flipping the ring. In both distinct forms, at least two adjacent methyl groups must be on the same side of the ring, so the information provided by the first structure is self-evident.
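The 8-to-2 collapse can be checked by brute force. The sketch below is only an illustration, not LyChI code. It assumes each methyl-bearing carbon can be modelled as one bit (1 = methyl up, 0 = methyl down) and that two assignments describe the same molecule when one becomes the other by rotating the ring or by turning it over, which reverses the order of the centers and swaps up with down. The names `canonical` and `unique_configs` are hypothetical.

```python
from itertools import product

def canonical(config):
    """Return the smallest tuple equivalent to `config` under ring
    rotation and ring flip (toy model, assumed for illustration)."""
    images = []
    for k in range(len(config)):
        rotated = config[k:] + config[:k]
        images.append(rotated)
        # Turning the ring over reverses the order and swaps up/down.
        images.append(tuple(1 - b for b in reversed(rotated)))
    return min(images)

def unique_configs(n):
    """Distinct up/down assignments for n equivalent ring centers."""
    return sorted({canonical(c) for c in product((0, 1), repeat=n)})

print(unique_configs(3))  # [(0, 0, 0), (0, 0, 1)]
```

The two survivors are the all-same-side form and the two-on-one-side form, and each contains a pair of methyls on the same side.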

The InChI algorithm does handle this specific case (possibly by accident), but it does not handle the general issue, as explained in example 2.

Example 2

Compare:

[C@H](C)1CCC(C)CC1

[image: stereononsense3]

T75RBW5S8-8D9T563A7Y-8YC8NQXD9W5-8Y5APDLVJ782

vs

C(C)1CCC(C)CC1

[image: stereononsense4]

T75RBW5S8-8D9T563A7Y-8YC8NQXD9W5-8Y5VPCVHUV1Z

Again, these should be equivalent: with only one of the two centers marked, the annotation says nothing about cis versus trans. They currently generate different hashes. For reasons I can't imagine, the two above also generate different InChIs. This is especially odd, considering it's a much simpler case of the general problem explained in example 1.

caodac commented 10 years ago

Gorgeous embedded images!

tylerperyea commented 10 years ago

Thanks! I'm probably getting carried away with them...

On the resolution of this, one naive solution is to do the following:

1) Mark all potential non-annotated stereo centers for which R/S is not applicable
2) Generate canonical hashes for all possible absolute permutations at those centers
3) Canonically hash the unique set of possible results with current known R/S configurations
4) Apply this result as the exclusive form of stereochemical encoding in the hash

I believe that would work, in theory ... but a few of the logistics are problematic. Also, there is the potential for a combinatorial explosion in some of these cases. A non-annotated inositol is the worst real case I can think of (with 64 naive permutations), but I'm sure there are worse examples that are still relevant.
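The four steps above can be sketched on a toy model. This is a sketch under assumptions, not the LyChI implementation: centers are bits on a symmetric ring (1 = up, 0 = down, `None` = not annotated), `canonical` stands in for the real canonical hash of a fully specified structure, and SHA-1 is an arbitrary choice for the final hash. All function names are hypothetical.

```python
import hashlib
from itertools import product

def canonical(config):
    """Stand-in for hashing one fully specified structure: the smallest
    tuple reachable by rotating the ring or turning it over."""
    images = []
    for k in range(len(config)):
        rotated = config[k:] + config[:k]
        images.append(rotated)
        # Turning the ring over reverses the order and swaps up/down.
        images.append(tuple(1 - b for b in reversed(rotated)))
    return min(images)

def possible_isomers(partial):
    """Steps 1-2: fill every unannotated center (None) both ways and
    collect the distinct canonical forms."""
    open_positions = [i for i, v in enumerate(partial) if v is None]
    found = set()
    for fill in product((0, 1), repeat=len(open_positions)):
        full = list(partial)
        for i, v in zip(open_positions, fill):
            full[i] = v
        found.add(canonical(tuple(full)))
    return sorted(found)

def stereo_hash(partial):
    """Steps 3-4: hash the sorted set and use it as the stereo layer."""
    text = repr(possible_isomers(partial))
    return hashlib.sha1(text.encode()).hexdigest()[:12]

# Example 1 read as two same-side wedges plus one unmarked center.
assert stereo_hash((1, None, 1)) == stereo_hash((None, None, None))
# Two opposite-side wedges do carry information, so the hash differs.
assert stereo_hash((1, None, 0)) != stereo_hash((None, None, None))
# Example 2: one marked center out of two.
assert stereo_hash((1, None)) == stereo_hash((None, None))
# Six unannotated centers: 64 fills, far fewer distinct results.
print(len(possible_isomers((None,) * 6)))  # 9
```

In this model the 64 fills for six centers reduce to 9 distinct forms, which matches the nine known inositol stereoisomers. All 64 still have to be generated and canonicalized before the set can be hashed, which is the cost the comment is worried about.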