ncats / lychi

Layered Chemical Identifier
Apache License 2.0

E/Z perception on tautomers #5

Open tylerperyea opened 10 years ago

tylerperyea commented 10 years ago

In certain cases, unspecified E/Z information is encoded as known (or known E/Z information is lost) based on tautomer generation.

Example 1

Consider the following two structures, which have the same SMILES but are drawn differently (molfiles at the bottom).

Compare:

CN1C(=O)/C(=N\NC(N)=S)C2=CC=CC=C12

cistrans2

2ATKPHXN6-63AZUWLKFU-6U9JBVHA63M-6UM6PRK6J3GX

vs

CN1C(=O)/C(=N\NC(N)=S)C2=CC=CC=C12

cistrans

2ATKPHXN6-63AZUWLKFU-6U9JBVHA63M-6UMV4H3F7CST

Notice that while the SMILES representations are exactly the same, the structures still get different hashes based on their initial coordinates. This happens because the canonical tautomer has a different E/Z bond location than the one drawn above:

CN1C(=O)/C(=N\NC(N)=S)C2=CC=CC=C12

cistrans3

After selecting the preferred tautomer, E/Z is apparently recalculated based on the original atom coordinates. This leads two apparently identical structures to have different hashes.
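
To make the coordinate dependence concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the tautomer moves the double bond to N7=C8 (atom numbers as in the molfiles posted in this comment); the issue does not state which bond the canonical tautomer actually uses. The helper `side_relation` is a hypothetical name, not part of LyChI. It checks whether N6 and S10 fall on the same side of the N7-C8 axis in each drawing.

```python
def cross(u, v):
    """z-component of the 2D cross product of u and v."""
    return u[0] * v[1] - u[1] * v[0]

def side_relation(a, b, sub_a, sub_b):
    """Return 'cis' if sub_a (attached to a) and sub_b (attached to b) lie on
    the same side of the a-b bond axis in the 2D drawing, else 'trans'.
    Substituents drawn collinear with the bond are not handled."""
    axis = (b[0] - a[0], b[1] - a[1])
    side_a = cross(axis, (sub_a[0] - a[0], sub_a[1] - a[1]))
    side_b = cross(axis, (sub_b[0] - b[0], sub_b[1] - b[1]))
    return "cis" if side_a * side_b > 0 else "trans"

# Coordinates of atoms 6, 7, 8 and 10 copied from the two molfiles.
drawing_1 = {"N6": (3.4534, 1.9612), "N7": (4.1225, 2.7044),
             "C8": (5.1006, 2.4964), "S10": (5.7697, 3.2396)}
drawing_2 = {"N6": (11.2231, 1.9612), "N7": (11.8922, 2.7044),
             "C8": (11.4775, 3.6036), "S10": (10.7180, 3.7039)}

def n6_vs_s10(drawing):
    """Geometry the coordinates would imply for an assumed N7=C8 double bond."""
    return side_relation(drawing["N7"], drawing["C8"],
                         drawing["N6"], drawing["S10"])

print("drawing 1:", n6_vs_s10(drawing_1))  # trans
print("drawing 2:", n6_vs_s10(drawing_2))  # cis
```

The two drawings differ only by a rotation about what is a single bond in the input, yet under this assumption the same bond would be perceived with opposite geometry once it becomes double.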

The resolution to this problem isn't trivial, and is more a shortcoming of valence bond theory than of the encoding in general. This will require a bit of research, and an expert should be consulted. My intuition is that any cis/trans designation should be allowed if (and only if) both involved bonded atoms remain in an sp2 hybridized state across all tautomers (therefore the atoms and their substituents should remain coplanar).

If this is accurate, there is an unfortunate corollary: the preferred tautomer in the above example is either wrong, or should capture cis/trans information about the exocyclic bond, even though it is not explicitly a double bond.

The molfiles for the above structures are posted here for convenience:


  Ketcher 12191320432D 1   1.00000     0.00000     0

 16 17  0     0  0            999 V2000
    0.4048    3.7213    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0739    2.9781    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    2.0684    3.0827    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5684    3.9487    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.4752    2.1691    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.4534    1.9612    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    4.1225    2.7044    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    5.1006    2.4964    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.4096    1.5454    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    5.7697    3.2396    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    1.7321    1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7321    0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.8660    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.8660    2.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0     0  0
  2  3  1  0     0  0
  3  4  2  0     0  0
  3  5  1  0     0  0
  5  6  2  0     0  0
  6  7  1  0     0  0
  7  8  1  0     0  0
  8  9  1  0     0  0
  8 10  2  0     0  0
  5 11  1  0     0  0
 11 12  2  0     0  0
 12 13  1  0     0  0
 13 14  2  0     0  0
 14 15  1  0     0  0
 15 16  2  0     0  0
 16  2  1  0     0  0
 16 11  1  0     0  0
M  END

  Ketcher 12191320472D 1   1.00000     0.00000     0

 16 17  0     0  0            999 V2000
    8.1745    3.7213    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.8436    2.9781    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    9.8381    3.0827    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   10.3381    3.9487    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   10.2449    2.1691    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   11.2231    1.9612    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   11.8922    2.7044    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   11.4775    3.6036    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   12.1079    4.5454    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   10.7180    3.7039    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    9.5018    1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.6357    2.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.7697    1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.7697    0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.6357    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    9.5018    0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0     0  0
  2  3  1  0     0  0
  3  4  2  0     0  0
  3  5  1  0     0  0
  5  6  2  0     0  0
  6  7  1  0     0  0
  7  8  1  0     0  0
  8  9  1  0     0  0
  8 10  2  0     0  0
  5 11  1  0     0  0
 11 12  2  0     0  0
 12  2  1  0     0  0
 12 13  1  0     0  0
 13 14  2  0     0  0
 14 15  1  0     0  0
 15 16  2  0     0  0
 16 11  1  0     0  0
M  END

caodac commented 10 years ago

Yeah, this example highlights the subtle difference between using SMILES and MOL as input. In the latter case, the coordinates are preserved, and any double bond generated by tautomerization takes on whatever configuration the coordinates imply. I guess the question is whether the Z in the input is real or just a byproduct of how it's drawn.

tylerperyea commented 10 years ago

I would tend to say any E/Z configuration that emerges from generated / provided coordinates after tautomer generation should be considered bogus. In fact, I wonder how much things would change if you explicitly ignore E/Z in cases where that bond can become single via a tautomer ... I think that solution is 95% correct, but I'd have to see if there are problem cases...
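
A rough Python sketch of that rule, under stated assumptions: each tautomer is represented as a plain dict mapping a bond (a pair of atom numbers) to its bond order, and the second tautomer below is hypothetical, chosen only to illustrate the idea. The function names are made up for this example and are not LyChI API.

```python
def tautomeric_bonds(tautomers):
    """Bonds whose order is not the same in every tautomer."""
    all_bonds = set().union(*tautomers)
    return {bond for bond in all_bonds
            if len({t.get(bond, 0) for t in tautomers}) > 1}

def strip_tautomeric_ez(ez_labels, tautomers):
    """Drop E/Z labels on any bond that changes order in some tautomer."""
    changing = tautomeric_bonds(tautomers)
    return {bond: label for bond, label in ez_labels.items()
            if bond not in changing}

# Bond orders along the side chain of the drawn structure (from the molfile).
drawn = {(5, 6): 2, (6, 7): 1, (7, 8): 1, (8, 10): 2}
# Hypothetical tautomer in which the double bond shifts from C8=S10 to N7=C8.
shifted = {(5, 6): 2, (6, 7): 1, (7, 8): 2, (8, 10): 1}

# Placeholder labels: where each E/Z annotation came from.
labels = {(5, 6): "from input", (7, 8): "from coordinates"}
print(strip_tautomeric_ez(labels, [drawn, shifted]))
# {(5, 6): 'from input'}
```

Only the annotation on the bond that can become single is discarded; the bond that stays double in both forms keeps its label.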

caodac commented 10 years ago

By default, any double bond generated by tautomer enumeration should not have any E/Z stereo annotated. The problem might be due to the output serializer, which automatically perceives E/Z stereo based on the coordinates regardless of whether E/Z stereo is annotated.

tylerperyea commented 10 years ago

That makes sense. The simple solution, then, is to force a "double either" bond type on all tautomer-generated bonds, rather than a typical "double". I believe you can do that explicitly, and it should behave exactly like an unannotated double bond, regardless of coordinates.
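
For reference, in a V2000 molfile the "double either" annotation is the bond stereo field: for a double bond, 0 means "use the coordinates to determine cis or trans" and 3 means "cis or trans (either)". The sketch below rewrites that field on a single bond line. The function name is hypothetical, and the 7-8 double bond line is an invented example of a tautomer-generated bond, not a line from the molfiles above.

```python
def mark_double_either(bond_line):
    """Set the stereo field of a V2000 bond line to 3 ('cis or trans')
    when the bond is a double bond; return any other line unchanged.

    V2000 bond lines use fixed 3-character columns:
    first atom, second atom, bond type, bond stereo, ...
    """
    bond_type = int(bond_line[6:9])
    if bond_type != 2:
        return bond_line
    return bond_line[:9] + "  3" + bond_line[12:]

generated = "  7  8  2  0     0  0"   # hypothetical tautomer-generated double bond
single = "  6  7  1  0     0  0"      # single bond, copied from the molfile

print(repr(mark_double_either(generated)))  # '  7  8  2  3     0  0'
print(repr(mark_double_either(single)))     # '  6  7  1  0     0  0'
```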

southalln commented 10 years ago

That is kind of unsatisfying: if one had started with the 'canonical tautomer', it would be considered to have E/Z stereo, but the other molecule would get a different InChI with an explicit lack of E/Z stereo. The point is that we expect all these structures to be in equilibrium with one another. Can we generate both explicit E and Z when there are tautomer-generated double bonds? Then choose one canonical structure across the whole set. If that one winds up having E or Z, so what; at least all the right starting structures will be grouped together.

Noel
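
One way to picture this proposal is the toy Python sketch below. It is not how LyChI enumerates tautomers: the skeleton names "thione" and "thiol", the choice of bond 7-8, and the use of `min` as a stand-in for a real canonical ranking are all assumptions made for illustration.

```python
from itertools import product

def expand_generated_bonds(skeleton, generated_bonds):
    """All E/Z assignments for the double bonds created by tautomerization."""
    return {
        (skeleton, tuple(zip(generated_bonds, labels)))
        for labels in product("EZ", repeat=len(generated_bonds))
    }

def canonical_form(tautomers):
    """Pool every tautomer with every E/Z assignment, then pick one.

    `min` over the pooled tuples stands in for a real canonical ranking."""
    pool = set()
    for skeleton, generated_bonds in tautomers:
        pool |= expand_generated_bonds(skeleton, generated_bonds)
    return min(pool)

# Hypothetical tautomer set: the drawn form has no generated double bond,
# the shifted form has one (between atoms 7 and 8).
from_drawing_1 = [("thione", ()), ("thiol", ((7, 8),))]
from_drawing_2 = [("thiol", ((7, 8),)), ("thione", ())]

print(canonical_form(from_drawing_1))
print(canonical_form(from_drawing_1) == canonical_form(from_drawing_2))  # True
```

Because the drawn geometry of a generated bond never enters the pool, both starting drawings end up with the same canonical choice, even though that choice carries an explicit E or Z.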

tylerperyea commented 10 years ago

I think that's reasonable. I believe you're saying that any E/Z center annotations on tautomeric bonds should be effectively disregarded/collapsed. I think I agree with this. The simplest approach, I believe, is just to force all double bonds that can undergo tautomerism into double/either bonds.

However, this is where I need someone more knowledgeable about organic chem than I am. I am embarrassed to admit that I have a hard time distinguishing some of the "equilibrium problems" from "resonance / failures of valence bond model". For example, the following are equivalent in the standardizer:

[conjugate] https://f.cloud.github.com/assets/1581898/1789710/f4db43ce-6960-11e3-9aa2-5f55bdbab1ee.png

Is it actually the case that the double bonds are interconverting, and allowing for different free-rotation around the single bonds (in which case all E/Z is then an illusion)? Or is it that this is a conjugated pi system, with a locked conformation, imperfectly described as alternating double bonds (in which case we need to respect orientation regardless of which form is used for drawing)? More practically, are the following also strictly equivalent to the above:

[others] https://f.cloud.github.com/assets/1581898/1789754/4e1c1a70-6962-11e3-9fa0-8a72a71d5012.png

southalln commented 10 years ago

Ok, good point. The electrons are delocalized - but the geometry is relatively fixed.

http://onlinelibrary.wiley.com/doi/10.1002/jhet.5570200439/abstract

I can't get the PDF, but the abstract seems to imply that the hydrazide (with free rotation) is in equilibrium with the hydrazone in solution, so there is not a true E/Z center here.

So, getting back to my point: when you create an E/Z center, you should include both E and Z as tautomers ... In the example at the bottom below, there is never a tautomer where you are creating a new center; the -[NH]- case isn't one of the tautomer forms enumerated here.

If it were, then ....

-N-N=[eingang]ring_kekule_one
-N-[NH]-ring_kekule_two
-N-N=[cisgang]ring_kekule_one

tylerperyea commented 10 years ago

Not sure if I followed you completely ... but I think I get what you're saying. From what I can see, generating both E & Z forms for every new tautomer, and arbitrarily picking one canonical form, with one canonical E/Z isomerism is reasonable. I think it's functionally equivalent to ignoring E/Z isomerism among atoms involved in tautomerism (which also seems reasonable to me). If there are times when E/Z in alterable bonds is meaningful though, I'm not sure I catch the general rule that could be coded.

To further clarify, the following appear quite different to me, and I'd intuitively give them different hashes. But the difference is technically in the s-cis/s-trans conformation, so both produce the same SMILES, InChI, etc.:

[another] https://f.cloud.github.com/assets/1581898/1790106/7168497c-696c-11e3-8c99-d38e1165cdc2.png

Once the bonds are alternated, the superficial annotation becomes a real annotation, and that center becomes true cis, which suddenly makes these two structures distinct. If we say all such cases are just considered to be interconverting, this is pretty easy. But I don't think that's the case here (is it?). If these two things are different, we'd have to make sure the tautomer canonicalizer lands on a form that preserves their differences. I'm not sure if that's reasonable or not ...

southalln commented 10 years ago

Perhaps this is only a problem for compounds with a hetero atom at either end of the conjugation, or maybe also in the middle of conjugation.

Noel

tylerperyea commented 10 years ago

Sorry, I just observed that these are the same.

They generate the same InChIKeys:

InChIKey=ZBJVOFZCUWXHIV-ADCBRGLKSA-N
InChIKey=ZBJVOFZCUWXHIV-ADCBRGLKSA-N

But different hashes:

TTNSJCV6Q-Q1HV946XCC-QC87J93L3QQ-QCQ6APSFXZ1K
TTNSJCV6Q-Q1HV946XCC-QC87J93L3QQ-QCQX5C242L1M

I'm not really sure if they should be the same or different ... I will do some literature searches / phone-a-friend.

caodac commented 10 years ago

Again, this is similar to the original example. The only reason they would be different is the coordinates. If you feed the input as SMILES, then the hash keys should be the same. Notice that InChI says that there is an E/Z (not sure which) in this example.

caodac commented 10 years ago

Ok, I've just pushed the fix upstream (532cd27). Basically, the code was setting the bond flag to 0 instead of CIS|TRANS.
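
A toy Python model of why the flag value matters, assuming the serializer behaves as suggested earlier in the thread (an unflagged double bond gets its geometry perceived from the coordinates). The bit values and the function name are hypothetical and are not the real constants or API of the underlying toolkit.

```python
# Hypothetical bit values, for illustration only.
CIS = 1
TRANS = 2

def written_geometry(flags, perceived_from_coordinates):
    """Geometry a serializer would emit for a double bond in this toy model."""
    if flags & CIS and flags & TRANS:
        return "either"                 # explicitly unspecified
    if flags & CIS:
        return "cis"
    if flags & TRANS:
        return "trans"
    return perceived_from_coordinates   # flag 0: fall back to the drawing

# Before the fix: flag 0, so the two drawings yield different geometries.
print(written_geometry(0, "cis"), written_geometry(0, "trans"))
# After the fix: CIS|TRANS, so the drawing no longer matters.
print(written_geometry(CIS | TRANS, "cis"), written_geometry(CIS | TRANS, "trans"))
```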