ncats / lychi

Layered Chemical Identifier
Apache License 2.0
14 stars 10 forks source link

Fix symmetry problem with isotopes / certain bond configurations. #18

Closed tylerperyea closed 3 years ago

tylerperyea commented 5 years ago

One strange case recently discovered deals with deuterium. Lychi layer 3 should be the same for 2 structures that share atom labels, connectivity, and bond order even if they differ in isotope specifications or stereochemistry.

It was found, however, that this isn't always the case. For instance, consider the following example:

image

[H][C@](O)(C=O)[C@@]([H])(O)[C@]([H])(O)[C@]([H])(O)C([2H])([2H])O
[ZDW67PZ1Z-Z9F4TA3WHF-ZF112L93LTY-ZFY5ANL7Y5D6]

image

[H][C@@](O)(CO)[C@@]([H])(O)[C@]([H])(O)[C@@]([H])(O)C=O
[ZDW67PZ1Z-Z9F4TA3WHF-ZF4AR62864F-ZFF89YX42YF1]

You would expect the 2 above to receive the same lychi layer 3, but they don't. The reason for this is more complicated than you might think, and actually has some other side effects. To understand this, it's helpful to have a little background on how the lychi itself works. Lychi first standardizes a structure (basic cleanup/charge normalizing / tautomer selection) and this results in a fairly well-standardized molecule object. Next, it creates canonical smiles for that standardized molecule, and attempts to create a new molecule object from being reloaded. This effectively standardizes the order of the atoms so there's a basic assumption that they are roughly in a canonical ordering. All other calculations for layers then occur after this.

For Layer 1, this is accomplished by setting all atoms to be carbon, and all bonds to be single, and then taking the canonical smiles. For Layer 2 this is accomplished by walking through the molecule starting with the lowest Layer-1 canonical atom and just reporting the atomic symbols. For Layer 3, this is accomplished by walking through the molecule in the Layer-1 canonical order and just reporting the number of implicit hydrogens at each site.

The problem with this approach is that it fundamentally uses an atom order established by the layer-1 carbon skeleton, but layer 2 / layer 3 may have symmetry-breaking "colorings" on the atoms that would make one path distinct from another, where the layer-1 view would consider the choice of paths equivalent. To fix this, the graph invarient numbers calcuated for each node are just multiplied by the atomic number, as shown below:

image

There are a few issues with this approach:

  1. The way the graph invariant is used for canonical numbering is by starting from the lowest value node and "walking" to each neighbor node that has the lowest graph invariant number. This localized walk, in general, doesn't have a good way to deal with "ties".
  2. The basic "tie-breaking" addition of the atomic number factor does break some layer-1 ties, but it can actually create other ties for priority. If the original graph invariant numbers were 2 and 3 for 2 different nodes, for example, and those nodes turned out to be F and C respectively, they would end up having a tie for priority as 2x9 is the same as 3x6 (this happens in a few known cases).
  3. There is no tie-breaking addition that considered bond-order, so there is no clear deterministic way to walk a symmetric chemical graph whose symmetry was broken by the addition of 1 or more double/triple bonds.

All of these issues turn out not to be as catastrophic as they might first appear because the defualt tie-breaking rule for priority is to use the full "layer-4" canonical smiles order of the molecule. This means that most of the time the ties will be resolved in a deterministic way. However, if a small change to a molecule's stereochemistry, isotopic state, radical or charge occurs, and this results in a change to the underlying canonical smiles ordering, then the tie-breaking rules will no longer work in an equivalent way. This means that changes to many things that appear to be layer-4 criteria can actually end up changing either layer 3 and/or layer 2.

This pull request does not address every concern possible, but it does make a basic attempt to fix 2 and 3 above. It does this in a pretty simple way. For issue 2 above, it first multiplies the atomic number*graph invariant, as before, but it also multiplies by a large number, so as to give some "breathing room" for breaking ties. It then breaks all "collision" ties by adding a multiple (5) of the atomic number to the rank number.

To fix issue 3 above, it also adds a tie-breaking step to the rank score by subtracting the implicit hydrogen count (a stand-in for the bond order) to each atom's rank. Since this number will always always be less than 5, the tie breaking here should be "below" the resolution of atomic number tie-breaking shown above.

This is not really an ideal solution to fix this set of bugs, but it's meant to be a simple way to fix 99% of the broken cases while not changing the underlying lychi hashes.

tylerperyea commented 5 years ago

Note that this also includes the pull from https://github.com/ncats/lychi/pull/17/