pckroon / pysmiles

A lightweight python-only library for reading and writing SMILES strings
Apache License 2.0
147 stars, 21 forks

init draft aromatic #38

Closed fgrunewald closed 6 months ago

fgrunewald commented 6 months ago

This is the initial draft of the new aromaticity algorithm following the ideas outlined here and here.

The key difference is that we are not trying to assign chemical aromaticity but rather to kekulize the molecule (i.e. to fix the hcount). In a nutshell, the algorithm proceeds as follows:

  1. Assign a preliminary hcount to all non-hydrogen atoms. This is a bit awkward but needed for the next step, because we want to be able to deal with cases where implicit hydrogens are part of the non-aromatic atoms.
  2. Remove all nodes that have a full valence, which we can only assess after implicit hydrogens have been added to the non-aromatic atoms.
  3. Get the connected components of the resulting fragmented graph. Each component has to be a delocalized system and is potentially aromatic.
  4. For each component, check whether a maximum matching exists and whether that matching is perfect. If it is not perfect, the delocalized system is written incorrectly and a syntax error is raised. This is like checking that single and double bonds alternate perfectly.
    5a. If the system is cyclic, we assume it to be (anti-)aromatic and give it a bond order of 1.5.
    5b. If it is not cyclic, we simply assign a bond order of 2 to the edges that constitute the perfect matching.
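
To make steps 2 to 5 concrete, here is a minimal, standard-library-only sketch. It is not the code from this PR. All names (`assign_bond_orders`, `missing`, `ring`, `chain`) are hypothetical, the molecule is a plain adjacency dict with integer atom indices, and the perfect matching is found by brute-force backtracking. A real implementation would more likely use a proper maximum-matching algorithm, for example the one in networkx. The sketch also assumes that "cyclic" means the component contains any ring, and that 1.5 is then assigned to every edge in that component.

```python
def connected_components(nodes, adjacency):
    """Yield the connected components of the subgraph induced by `nodes`."""
    unseen = set(nodes)
    while unseen:
        start = unseen.pop()
        component, stack = {start}, [start]
        while stack:
            for neighbour in adjacency[stack.pop()]:
                if neighbour in unseen:
                    unseen.remove(neighbour)
                    component.add(neighbour)
                    stack.append(neighbour)
        yield component


def perfect_matching(component, adjacency):
    """Return edges covering every node exactly once, or None if impossible.

    Brute-force backtracking: only suitable for small illustrative graphs.
    """
    if not component:
        return []
    node = min(component)
    for neighbour in sorted(adjacency[node] & component):
        rest = component - {node, neighbour}
        matching = perfect_matching(rest, adjacency)
        if matching is not None:
            return [(node, neighbour)] + matching
    return None


def assign_bond_orders(adjacency, missing):
    """Map frozenset({a, b}) -> bond order for edges in delocalized systems.

    `missing` is the assumed result of step 1: open valences per atom.
    """
    # Step 2: drop atoms whose valence is already full.
    unsaturated = {node for node, count in missing.items() if count > 0}
    orders = {}
    # Step 3: every remaining component is one delocalized system.
    for component in connected_components(unsaturated, adjacency):
        # Step 4: the system must admit a perfect matching.
        matching = perfect_matching(component, adjacency)
        if matching is None:
            raise SyntaxError(f"cannot kekulize atoms {sorted(component)}")
        edges = {frozenset((a, b))
                 for a in component for b in adjacency[a] & component}
        if len(edges) >= len(component):
            # Step 5a: a connected graph with >= as many edges as nodes has a ring.
            orders.update({edge: 1.5 for edge in edges})
        else:
            # Step 5b: acyclic, so only the matched edges become double bonds.
            orders.update({frozenset(edge): 2 for edge in matching})
    return orders


def ring(n):
    """Adjacency dict of an n-membered ring."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def chain(n):
    """Adjacency dict of a linear chain of n atoms."""
    return {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}


benzene = assign_bond_orders(ring(6), {i: 1 for i in range(6)})
print(set(benzene.values()))  # {1.5}

butadiene = assign_bond_orders(chain(4), {i: 1 for i in range(4)})
print(sorted(tuple(sorted(edge)) for edge in butadiene))  # [(0, 1), (2, 3)]
```

With a five-membered ring in which every atom still has one open valence (the `c1cncc1` case), no perfect matching exists and the sketch raises. If atom 2 of that ring is saturated (the `c1c[nH]cc1` case), the remaining four carbons form an acyclic component and get two localized double bonds.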

Some differences in behaviour to the previous version:

| SMILES | VALID | AROMATIC | old | new |
|---|---|---|---|---|
| c1c[nH]cc1 | yes | no | pass | pass |
| c1cNcc1 | yes | no | pass | pass |
| c1cncc1 | no | raises Error | fail (no hydrogen on N) | pass |
| c1cscc1 | yes | no | fail | pass |
| c1cScc1 | yes | no | pass | pass |
| c1cnc[nH]1 | yes | no | pass | pass |
| c1cncN1 | yes | no | pass | pass |
| N12ccccc1ccc2 | yes | no | pass | pass |
| n12ccccc1ccc2 | yes | no | pass | pass |
| c12ccccc1Ncc2 | yes | no | pass | pass |
| c12ccccc1[nH]cc2 | yes | no | pass | pass |
| c12ccccc1ncc2 | no | raises Error | fail (no hydrogen on N) | error |
| c1cscn1 | yes | no | fail | pass |
| cccc | yes | no | fail (raises Error) | pass |
| OCCn2c(=N)n(CCOc1ccc(Cl)cc1Cl)c3ccccc23 | yes | partly | fail (H on aromatic n) | pass |

The molecule from this blog-post mentioned in #19 is also fixed.

Overall I'd say this algorithm is more robust: it raises an Error for hard fails like c1cncc1 but is also lenient towards chemically intuitive SMILES like cccc.

The major problems are:

pckroon commented 6 months ago

Many thanks! First your questions, I'll have a look at the code as well.

how to deal with wildcards? For now they are just ignored because we don't know the valency and bond order. That means [*]1[*][*]1 is not aromatic anymore.

Wildcards should always be able to form a double bond without inducing a charge, so they should be part of the delocalized subgraph. I'm not a hundred percent sure why [*]1[*][*]1 should be aromatic though, since it only has 3 atoms.

how to deal with charges in the initial valence assignment? I think it should be missing = valence - bonds + charge

I need to brood on this. Can this be summarized in such a general way, or do we need to apply the actual octet rule?
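
As a quick way to probe the proposed formula, the sketch below evaluates it for a few charged atoms. The function name `missing_bonds` and the default valences used (N = 3, O = 2, B = 3) are assumptions made for illustration; nothing here is taken from the PR's code.

```python
def missing_bonds(valence, bonds, charge):
    """Proposed rule from the thread: missing = valence - bonds + charge."""
    return valence - bonds + charge


# (label, assumed default valence, number of bonds, formal charge)
cases = [
    ("N in ammonium [NH4+]", 3, 4, +1),
    ("O in an alkoxide [O-]", 2, 1, -1),
    ("O in oxonium [OH3+]", 2, 3, +1),
    ("B in borohydride [BH4-]", 3, 4, -1),
]

for label, valence, bonds, charge in cases:
    print(label, missing_bonds(valence, bonds, charge))
```

The nitrogen and oxygen cases come out at 0, as hoped. The boron case gives -2, which suggests the simple formula does not hold for electron-deficient elements and that the octet-rule question is a fair one.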

how to assign aromaticity for fused rings? Currently naphthalene is not aromatic, which might be fine?

This is a problem IMO; naphthalene does show DIME (delocalization-induced molecular equality).

fgrunewald commented 6 months ago

@pckroon small update: naphthalene is now assigned correctly; I had an error earlier. All systems that show DIME are identified as aromatic as far as the test cases go. Only those like thiophene are not.

pckroon commented 6 months ago

Could you add 2 more test cases? One with a triangle, and one that cannot be kekulized and trips the error. That should bring the coverage back up.