mojaie / MolecularGraph.jl

Graph-based molecule modeling toolkit for cheminformatics
MIT License
189 stars 27 forks source link

bond aromaticity not stored in `SmilesBond` #75

Closed eahenle closed 1 year ago

eahenle commented 1 year ago

Loading an aromatic molecule from SMILES using smilestomol gives a GraphMol object, mol. isaromaticbond(mol) returns the correct BitVector for which bonds are aromatic; [x.isaromatic for x in mol.edgeattrs] does not.

Reading the values from the struct is 10x faster than running the cached query function, but because the struct is immutable and the aromaticity is not encoded when the struct is made, this cannot return the correct result for molecules with aromatic bonds. It seems that there should be the option to store the correct value of aromaticity at object instantiation, and/or that the default value for this field should be missing rather than false.

mojaie commented 1 year ago

Yes, as you mentioned, GraphMol node/edge objects are immutable, and calculated descriptors like isaromaticbond vectors can be stored as cache. As I'm a big fan of reproducibility, I always want to avoid changing original values to get consistent results. But in terms of performance, I agree that there should be options to make struct node/edges that have calculated descriptors (FastAtom/Edge?). Or you can interface Atom/Bond to make new Atom/Bond type to do so.

Here is a concept about the design Basics of molecular graph https://nbviewer.org/github/mojaie/MolecularGraph.jl_notebook/blob/master/notebook/molecularGraphBasics.ipynb

Some graph operations need fast reading of descriptor values of each elements, so I'm also very interested in improving performance. I'd be happy if you share benchmarking results comparing struct and cashed array.

eahenle commented 1 year ago

Notebook showing benchmarks: 🎈 Fun lecture.jl — Pluto.jl.pdf

I think the best solution is to store the correct bond aromaticity when the GraphMol is created. It will make object instantiation a little slower, but there will be no possibility of the user misinterpreting the information stored in mol.bondattrs[i].isaromatic. To keep the instantiation fast but solve the misunderstanding, storing the aromaticity labels could be optional, but in that case isaromatic should be set to missing.

mojaie commented 1 year ago

The molecule in your demo code is implicit aromaticity (e.g. C1=CC=CC=C1). You may realize the problem is not so simple if you deal with large fused ring systems.

At first, isaromaticbond calculate SSSR (the smallest set of smallest ring) and check if satisfies the Huckel rule for each smallest rings. The complexity of SSSR (minimum cycle basis) detection is known to be $O(νm^3)$, where $ν$ is the number of rings and $m$ is the number of edges (Vismara, 1997).

If you are confident enough with the aromaticity of the molecule, you can use explicit aromaticity query(e.g. c1ccccc1) and then maybe you can skip SSSR calculation.

eahenle commented 1 year ago

Another version of the notebook with the previous example SMILES encoded with explicit aromaticity and a large fused ring example: 🎈 Surprising blueprint.jl — Pluto.jl.pdf

The explicit aromaticity encoding makes isaromaticbond about twice as fast, but it is still milliseconds without precalculate!, microseconds with precalculate!, and around 100 nanoseconds by reading from the struct. The fused ring example does take a lot longer, but accessing the data in the struct is still the fastest approach by a wide margin (tens of milliseconds, tens of microseconds, and under 1 microsecond, respectively).

If I understand correctly, you are suggesting a parameterized type construction for GraphMol{FastAtom, FastEdge} that would store the calculated values? I could make a PR for that. But, I still strongly suggest changing the existing GraphMol{SmilesAtom, SmilesEdge} constructor to assign missing instead of false in the isaromatic field, so that it cannot be misconstrued as actual data (I am willing to implement this, as well).

mojaie commented 1 year ago

Thank you for sharing your benchmark results. I think faster struct node/atom is useful, too. But I want to avoid direct replace of SmilesAtom/Edge because it is a really breaking change, now. I believe vector representation of the descriptors are also useful, so I need a few more other opinions.

I checked the SMARTS modules and found that SmilesAtom/Edge construction is hard-coded in the SMARTS methods. One possible solution is to use parameterized types for these functions.

For example,

https://github.com/mojaie/MolecularGraph.jl/blob/00c880e6b11b55be859bf01ee0f920ae957175b3/src/smarts/bond.jl#L70-L81

should be

function bond!(state::SmartsParser)
    if read(state) == '~'
        forward!(state)
        return edgetype(state)()
    end
    fml = lglowand!(state, bondsymbol!)
    if fml !== nothing
        fml = tidyformula(fml)
        return edgetype(state)(fml)
    end
    # return nothing: Invalid bond token or implicit single bond
end

and implement edgetype(state) that returns SmartsParser edge type.

Many atom/bond functions are implemented against Atom/Bond interface, so then SmartsParser{FastAtom,FastBond} would work (I hope).

mojaie commented 1 year ago

Maybe someone want to implement GraphMol{CoolAtom,CoolBond}, CoolMol{SmilesAtom,SmilesBond} or something, and also nice_descriptor(GraphMol), another_implementation_of_nice_descriptor(GraphMol) inside/outside this library. I think it is important that everyone can choose or implement their most appropriate models and methods for their use. Of course, speed is appealing, so the day may come when the faster implementation will be the default.

SimonEnsemble commented 1 year ago

is it not dangerous/poor practice/ misleading to have an attribute as false in a data structure that is not false but rather missing or un-determined? this has caused a bug in our code before.

mol = smilestomol("C1=CC(=CC=C1C(=O)O)C(=O)O")

[bond.isaromatic for bond in mol.edgeattrs] # all false. incorrect. should be array of missing.

isaromatic(mol) # mix of true and false. correct.
mojaie commented 1 year ago

Still It is not documented well, nodeattrs/edgeattrs are the store of parsed SMILES/SDFile as it is, but not a calculated descriptor. The concept of design is that SMILES/SDFile encodings should be consistent with nodeattrs/edgeattrs. Bond.isaromatic means explicit aromaticity in SMILES notation rather than actual aromaticity. As mentioned above, actual aromaticity is a descriptor calculated from the molecular graph and its properties, so it depends on how we calculate the aromaticity (SMILES is a specification, so this should be consistent among all implementations though there may be many dialects; but aromaticity is a descriptor calculated by their own methods based on their own experimental design).

Please refer the test cases below to know which molecules are considered to be aromatic in this library.

https://github.com/mojaie/MolecularGraph.jl/blob/00c880e6b11b55be859bf01ee0f920ae957175b3/test/properties.jl

eahenle commented 1 year ago

image

Explicitly encoding the aromaticity in the SMILES input has no effect on the information stored in the GraphMol struct. More important, though, is the fact that by defaulting to false, Bond.isaromatic is claiming that a bond is explicitly not aromatic, when the truth is that the aromaticity has not been determined (which is why it should be missing).

Also, the test cases are all SMILES inputs with implied aromaticity. Using the isaromatic or isaromaticbond functions (which are tested) returns the correct result--but, again, that result is inconsistent with the data stored in the GraphMol for the same inputs.

mojaie commented 1 year ago

I checked the code, and found that Bond.isaromatic is not affected by explicit atom aromaticity (e.g. c), but explicit bond aromaticity (i.e :). I'm sorry for the misunderstanding. Bond aromaticity actually cannot be inferred only from atom aromaticity in some cases (e.g. biphenyl), and requires ring topology analysis (SSSR). As mentioned above, Bond/Atom.isaromatic is just a specification, not a descriptor, means explicit aromaticity in SMILES notation.

eahenle commented 1 year ago

image

That still does not change the values stored in Bond.isaromatic, and it is still incorrect either way to say that a bond is "specified" as not aromatic in SMILES (this cannot actually be done!)

mojaie commented 1 year ago

Maybe

smiles = "c1:c:c:c:c:c1"

should be

smiles = "c:1:c:c:c:c:c1"

but if this works, that means anyway ring topology analysis was performed as aromaticity validation...I wonder how we can skip it

mojaie commented 1 year ago

Yes, I finally understand that omitted bond (e.g. CC) can be both single/aromatic so parsing it as missing makes sense! (explicit single bond C-C should be false)

mojaie commented 1 year ago

smilestomol does kekulize! to assign kekulized bond notation for explicit aromatic bonds and to validate aromaticity (so, maybe c1:c:c:c:c:c1 should raise error in kekulize!, this should be added to test).

Please consider to use parse(SMILES, mol) instead of smilestomol(mol) to skip kekulization and diastereomer standardization. Note that some descriptors will not work without kekulization due to ambiguous valence information.

mojaie commented 1 year ago

In my environment (MolecularGraph.jl v0.11.0, Julia 1.8),

smiles = smilestomol("c1:c:c:c:c:c1")
[bond.isaromatic for bond in smiles.edgeattrs]
6-element Vector{Bool}:
 1
 1
 1
 1
 1
 0

and

smiles = smilestomol("c:1:c:c:c:c:c1")
[bond.isaromatic for bond in smiles.edgeattrs]
6-element Vector{Bool}:
 1
 1
 1
 1
 1
 1
eahenle commented 1 year ago

image

Confirmed that using explicit aromatic bonds in SMILES string gives the correct result; I must have made a mistake before, sorry.

Please consider to use parse(SMILES, mol) instead of smilestomol(mol)

This gives the same result as smilestomol for inputs with implicit aromaticity. Maybe this is a good place to add the option to infer aromatic bonds? i.e. parse(SMILES, smiles_string; check_aromatic=true) could allow the user to run a check for Hückel aromaticity?

explicit single bond C-C should be false

I still do not agree with this. By specifying a single or double bond, one is not stating that the bond is definitely not aromatic, because C1-C=C-C=C-C1 is still benzene and therefore aromatic. : is the only character in SMILES that specifies whether a bond is or is not aromatic--anything else is ambiguous by nature.

mojaie commented 1 year ago

Actually kekulize! in SMILES does not perform Huckel rule check (my previous explanation is wrong). Just put double bonds to the aromatic ring for some valence calculations. As you mentioned, some more low cost validation method for explicit aromaticity is important.

Explicit single bond C-C cannot be aromatic in many SMILES/SMARTS specifications as far as I know (so, a bit strange, but a strict interpretation of C=1-C=C-C=C-C1 would be a "non-aromatic Kekule benzene").

eahenle commented 1 year ago

The SMILES specification says aromaticity should be determined at input time (pg 35, subsection 4), and gives the example O=C(O)C1=CC=CC=C1 (pg 33, subsection 6). It also says there is no difference between the notation CC and C-C (pg 32, subsection 2).

So, the creator of SMILES believes that = is not a specifically non-aromatic double bond, - is not a specifically non-aromatic single bond, and C1=CC=CC=C1 is equivalent to c1ccccc1. The ability to enter a Kekulé structure and have it be recognized as (extended) Hückel aromatic considered a necessary feature to make SMILES unambiguous. It should not, for example, be possible when using a SMILES-based database system to miss records containing benzene rings due to differences in how two users chose to encode the strings.

Further, allowing - to specify a non-aromatic bond leads to some very odd input interpretations. c1-c-c-c-c-c1 is a benzene ring with no aromatic bonds, yet every atom is aromatic. C1:C:C:C:C:C1 is similarly a benzene ring with all aromatic bonds, but no aromatic atoms. It could be argued that these are simply nonsensical inputs and should be rejected by the software, but the original intent behind SMILES was to obviate these kinds of distinctions.

Refer also to web content from the SMILES creator David Weininger: https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html https://www.daylight.com/meetings/summerschool98/course/dave/smiles-convent.html https://www.daylight.com/meetings/summerschool98/course/dave/smiles-bonds.html And additional blog posts on the topic from the last two years: https://depth-first.com/articles/2020/02/10/a-comprehensive-treatment-of-aromaticity-in-the-smiles-language/ https://depth-first.com/articles/2021/06/30/writing-aromatic-smiles/

mojaie commented 1 year ago

SMILES/SMARTS is not a standard, so in this library SMILES/SMARTS parsers are mainly based on Daylight and OpenSMILES specs. I don't think these documents allow such ambiguity in -. As SMILES is not for query, this may be not so critical problem, but querying explicit single bond is very important in SMARTS (as you showed in MACCS queries in the previous issue). I think there is no reason to actively reject c1-c-c-c-c-c1 and C1:C:C:C:C:C1 even if they are strange. We can validate it in each own application as necessary.

eahenle commented 1 year ago

Yes, that use of - in SMARTS is definitely important. And I agree that those SMILES strings should not be rejected. But I still think the solution to accepting those strings, and all other representations of benzene, should result in GraphMol objects identical to that obtained with c1ccccc1, as that kind of input-to-representation convergence is a core principle in the original publication. And even if that is not the default behavior here, it should be an option when parsing SMILES inputs.

mojaie commented 1 year ago

I agree. SMILES/SMARTS is different among implementations, so the parser option that is compatible with other implementations (or the original publication) would be quite useful.