pckroon / pysmiles

A lightweight python-only library for reading and writing SMILES strings
Apache License 2.0

write_smiles can create invalid SMILES when provided with chemically invalid graphs #17

Open craldaz opened 3 years ago

craldaz commented 3 years ago

For some reason my graph produces SMILES in which aromatic groups are written with aromatic bond symbols, e.g. NC:1:N:N:C:[N]1N.

RDKit does not recognize these symbols and strips all aromaticity, producing NC1NNCN1N; Open Babel gives the same result.

Some have speculated that it's a SMARTS string:

https://mattermodeling.stackexchange.com/questions/4981/how-to-canonicalize-smiles-written-with-aromatic-bond-symbols

Others just say it's wrong:

https://github.com/openbabel/openbabel/issues/2368

Do you know what is going on?

Thanks for your help!

pckroon commented 3 years ago

It's complicated :) See also section 3.5 of http://opensmiles.org/opensmiles.html, and let me highlight the following sentence:

The aromatic-bond symbol ':' can be used between aromatic atoms, but it is never necessary; a bond between two aromatic atoms is assumed to be aromatic unless it is explicitly represented as a single bond '-'. However, a single bond (nonaromatic bond) between two aromatic atoms must be explicitly represented.
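
To make the quoted rule concrete, here is a minimal sketch of how a writer could choose the bond symbol between two bonded atoms. The helper `bond_symbol` and its arguments are hypothetical and are not taken from pysmiles; a bond order of 1.5 is assumed to mean "aromatic bond".

```python
def bond_symbol(order, first_aromatic, second_aromatic):
    """Return the SMILES bond symbol to write between two bonded atoms.

    Hypothetical helper illustrating the OpenSMILES rule quoted above;
    it is not the pysmiles implementation. An order of 1.5 is assumed
    to denote an aromatic bond.
    """
    both_aromatic = first_aromatic and second_aromatic
    if order == 1.5:
        # Implied between two aromatic (lowercase) atoms. If the atoms are
        # not flagged aromatic, the bond has to be spelled out with ':',
        # which gives strings like the one in the original report.
        return '' if both_aromatic else ':'
    if order == 1:
        # A single bond between two aromatic atoms must be explicit.
        return '-' if both_aromatic else ''
    # Double, triple and quadruple bonds are always written.
    return {2: '=', 3: '#', 4: '$'}[order]


# Biphenyl: the bond joining the two rings is single, so it is written out.
ring = 'c1ccccc1'
biphenyl = ring + bond_symbol(1, True, True) + ring
print(biphenyl)  # c1ccccc1-c1ccccc1
```

Under these assumptions, an order-1.5 bond between two atoms that are not marked aromatic is the only case that yields ':', which matches the pattern in NC:1:N:N:C:[N]1N.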

How did you generate your graph? The write_smiles function does minimal chemical interpretation of your graph to avoid guessing wrong. All it does is remove explicit hydrogens (where possible). To mark aromatic regions in your molecule, represent them with lowercase element symbols (e.g. Nc1nncn1N). Pysmiles does provide a helper function for this (correct_aromatic_rings), but deciding what is or is not aromatic is a surprisingly nontrivial problem, in particular once extracyclic atoms need to be taken into account. There's a fairly detailed description of what this function does in the readme.

I hope this helps, or at least provides you with a workaround...

PS. thanks for the SE link, it's an interesting discussion.

edit: PPS: I agree with the assessment that pysmiles produced an invalid SMILES in this case. I'm debating whether I'll fix this, or whether it's better to leave write_smiles as a dumb serializer. Your graph is also chemically invalid, so I'm kind of ok with the resulting SMILES also being chemically invalid. The roundtrip graph -> write_smiles -> read_smiles -> graph should always produce the exact same graph, whether it makes chemical sense or not.

fgrunewald commented 1 month ago

@pckroon this should be fixed now. In fact, there was a bug in the writer, but I sneakily fixed it as part of the aromatic overhaul.