pckroon / pysmiles

A lightweight python-only library for reading and writing SMILES strings
Apache License 2.0

Misinterpretation of the ring closure bonds #34

Open jsjyhzy opened 1 year ago

jsjyhzy commented 1 year ago

Dear developers

In some SMILES strings, the ring-closure digit is written after a branch instead of directly after the atom, for example C(CC)1OC1 rather than C1(CC)OC1. OpenBabel accepts both forms and yields the same structure. This can be checked with the commands obabel -ismi -:'C1(CC)OC1' -osmi and obabel -ismi -:'C(CC)1OC1' -osmi, which both produce the SMILES C1(CC)OC1.

However, pysmiles behaves differently:

>>> import pysmiles
>>> number_first = pysmiles.read_smiles('C1(CC)OC1')
>>> number_last = pysmiles.read_smiles('C(CC)1OC1')
>>> number_first.edges
EdgeView([(0, 1), (0, 3), (0, 4), (1, 2), (3, 4)])
>>> number_last.edges
EdgeView([(0, 1), (0, 3), (1, 2), (2, 4), (3, 4)])
>>> number_inner = pysmiles.read_smiles('C(CC1)OC1')
>>> number_inner.edges
EdgeView([(0, 1), (0, 3), (1, 2), (2, 4), (3, 4)])

The number_last expression is misinterpreted: pysmiles attaches the ring closure to the last atom of the branch, giving the same edges as C(CC1)OC1. Could pysmiles be made permissive toward this form?
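
To make the difference concrete, the edge lists from the session above can be compared without pysmiles. This is only a minimal sketch: `degree_sequence` is a hypothetical helper, not part of pysmiles, and the edge lists are copied by hand from the output shown earlier. Differing degree sequences are enough to show that two graphs cannot be the same structure.

```python
from collections import Counter

# Edge lists copied from the interpreter session in this report.
number_first = [(0, 1), (0, 3), (0, 4), (1, 2), (3, 4)]  # C1(CC)OC1
number_last = [(0, 1), (0, 3), (1, 2), (2, 4), (3, 4)]   # C(CC)1OC1
number_inner = [(0, 1), (0, 3), (1, 2), (2, 4), (3, 4)]  # C(CC1)OC1


def degree_sequence(edges):
    """Return the sorted node degrees of an undirected edge list."""
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return sorted(degrees.values())


print(degree_sequence(number_first))  # [1, 2, 2, 2, 3]
print(degree_sequence(number_last))   # [2, 2, 2, 2, 2]
print(number_last == number_inner)    # True
```

In the first graph, atom 0 has three neighbours (a three-membered ring with an ethyl substituent), while in the second every atom has exactly two neighbours, i.e. a single five-membered ring.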

pckroon commented 1 year ago

Thanks for the report.

I think, if I read the grammar specified in section 2.2 of http://opensmiles.org/opensmiles.html carefully, that C(CC)1OC1 is not valid SMILES.

Looking at:

branched_atom ::= atom ringbond* branch*
branch ::= '(' chain ')' | '(' bond chain ')' | '(' dot chain ')'

a ringbond must always directly follow an atom specification and cannot appear after a chain.

Ideally, I think pysmiles should raise an error in such a case, but I don't know when I'll have time to implement that. I'll leave the issue open at least until then.
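
For illustration, such a check could be sketched as a pre-pass over the raw string. This is not how pysmiles is implemented and not a full SMILES validator. The names `find_misplaced_ringbonds` and `check_ringbond_placement` are hypothetical, and the sketch assumes that a ring-bond digit (or `%nn`) directly after a closing parenthesis, optionally preceded by a bond symbol, is exactly the case ruled out by the grammar above.

```python
import re

# Assumption: ')' followed by an optional bond symbol and then a digit or
# '%nn' is a ring bond placed after a branch. Bracket atoms cannot contain
# ')', so the pattern cannot match inside '[...]'.
_RINGBOND_AFTER_BRANCH = re.compile(r"\)[-=#$:/\\]?(?:\d|%\d\d)")


def find_misplaced_ringbonds(smiles):
    """Return the string offsets of each ')' directly followed by a ring bond."""
    return [m.start() for m in _RINGBOND_AFTER_BRANCH.finditer(smiles)]


def check_ringbond_placement(smiles):
    """Raise ValueError if a ring bond appears after a branch (hypothetical check)."""
    offsets = find_misplaced_ringbonds(smiles)
    if offsets:
        raise ValueError(
            f"ring bond follows a branch at offset(s) {offsets} in {smiles!r}"
        )


check_ringbond_placement("C1(CC)OC1")   # passes silently
check_ringbond_placement("C(CC1)OC1")   # passes silently
try:
    check_ringbond_placement("C(CC)1OC1")
except ValueError as err:
    print(err)
```

A string-level check like this would keep the parser itself untouched; whether the check belongs in the tokenizer instead is a design choice left open here.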