How are SMILES atoms parsed to the indices used in MolecularGraph?

Boxylmer commented 1 year ago

Our group is trying to augment SMILES to work with polymers in MolecularGraph.

To do this, we're considering adding an '&', followed by some context information, to atoms that take part in the repeating-unit connections; we would handle these separately after doing what we need with MolecularGraph.jl. The '&' and its context info would, of course, be removed before the string is passed to the smilestomol function (as they would break it), but we'd like to keep track of the index, in the resulting GraphMol object, of the atom the '&' was next to. I can't seem to find a good way to do this.

*Right now it looks like the indices follow the order of the atoms in the SMILES string, so until I find that this is not the case, I'll assume it's true.

Example pre-treated input: "&C(CC)C&CC" -> we have repeating connections at(*) index 1 and index 5 -> snip out this context for use in smilestomol -> "C(CC)CCC" -> GraphMol

Any ideas on how I could guarantee I know what indices these atoms would have?

mojaie commented 1 year ago

As you might expect, the indices follow the order in which the atoms appear in the SMILES string. This is the correct behavior of the SMILES parser.
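
Given that ordering, the bookkeeping for the '&' markers can be done with plain string processing before the cleaned string is handed to smilestomol. Below is a minimal sketch in Python, chosen only for illustration; the same logic should port directly to Julia. The function name `strip_markers` and the simplified atom tokenizer are assumptions and are not part of MolecularGraph.jl. The sketch handles only a bare '&' placed directly before an atom, as in the example above, and does not parse any extra context info. Indices are 1-based to match Julia.

```python
import re

# One SMILES atom token: a bracket atom, a two-letter halogen, an
# organic-subset symbol (aliphatic or aromatic), or the '*' wildcard.
# This is a simplified tokenizer, not a full SMILES grammar.
ATOM = re.compile(r"\[[^\]]*\]|Cl|Br|[BCNOPSFI]|[bcnops]|\*")


def strip_markers(tagged, marker="&"):
    """Return (clean_smiles, marked_indices).

    marked_indices holds the 1-based positions (in order of appearance)
    of the atoms that were immediately preceded by the marker.
    """
    clean = []        # pieces of the marker-free SMILES string
    marked = []       # 1-based indices of marked atoms
    atom_count = 0    # atoms seen so far, in string order
    pending = False   # True when a marker is waiting for its atom
    i = 0
    while i < len(tagged):
        if tagged[i] == marker:
            pending = True
            i += 1
            continue
        match = ATOM.match(tagged, i)
        if match:
            atom_count += 1
            if pending:
                marked.append(atom_count)
                pending = False
            clean.append(match.group())
            i = match.end()
        else:
            # Bonds, branches, ring-closure digits, etc. pass through unchanged.
            clean.append(tagged[i])
            i += 1
    if pending:
        raise ValueError("marker is not followed by an atom")
    return "".join(clean), marked


clean, marked = strip_markers("&C(CC)C&CC")
print(clean, marked)  # C(CC)CCC [1, 5]
```

Because the tokenizer covers only bracket atoms, the organic subset, and '*', it is worth checking that the number of atoms counted this way matches the atom count of the molecule returned by smilestomol before relying on the recorded indices.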

Boxylmer commented 1 year ago

This is incredibly helpful. Thank you for confirming, and thanks for maintaining this incredible project!

mojaie / MolecularGraph.jl, issue #84