pckroon / pysmiles

A lightweight python-only library for reading and writing SMILES strings
Apache License 2.0
149 stars 21 forks source link

R/S chirality #15

Open pckroon opened 4 years ago

pckroon commented 4 years ago

RIght, so in openSMILES chirality is very elaborate [1]. I suggest supporting only tetrahedral chiralities, i.e. @ and @@ specifications, and later on E/Z chirality. This issue will be just about the tetrahedral R/S chirality.

The default case is fairly easy to cover since pysmiles guarantees that the node indices in the final molecule are in the same order as the atom entries in the input SMILES. openSMILES says to look along the axis from the "first" atom to the chiral center, and see whether the other atoms go clockwise (@@) or counter clockwise (@). The complication arises from rings: the order of the bonds around the central atom are what matters according to openSMILES. In this case the node indices in the resulting graph do not necessarily correspond to the chiral order. Consider the following cases, where x is the chiral center with 4 different substituents (a, b, c, d), and .. means "any number of different atoms in between".

a[x@](b)(c)d  # Default case 0
b1..a[x@]1(c)d  # case 1
a[x@]1(c)(d)..b1  # case 2
a[x@]21(d)..c1..b2  # case 3
[x@]12cd..a1..b2  # case 4

Finally, implicit hydrogens are considered the "first" atom, but we can shove this under the carpet for now.

This results in the following

I suggest lumping cases 1-4 together for sake of simplicity. This means when parsing we'll need to keep track of all atoms which are involved with ring bonds so we can post-process them. In addition, we'll need to track in which order (chiral) atoms have ring bonds, and once these edges are added to the graph, which node indices they correspond to (because ring indices can (and should) be reused).

So, concretely:

An open question is still how to represent the chirality at the graph level. I suggest sticking to the opensmiles @ @@ idea, which the exception that the order of the node inidices of the neighbours is determining (i.e. the neighbour with the lowest node index is the first, the one with highest last). We can then provide helper functions to translate that to R/S (and the other way). I think this is the most natural way of exposing it:

Open question is also how strict we want to be here: is it an error to provide a chiral specifier on a non-tetrahedral atom? What about symmetric substituents? In other words, what if I provide a chiral specifier for a non-chiral atom?

Lastly, careful thought is required for the SMILES writer, but I think we can push that back a little bit.

@Eljee thoughts?

[1] http://opensmiles.org/opensmiles.html#chirality [2] https://github.com/pckroon/pysmiles/blob/master/pysmiles/read_smiles.py#L128

craldaz commented 3 years ago

I am wondering if you have made any progress on this. The ability to write_smiles with chirality would be very useful.

Represing chirality is straightforward at the graph level as node and edge attributes. For example, as is done in openforcefield.

$ from openforcefield.topology import Molecule
$ mol = Molecule.from_smiles('C/C=C/[C@@H](C)F')
$ nxgraph = mol.to_networkx()
$ nx.get_node_attributes(nxgraph,'stereochemistry')
{0: None, 1: None, 2: None, 3: 'R', 4: None, 5: None, 6: None, 7: None, 8: None, 9: None, 10: None, 11: None, 12: None, 13: None, 14: None}
$ [nxgraph.get_edge_data(*edge) for edge in list(nxgraph.edges())]
[{'bond_order': 1, 'is_aromatic': False, 'stereochemistry': None}, {'bond_order': 1, 'is_aromatic': False, 'stereochemistry': None}, {'bond_order': 1, 'is_aromatic': False, 'stereochemistry': None}, {'bond_order': 1, 'is_aromatic': False, 'stereochemistry': None}, {'bond_order': 2, 'is_aromatic': False, 'stereochemistry': 'E'}, {'bond_order': 1, 'is_aromatic': False, 'stereochemistry': None}, {'bond_order': 1, 'is_aromatic': False, 'stereochemistry': None}, {'bond_order': 1, 'is_aromatic': False, 'stereochemistry': None}, {'bond_order': 1, 'is_aromatic': False, 'stereochemistry': None}, {'bond_order': 1, 'is_aromatic': False, 'stereochemistry': None}, {'bond_order': 1, 'is_aromatic': False, 'stereochemistry': None}, {'bond_order': 1, 'is_aromatic': False, 'stereochemistry': None}, {'bond_order': 1, 'is_aromatic': False, 'stereochemistry': None}, {'bond_order': 1, 'is_aromatic': False, 'stereochemistry': None}]

I have no idea how write_smiles works tho and how it could be modified to add chiral atoms.

Interestingly in openforcefield documentation they also desire a from_graph function.

https://open-forcefield-toolkit.readthedocs.io/en/topology/api/generated/openforcefield.topology.Molecule.html#openforcefield.topology.Molecule.to_networkx

pckroon commented 3 years ago

I haven't had time to look into this further unfortunately :(

Adding an attribute to the atoms/bonds is easy, but that's only part of the story. We also need to define the meaning/semantics of this attribute, and make sure we can read and write this correctly in all possible edge cases.

The writeup above tried to identify the pros/cons of using @/@@ vs R/S, as well as edge cases. I do need more input on whether these are complete, and on whether @/@@ or R/S would be more convenient.