theislab / chemCPA

Code for "Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution", NeurIPS 2022.
https://arxiv.org/abs/2204.13545
MIT License
99 stars 24 forks source link

Figure out canoncialization #30

Closed siboehm closed 2 years ago

siboehm commented 2 years ago

I'm having problems with getting canonicalization to work.

print(Chem.CanonSmiles("N[C@]12C[C@H]3C[C@@H](C[C@H](C3)C1)C2.N[C@]12C[C@H]3C[C@@H](C[C@H](C3)C1)C2"))
print(Chem.CanonSmiles(Chem.CanonSmiles("N[C@]12C[C@H]3C[C@@H](C[C@H](C3)C1)C2.N[C@]12C[C@H]3C[C@@H](C[C@H](C3)C1)C2")))
print(Chem.CanonSmiles(Chem.CanonSmiles(Chem.CanonSmiles("N[C@]12C[C@H]3C[C@@H](C[C@H](C3)C1)C2.N[C@]12C[C@H]3C[C@@H](C[C@H](C3)C1)C2"))))

Results in:

N[C@]12C[C@@H]3C[C@H](C[C@H](C3)C1)C2.N[C@]12C[C@@H]3C[C@H](C[C@H](C3)C1)C2
N[C@]12C[C@H]3C[C@@H](C[C@H](C3)C1)C2.N[C@]12C[C@H]3C[C@@H](C[C@H](C3)C1)C2
N[C@]12C[C@@H]3C[C@H](C[C@H](C3)C1)C2.N[C@]12C[C@@H]3C[C@H](C[C@H](C3)C1)C2

As this is a loop, we can never get to a canoncial representation. Similar things happen if I use rdkit.Chem.MolToSmiles(rdkit.Chem.MolFromSmiles(smiles), canonical=True).

What's the correct way to canonicalize a molecule? @MxMstrmn @M0hammadL

This leads to all kinds of indexing errors when trying to encode molecules via the GROVER embedding. I can make it work without canonicalizing, but it seems like this should be a solveable problem.

MxMstrmn commented 2 years ago

Hi Simon,

thanks for pointing this out. I did not expect rdkit to have this circular behaviour, and I always only did canonicalisation with the above method - I will look into this and get back to you.

siboehm commented 2 years ago

The conclusion we came to for now: We will perform a single Call to rdkit.CanonSmiles for each smiles in our dataset. I will generate a lincs_trapnell_smiles.csv, which will contain all SMILES in these datasets, transformed by a single call to rdkit.CanonSmiles. This csv will also serve as the index of the dataframes generated by our various embeddings.

This is somewhat unsatisfying and there is probably a cleaner solution. In any case we don't want to discard the information about isomers (chirality) altogether, since it may be important and appears frequently in the dataset.

siboehm commented 2 years ago

Look what I found in the rdkit source code: 😅

image https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/rdkit/Chem/__init__.py#L55-L57

So it's doing exactly what I had always been doing. It's fine, I think generating a unifying csv is still a good approach.