Closed siboehm closed 2 years ago
Hi Simon,
thanks for pointing this out. I did not expect rdkit to have this circular behaviour, and I always only did canonicalisation with the above method - I will look into this and get back to you.
The conclusion we came to for now: We will perform a single Call to rdkit.CanonSmiles
for each smiles in our dataset. I will generate a lincs_trapnell_smiles.csv
, which will contain all SMILES in these datasets, transformed by a single call to rdkit.CanonSmiles
. This csv will also serve as the index of the dataframes generated by our various embeddings.
This is somewhat unsatisfying and there is probably a cleaner solution. In any case we don't want to discard the information about isomers (chirality) altogether, since it may be important and appears frequently in the dataset.
Look what I found in the rdkit source code: 😅
So it's doing exactly what I had always been doing. It's fine, I think generating a unifying csv is still a good approach.
I'm having problems with getting canonicalization to work.
Results in:
As this is a loop, we can never get to a canoncial representation. Similar things happen if I use
rdkit.Chem.MolToSmiles(rdkit.Chem.MolFromSmiles(smiles), canonical=True)
.What's the correct way to canonicalize a molecule? @MxMstrmn @M0hammadL
This leads to all kinds of indexing errors when trying to encode molecules via the GROVER embedding. I can make it work without canonicalizing, but it seems like this should be a solveable problem.