dswigh closed 1 year ago
I think this would make sense! So just to confirm, you'd want an option in resolve_identifiers
that ignores stereochemistry differences?
Yea! I used the following in my own code:
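import pandas as pd
from rdkit import Chem

# Canonicalise and remove stereochemistry
def clean_smiles(smiles):
    if pd.isna(smiles):
        return smiles
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparsable SMILES: leave the value unchanged
        return smiles
    # isomericSmiles=False is what strips away the stereo info
    return Chem.MolToSmiles(mol, isomericSmiles=False)

# Apply the function to every cell in the DataFrame
# (applymap is deprecated since pandas 2.1; DataFrame.map is the replacement)
df = pura_solvents.applymap(clean_smiles)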
I haven't fully investigated which services/conditions cause a SMILES string to contain stereochemistry, omit it, or 'explicitly be ambiguous' (i.e., having the crossed bond).
Didn't realise the indentation would be removed by markdown... hopefully it's self-evident how the indentation should be!
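On the open question of which services return stereo information, a quick lexical audit could be a starting point. This is only a minimal sketch: `stereo_markers` and `audit` are hypothetical helper names, not part of Pura, and the input dict mirrors the PubChem/CIR example from this thread. It only looks at the string, so it cannot tell whether a molecule has a double bond or stereocentre that was simply left unspecified; that would need a toolkit such as RDKit.

```python
def stereo_markers(smiles):
    """Report which stereo markers appear in a SMILES string (lexical check only)."""
    return {
        # '/' and '\' are directional bonds encoding E/Z double-bond geometry.
        "double_bond": "/" in smiles or "\\" in smiles,
        # '@' / '@@' inside bracket atoms encode tetrahedral chirality.
        "tetrahedral": "@" in smiles,
    }


def audit(results):
    """Map each service to True if any SMILES it returned carries stereo markers."""
    return {
        service: any(any(stereo_markers(s).values()) for s in smiles_list)
        for service, smiles_list in results.items()
    }


# Hypothetical input: service name -> list of SMILES it resolved to.
report = audit({"pubchem": ["C/C=C/C#N"], "cir": ["CC=CC#N"]})
print(report)
```

Running this over a larger set of names would show, per service, how often stereo markers are present at all.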
Following up on the discussion we had in person. The behavior in the original post is actually expected, since (e)-2-butenenitrile should resolve to C/C=C/C#N (i.e., CIR was wrong).
We would therefore want the consensus algorithm to treat these two SMILES as different and report that there is not sufficient agreement.
Given the molecule: (e)-2-butenenitrile
PubChem will resolve to: ['C/C=C/C#N']
CIR will resolve to: ['CC=CC#N']
These two are (almost) the same SMILES strings, but Pura says they don't agree because one specifies the stereochemistry, while the other doesn't.
Perhaps a 'drop stereochemical information' arg would be a solution?
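To make the proposal concrete, here is a rough sketch of how such an argument could behave in an agreement check. Everything in it is an assumption for illustration: `consensus`, `strip_stereo`, `min_agreement` and `ignore_stereo` are hypothetical names, not Pura's actual API. The stereo stripping is a naive string-level stand-in; a real implementation would more likely canonicalise with RDKit (`Chem.MolToSmiles(mol, isomericSmiles=False)`), since deleting characters does not canonicalise (e.g. `[C@H]` becomes `[CH]`, which will not string-match a plain `C`).

```python
import re
from collections import Counter

# SMILES stereo markers: directional bonds (/ and \) and
# tetrahedral chirality tags (@, @@) inside bracket atoms.
_STEREO_MARKERS = re.compile(r"[/\\@]")


def strip_stereo(smiles):
    """Naive, string-level removal of stereo markers (illustration only)."""
    return _STEREO_MARKERS.sub("", smiles)


def consensus(results, min_agreement=2, ignore_stereo=False):
    """Return the SMILES that at least `min_agreement` services agree on, else None.

    `results` maps a service name to the list of SMILES it returned.
    """
    votes = Counter()
    for smiles_list in results.values():
        # Each service votes once per distinct (optionally stereo-stripped) SMILES.
        keys = {strip_stereo(s) if ignore_stereo else s for s in smiles_list}
        votes.update(keys)
    if not votes:
        return None
    best, count = votes.most_common(1)[0]
    return best if count >= min_agreement else None


results = {"pubchem": ["C/C=C/C#N"], "cir": ["CC=CC#N"]}
strict = consensus(results)                       # stereo must match
relaxed = consensus(results, ignore_stereo=True)  # stereo ignored
print(strict, relaxed)
```

With the default (strict) setting the two services do not agree, so no consensus is returned; with `ignore_stereo=True` both collapse to the same string. Keeping the strict behaviour as the default means stereochemistry is only discarded when the caller explicitly asks for it.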