sustainable-processes / pura

Clean chemical data quickly
MIT License

Disagreement due to stereochemical SMILES #45

Closed dswigh closed 1 year ago

dswigh commented 1 year ago

Given the molecule: (e)-2-butenenitrile
PubChem will resolve to: ['C/C=C/C#N']
CIR will resolve to: ['CC=CC#N']

These two are (almost) the same SMILES strings, but Pura says they don't agree because one specifies the stereochemistry, while the other doesn't.

Perhaps a 'drop stereochemical information' arg would be a solution?
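To make the idea concrete, here is a minimal sketch of what such an option could look like. Everything in it is an assumption for illustration: `strip_stereo`, `smiles_agree`, and the `ignore_stereo` flag are hypothetical names, not part of Pura. The stripping is a crude string-level stand-in that only deletes the SMILES stereo markers and does no canonicalisation, so a toolkit such as RDKit would be the safer choice for real data.

```python
import re

# Hypothetical helper, not part of pura: crude, string-level removal of
# SMILES stereo markers. '/' and '\' encode double-bond geometry and '@'
# encodes tetrahedral chirality. No canonicalisation is performed.
_STEREO_MARKERS = re.compile(r"[/\\@]")


def strip_stereo(smiles: str) -> str:
    """Return the SMILES string with stereo markers deleted."""
    return _STEREO_MARKERS.sub("", smiles)


def smiles_agree(a: str, b: str, ignore_stereo: bool = False) -> bool:
    """Compare two SMILES strings, optionally ignoring stereo markers."""
    if ignore_stereo:
        a, b = strip_stereo(a), strip_stereo(b)
    return a == b


pubchem = "C/C=C/C#N"  # PubChem result quoted above
cir = "CC=CC#N"        # CIR result quoted above

strict = smiles_agree(pubchem, cir)                        # stereo-aware
relaxed = smiles_agree(pubchem, cir, ignore_stereo=True)   # stereo dropped
```

With the flag off the two strings are compared as-is; with it on, both are reduced to the same stereo-free string before comparison.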

marcosfelt commented 1 year ago

I think this would make sense! So just to confirm, you'd want an option in resolve_identifiers that ignores stereochemistry differences?

dswigh commented 1 year ago

Yea! I used the following in my own code:

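import pandas as pd
from rdkit import Chem

# Canonicalise and remove stereochemistry
def clean_smiles(smiles):
    # Pass missing values through unchanged
    if pd.isna(smiles):
        return smiles
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None for SMILES it cannot parse
        return smiles
    # isomericSmiles=False is what strips away the stereo info
    return Chem.MolToSmiles(mol, isomericSmiles=False)

# Apply the function to every cell of the DataFrame
# (pandas >= 2.1 deprecates applymap in favour of DataFrame.map)
df = pura_solvents.applymap(clean_smiles)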

I haven't fully investigated which services or conditions cause a SMILES string to contain stereochemistry, omit it, or mark it as explicitly ambiguous (i.e., with a crossed bond).

dswigh commented 1 year ago

Didn't realise the indentation would be removed by markdown... hopefully it's self-evident how the indentation should be!

marcosfelt commented 1 year ago

Following up on the discussion we had in person: the behavior in the original post is actually expected, since (e)-2-butenenitrile should resolve to C/C=C/C#N (i.e., CIR was wrong). We therefore want the consensus algorithm to treat these two SMILES as different and report that there is not sufficient agreement.
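The reasoning above can be illustrated with a small voting sketch. This is not Pura's actual consensus algorithm; the `consensus` function, its `min_agreement` parameter, and the service names used as dictionary keys are hypothetical, chosen only to show why a stereo-aware comparison of the two results yields no agreement.

```python
from collections import Counter
from typing import Dict, List, Optional


def consensus(results: Dict[str, List[str]], min_agreement: int = 2) -> Optional[str]:
    """Return the SMILES reported by at least `min_agreement` services, else None."""
    votes = Counter()
    for smiles_list in results.values():
        # Count each distinct SMILES at most once per service
        votes.update(set(smiles_list))
    if not votes:
        return None
    smiles, count = votes.most_common(1)[0]
    return smiles if count >= min_agreement else None


# The two results from the original post, compared as exact strings
results = {"pubchem": ["C/C=C/C#N"], "cir": ["CC=CC#N"]}
outcome = consensus(results)  # no SMILES reaches two votes
```

Because the strings are compared with their stereo markers intact, each SMILES gets a single vote and the function returns `None`, which matches the "not sufficient agreement" outcome described above.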