mcs07 / PubChemPy

Python wrapper for the PubChem PUG REST API.
http://pubchempy.readthedocs.io
MIT License

How to get the PubChem similarities between two compounds? #8

Closed by beyondpie 8 years ago

beyondpie commented 8 years ago

Hi, nice to find this package. I currently have multiple compounds in SDF format (they are from the ZINC database). Is it possible to use their SDF files to get their PubChem similarities? Thanks, Songpeng

mcs07 commented 8 years ago

If you want similarities between compounds in an SDF file, I would recommend generating fingerprints and calculating similarities locally using RDKit (or OpenBabel, CDK, etc.). Something like:

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

mols = Chem.SDMolSupplier('myfile.sdf')
fp1 = AllChem.GetMorganFingerprint(mols[0], 2)
fp2 = AllChem.GetMorganFingerprint(mols[1], 2)
DataStructs.TanimotoSimilarity(fp1, fp2)

But if you specifically want to use PubChem fingerprints you can do something like this with PubChemPy:

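def tanimoto(compound1, compound2):
    """Tanimoto similarity of two PubChemPy Compounds, using PubChem fingerprints."""
    # Compound.fingerprint is a hex string; convert it to an integer bit field
    fp1 = int(compound1.fingerprint, 16)
    fp2 = int(compound2.fingerprint, 16)
    # Count the bits set in each fingerprint and the bits set in both
    fp1_count = bin(fp1).count('1')
    fp2_count = bin(fp2).count('1')
    both_count = bin(fp1 & fp2).count('1')
    # Shared bits divided by bits set in either fingerprint
    return float(both_count) / (fp1_count + fp2_count - both_count)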

I added a more complete example here: https://github.com/mcs07/PubChemPy/blob/master/examples/Chemical%20fingerprints%20and%20similarity.ipynb
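
To make the arithmetic in the tanimoto function above easier to check by hand, here is a minimal sketch of the same calculation on plain hex strings, so it runs without a network connection or a Compound object. The name tanimoto_hex and the short toy fingerprints are hypothetical, chosen only for illustration, and returning 0.0 when neither fingerprint has any bits set is a convention assumed here, not something PubChemPy defines.

```python
def tanimoto_hex(hex1, hex2):
    """Tanimoto similarity of two hex-encoded bit fingerprints.

    Hypothetical helper for illustration; it mirrors the bit counting
    used in the tanimoto function above.
    """
    fp1 = int(hex1, 16)
    fp2 = int(hex2, 16)
    # Bits set in each fingerprint, and bits set in both
    fp1_count = bin(fp1).count('1')
    fp2_count = bin(fp2).count('1')
    both_count = bin(fp1 & fp2).count('1')
    # Bits set in at least one fingerprint (size of the union)
    either_count = fp1_count + fp2_count - both_count
    if either_count == 0:
        # Assumed convention: two empty fingerprints score 0.0
        return 0.0
    return both_count / either_count


# Toy example: 0xF0 = 11110000 and 0x3C = 00111100 share 2 bits,
# and 6 bits are set in at least one of them, so the result is 2/6.
similarity = tanimoto_hex('F0', '3C')
```

With PubChemPy, the same helper could presumably be called as tanimoto_hex(c1.fingerprint, c2.fingerprint), where c1 and c2 are Compound objects retrieved beforehand.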

beyondpie commented 8 years ago

Great! Yes, I also use RDKit. For this part, I only want the PubChem similarities. Now I see: with compound.fingerprint in your package, I can get not only the similarities but also the PubChem fingerprints themselves. Thanks a lot! Songpeng