ninglab / Modof

The implementation of Modof for Molecule Optimization

How to generate pairs? #1

Open erbb2 opened 2 years ago

erbb2 commented 2 years ago

First of all, thank you so much for the code! I am really excited to try the model. I have a dataset of SMILES that I would like to use with the model for my study. I was curious how you generated the train_pairs.txt file. I came across that the pairs need to be generated using chemfp, but I am not sure how to obtain them. It would be really helpful if you could provide the exact command line for this. Similarly, how do we generate the valid.txt file, which is also an input? I really appreciate your time!

Thank you

ziqi92 commented 2 years ago

Thank you for your interest in our model!

The generation of "train_pairs.txt" requires two steps:

(1) Find similar pairs. Note that Modof requires similar pairs of molecules so that it can capture the difference between them, so we must first identify similar molecule pairs in the dataset.

To get similar molecule pairs, we first generated the fingerprints of all the molecules. You can use the command "rdkit2fps" provided by Chemfp to quickly calculate the fingerprints. You can check the Chemfp documentation for more details: https://chemfp.readthedocs.io/en/latest/using-tools.html#pubchem-fingerprints.

After getting the fingerprints, I used the function below to calculate the similarity between pairs of molecules:

import chemfp

# `fingerprints` is a pair of fingerprint sets (e.g., two chemfp fingerprint arenas);
# fp1 holds the query fingerprints and fp2 the target fingerprints.
def get_pairs(fingerprints, threshold=0.0):
    fp1, fp2 = fingerprints
    pairs = []
    # For each query, find all targets with Tanimoto similarity above the threshold.
    for (query_id, hits) in chemfp.threshold_tanimoto_search(fp1, fp2, threshold=threshold):
        if len(hits) == 0:
            continue
        # Keep (query_id, target_id, score) triples, skipping self-matches.
        tmp = [(query_id, sim[0], sim[1]) for sim in hits.get_ids_and_scores()
               if sim[0] != query_id]
        pairs.extend(tmp)
    return pairs

You can split your dataset into multiple batches and run the above function in a parallel way to accelerate the calculation.
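As an illustration, below is a minimal sketch of how the function above might be driven, assuming chemfp 1.x; the .fps file names and the threshold value are placeholders for your own batches:

import chemfp

# Placeholder file names: each .fps file holds the fingerprints of one batch,
# generated beforehand with rdkit2fps.
queries = chemfp.load_fingerprints("batch_0.fps")
targets = chemfp.load_fingerprints("batch_1.fps")

# Collect (query_id, target_id, similarity) triples above a placeholder threshold.
pairs = get_pairs((queries, targets), threshold=0.6)
print(len(pairs), "similar pairs found")

Each pair of batches can then be handled by a separate job, which makes the parallel strategy above straightforward.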

(2) identify the fragment difference. After getting similar pairs of molecules, you can use the file "preprocess.py" under the directory "data_preprocessing" to select pairs of molecules with only one fragment difference and get the edit paths and the tree representations between these pairs of molecules.

As for the "valid.txt", you can use any molecules in the dataset with poor property values, as long as these molecules are not included in the training data and test data. The "valid.txt" files I used should be provided by Jin. You can check their dataset here: https://github.com/wengong-jin/iclr19-graph2graph/tree/master/data.

Please let me know if you have any other questions.

Best, Ziqi

erbb2 commented 2 years ago

Thank you for the elaborate discussion.

Before I jump into the steps you mentioned, I want to be clear on whether I need to produce fingerprint pairs while using rdkit2fps. Will the simple one-line command mentioned here work? rdkit2fps --morgan model1.sdf.gz -o model1.fps.gz

Or do I need to pass the --pairs option to rdkit2fps? Could you please comment on this?

Thank you so much!

ziqi92 commented 2 years ago

If I remember correctly, you don't need to use the "--pairs" option.

The above one-line command you mentioned should be enough.

erbb2 commented 2 years ago

Hi,

If you don't mind, could you please provide a complete Python script that reads the molecules, computes the similarity between pairs, and writes the output? I have been messing up and getting errors.

Thank you

erbb2 commented 2 years ago

Hi,

Since I was getting quite a few errors while using chemfp, I tried computing the fingerprints and similarities using RDKit only. Below is the code; can you please say whether this is the right way?

from rdkit import Chem, DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
import pandas as pd

file_name = 'all.txt'

# read the SMILES strings, one per line
with open(file_name, "r") as ins:
    smiles = [line.strip() for line in ins if line.strip()]
print('# of SMILES:', len(smiles))

mols = [Chem.MolFromSmiles(smi) for smi in smiles]

fps = [FingerprintMols.FingerprintMol(x) for x in mols]
print("# Number of fingerprints:", len(fps))

qu, ta, sim = [], [], []

# compare all fingerprints pairwise without duplicates
for n in range(len(fps) - 1):  # -1 so the last fp is not used as a query
    # compare fps[n] with every later fingerprint
    s = DataStructs.BulkTanimotoSimilarity(fps[n], fps[n + 1:])
    # collect the SMILES and similarity values
    for m in range(len(s)):
        qu.append(smiles[n])
        ta.append(smiles[n + 1:][m])
        sim.append(s[m])

# build the dataframe and sort it
d = {'query': qu, 'target': ta, 'Similarity': sim}
df_final = pd.DataFrame(data=d)
df_final = df_final.sort_values('Similarity', ascending=False)
print(df_final)

# save the pairs
df_final.to_csv('all_pairs.txt', index=False, sep=' ')

I am also attaching the input file test.txt and the output file test_pairs.txt. Please suggest whether this is the right way!

Thank you so much test.txt test_pairs.txt

ninglab commented 2 years ago

Hi,

We are publishing all the code for data preprocessing and all the related information soon. Hope that will help.

Thank you for your interest in our work!

Thanks, Xia


ziqi92 commented 2 years ago


Hi,

Sorry for my late reply. I just uploaded my scripts to GitHub. You can check them under the data_preprocessing directory.

Note that I used chemfp 1.1p1 with Python 2.7, so you may need a separate environment in order to use my scripts.

I will upload more detailed instructions later once I'm available. Please let me know if you have any other questions.

Best, Ziqi

erbb2 commented 2 years ago

Hi,

Thank you for the scripts. get_similarity.py gives a pickle file for the generated pairs. When I try to do "with open(output_path, 'rb') as h: out = pickle.load(h)", it gives integer values (basically the serial numbers of the two SMILES being compared and their similarity score?) and not SMILES. How can I get the SMILES of the pairs? Can you comment on this if possible? It would be really helpful.

Thank you

ziqi92 commented 2 years ago

Hi,

Please take a look at the comments I wrote at the beginning of the file "get_similarity.py". In the file you used to generate the fingerprints, each SMILES string is associated with an index; you can look up the SMILES corresponding to each index in that file.
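For example, here is a minimal sketch of how the indices might be mapped back to SMILES, assuming the pickled output is a list of (query_index, target_index, similarity) tuples and that the indices follow the line order of the SMILES file used to generate the fingerprints (the file names below are placeholders):

import pickle

# Placeholder file names: "all.txt" is the SMILES list used to generate the
# fingerprints (one SMILES per line); "pairs.pkl" is the output of get_similarity.py.
with open("all.txt") as f:
    smiles = [line.strip() for line in f if line.strip()]

with open("pairs.pkl", "rb") as h:
    pairs = pickle.load(h)

# Assumed entry format: (query_index, target_index, similarity).
smiles_pairs = [(smiles[q], smiles[t], score) for q, t, score in pairs]
print(smiles_pairs[:5])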

Sorry for the late reply. Please let me know if you have any other questions.

Best, Ziqi