non relevant output smiles

erbb2 commented 2 years ago

Hi,

I was trying to train a model on the ChEMBL database, and training completed without any issues. But when I run optimization with the trained model, the generated SMILES seem unrelated to the inputs. For example, suppose valid.txt contains these SMILES: O=S(=O)(c1cccc2cnccc12)N1CCCNCC1 Cc1ccc(NC(=O)c2ccc(CN3CCN(C)CC3)cc2)cc1Nc1nccc(-c2cccnc2)n1 CO[C@H]1C[C@@H]2CCC@@H C@@(O2)C(=O)C(=O)N2CCCC[C@H]2C(=O)OC@HCC(=O)C@H/C=C(\C)C@@H C@@HC(=O)C@HCC@H/C=C/C=C/C=C/1C

the output that generates using optimize.py code is CCCCCCOc1ccc(-c2ccnnc2O)c2c1C1C(=O)CC(O)C(=O)C1N2 CC(CCNCc1ccc(O)cc1)(CCNCc1ccc(O)cc1)CPHOCC1CCCO1 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CNCCCNCCCCCNCCCCCN1CCC(C)(CCCCOc2ccc(Cn3ccnn3)cc2CN)CC1 CC(CCNCc1ccc(O)cc1)(CCNCc1ccc(O)cc1)CP(=O)(OCCOc1non+c1O)OCCN1C=CC(O)=NC1 CN CCc1ccc(Cn2nc(CNc3ccc(CC)cc3)cc2O)cc1 CN

As you can see, some of the SMILES are just long chains of C's and others are a short CN. Any idea why this happens? What are the possible causes? Could it be the large dataset, since ChEMBL has 1.2M entries? Also, have you tried the ChEMBL dataset yourself?
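To put a number on how many outputs look like this, a small check along the following lines could help. It is only a sketch: the function name `is_degenerate` and the thresholds `min_len` and `max_run` are made up for illustration and are not part of Modof, and the `"C" * 50` entry is a stand-in for a long carbon chain, not one of the exact strings above.

```python
def is_degenerate(smiles, min_len=3, max_run=20):
    """Flag outputs that are very short or contain a long run of one character.

    min_len and max_run are arbitrary illustrative thresholds.
    """
    if len(smiles) < min_len:
        return True
    run = 1
    for prev, cur in zip(smiles, smiles[1:]):
        # Extend the current run if the character repeats, otherwise restart it.
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return True
    return False


# Two strings of the kind shown above, plus a stand-in for a long carbon chain.
outputs = ["CN", "CCc1ccc(Cn2nc(CNc3ccc(CC)cc3)cc2O)cc1", "C" * 50]
flagged = [s for s in outputs if is_degenerate(s)]
```

This only catches the two failure patterns described here (very short strings and long single-character runs); it does not check chemical validity.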

Your help would be really appreciated. Thank you

ziqi92 commented 2 years ago

Thanks for your interest in our model!

Could you please provide us with more information about the task and the dataset you used, such as the number of pairs, the similarity you chose, and the property you want to optimize?

I will also double-check whether I uploaded my scripts correctly.

Best, Ziqi

erbb2 commented 2 years ago

Hi,

I realized that I only face this issue when using the large dataset from ChEMBL. The drd2 dataset works really well. As I can see from your paper, the ChEMBL dataset was used. So, just for testing purposes, could you share the ChEMBL dataset (test.txt) and its pairs file (train_pairs.txt)? I want to check the pair combinations I generated against yours. It would be really helpful if you could provide them.

Thank you

ziqi92 commented 2 years ago

Hi,

We did use the ChEMBL dataset to build the training data for drd2 and QED. But we did not directly use all the molecule pairs extracted from ChEMBL. This is because our model is for molecule optimization: you have to specify the property to be optimized, and your training data must contain the patterns related to that property. If you use the ChEMBL dataset directly, the training pairs will not contain any meaningful pattern, so the trained model can produce weird results.
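As a rough illustration of what "pairs that contain a property-related pattern" could mean, the sketch below keeps only pairs that are similar enough and in which the second molecule improves the property by some margin. Everything in it is an assumption for illustration: the function `select_pairs`, the thresholds `min_sim` and `min_gain`, the placeholder molecule names, and the toy scores are not the actual Modof preprocessing or its settings.

```python
def select_pairs(candidate_pairs, prop, sim, min_sim=0.6, min_gain=0.2):
    """Keep (x, y) pairs that are similar and where y improves the property.

    prop maps a molecule to its property score; sim maps a pair to its
    similarity. Both are assumed to be precomputed elsewhere.
    """
    kept = []
    for x, y in candidate_pairs:
        similar_enough = sim[(x, y)] >= min_sim
        improved = prop[y] - prop[x] >= min_gain
        if similar_enough and improved:
            kept.append((x, y))
    return kept


# Toy data with placeholder names and made-up scores.
prop = {"mol_a": 0.30, "mol_b": 0.65, "mol_c": 0.35}
sim = {
    ("mol_a", "mol_b"): 0.7,  # similar, clear improvement
    ("mol_a", "mol_c"): 0.8,  # similar, but almost no improvement
    ("mol_b", "mol_c"): 0.2,  # not similar
}
training_pairs = select_pairs(list(sim), prop, sim)
```

Pairs taken from ChEMBL without a filter of this kind have no consistent direction of improvement, which is the situation described above.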

Best, Ziqi

erbb2 commented 2 years ago

Hi,

Thank you for the reply. That clears up why I am facing such issues with the molecules newly generated by the trained model.

ninglab / Modof

non relevant output smiles #2