kexinhuang12345 / DeepPurpose

A Deep Learning Toolkit for DTI, Drug Property, PPI, DDI, Protein Function Prediction (Bioinformatics)
https://doi.org/10.1093/bioinformatics/btaa1005
BSD 3-Clause "New" or "Revised" License

how to apply model #122

Closed sailseem closed 3 years ago

sailseem commented 3 years ago

Dear Dr. Huang, thank you for providing such a great package. I am quite new to all of this. Assume that I have a well-trained drug-target binding model based on the BindingDB database. I can load, say, 100 drug SMILES using dataset.read_file_target_sequence (drugs.txt) and, likewise, 100 targets using the same dataset.read_file_target_sequence (targets.txt). But how do I apply the model to get a binding score for each pair?
I tried something like "model.predict(drug,targets)", but this returns TypeError: predict() takes 2 positional arguments but 3 were given

I am sorry to take up your time with such a silly question. Thanks a lot anyway. Best, fan

kexinhuang12345 commented 3 years ago

Hi, it needs to be fed into the data_process function first:

data = data_process(X_drug, X_target, y,
                    drug_encoding, target_encoding,
                    split_method='no_split')

kexinhuang12345 commented 3 years ago

You can also use the one-liner mode if you are only interested in getting prediction results: https://github.com/kexinhuang12345/DeepPurpose/blob/master/DEMO/case-study-II-Virtual-Screening-for-BindingDB-IC50.ipynb

sailseem commented 3 years ago

This is confusing, because we don't have a y value. How do we run data_process when y is a required input? That is what we want to know.

kexinhuang12345 commented 3 years ago

Oh sorry, my bad. Yes, in this case you should either use the one-liner mode or load a pretrained model and then call _ = DTI.virtual_screening(drug, target, model, drug_name, target_name), where drug and target can each be a string or a list of strings (SMILES and target sequence, respectively).

sailseem commented 3 years ago

All right, but let's say I have 3 drugs and one protein:

drug3 = ['Cc1cnc2c(NCCN)nc3ccc(C)cc3n12',
         'Oc1cccc(c1)-c1nc(N2CCOCC2)c2oc3ncccc3c2n1',
         'CC1(C)CNc2cc(NC(=O)c3cccnc3NCc3ccncc3)ccc12',
         'OC[C@H]1OC@@H[C@H]2O)C@HC@@H[C@H]1O']
drug_name = ['no1', 'no2', 'no3']
target = ['MLGRNTWKTSAFSFLVEQMWAPLWSRSMRPGRWCSQRSCAWQTSNNTLHPLWTVPVSVPGGTRQSPINIQWRDSVYDPQLKPLRVSYEAASCLYIWNTGYLFQVEFDDATEASGISGGPLENHYRLKQFHFHWGAVNEGGSEHTVDGHAYPAELHLVHWNSVKYQNYKEAVVGENGLAVIGVFLKLGAHHQTLQRLVDILPEIKHKDARAAMRPFDPSTLLPTCWDYWTYAGSLTTPPLTESVTWIIQKEPVEVAPSQLSAFRTLLFSALGEEEKMMVNNYRPLQPLMNRKVWASFQATNEGTRS']
target_name = ['protein']

This call returns the binding score for only one pair:

_ = DTI.virtual_screening(drug3, target, model, drug_name, target_name)

virtual screening...
Drug Target Interaction Prediction Mode...
in total: 4 drug-target pairs
encoding drug...
unique drugs: 4
encoding protein...
unique target sequence: 1
Done.
predicting...

Virtual Screening Result
+------+-----------+-------------+---------------+
| Rank | Drug Name | Target Name | Binding Score |
+------+-----------+-------------+---------------+
|  1   |    no1    |   protein   |     7.51      |
+------+-----------+-------------+---------------+
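
The log above reports 4 drugs and 1 unique target sequence, while the result table shows a single row. One plausible reading is that the inputs are paired index by index, so lists of unequal length silently truncate the output. The following is a minimal sketch of a check that could be run before screening; `check_screening_inputs` is a hypothetical helper written for illustration, not part of DeepPurpose, and the input strings are placeholders.

```python
def check_screening_inputs(drugs, targets, drug_names, target_names):
    """Raise ValueError unless all four paired lists have the same length.

    Hypothetical helper: it assumes the screening call pairs
    drugs[i] with targets[i] (and likewise for the name lists).
    """
    lengths = {
        "drugs": len(drugs),
        "targets": len(targets),
        "drug_names": len(drug_names),
        "target_names": len(target_names),
    }
    if len(set(lengths.values())) != 1:
        raise ValueError(f"paired inputs differ in length: {lengths}")
    return lengths["drugs"]


# Placeholder inputs shaped like the ones in the post:
# four drug strings, three names, one target, one target name.
try:
    check_screening_inputs(
        ["drugA", "drugB", "drugC", "drugD"],
        ["SEQ"],
        ["no1", "no2", "no3"],
        ["protein"],
    )
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

With inputs shaped like those in the post, the check raises, which points at the list lengths before any model is loaded.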

sailseem commented 3 years ago

And the funny thing is that no matter what I input as the drug SMILES, it always returns 7.51; the score seems to depend only on the protein sequence.

kexinhuang12345 commented 3 years ago

For virtual screening mode, you should pass the target as a list containing the same protein sequence three times (once per drug), instead of just once. So:

['MLGRNTWKTSAFSFLVEQMWAPLWSRSMRPGRWCSQRSCAWQTSNNTLHPLWTVPVSVPGGTRQSPINIQWRDSVYDPQLKPLRVSYEAASCLYIWNTGYLFQVEFDDATEASGISGGPLENHYRLKQFHFHWGAVNEGGSEHTVDGHAYPAELHLVHWNSVKYQNYKEAVVGENGLAVIGVFLKLGAHHQTLQRLVDILPEIKHKDARAAMRPFDPSTLLPTCWDYWTYAGSLTTPPLTESVTWIIQKEPVEVAPSQLSAFRTLLFSALGEEEKMMVNNYRPLQPLMNRKVWASFQATNEGTRS'] * 3

Alternatively, use repurposing mode.
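
The advice to repeat the target once per drug generalizes to the original question of scoring every drug against every target. Below is a minimal sketch, assuming the screening function expects two equal-length lists paired by index; `expand_pairs` is a hypothetical helper, and the short strings are placeholders rather than real SMILES or protein sequences from this thread.

```python
from itertools import product


def expand_pairs(drugs, targets):
    """Return two equal-length lists covering every (drug, target) combination.

    Hypothetical helper: element i of the first list is meant to be
    scored against element i of the second list.
    """
    pairs = list(product(drugs, targets))
    paired_drugs = [drug for drug, _ in pairs]
    paired_targets = [target for _, target in pairs]
    return paired_drugs, paired_targets


# Placeholder inputs: three drugs, one target.
drugs = ["CCO", "CCN", "CCC"]
targets = ["MLGRNTWK"]

paired_drugs, paired_targets = expand_pairs(drugs, targets)
# With a single target this is equivalent to targets * len(drugs).
```

The name lists can be expanded the same way, so that each row of the result table is labeled with the correct drug and target.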

sailseem commented 3 years ago

Thanks, one more question. You chose the Kd values from BindingDB as the training input, right? Normally, a lower Kd means a higher affinity. What is the meaning of the binding score? Is it the same as Kd, or something different? What is the normal range of the binding score? How large or small should it be to count as a good binding affinity? Thanks.

kexinhuang12345 commented 3 years ago

Hi, yes, it all depends on the training data. If the training data is Kd, then the predicted values are in Kd. I think in the one-liner mode it is Kd, so the lower the better. You can also transform it to pKd by setting convert_y. Note that there are also a couple of models trained on IC50.
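
For reference, the relationship between Kd and pKd mentioned above can be sketched as follows. This uses the standard definition pKd = -log10(Kd in mol/L), which for nanomolar input becomes 9 - log10(Kd in nM). The function names are hypothetical, and this is not necessarily the exact transform DeepPurpose applies internally; whether a given pretrained model reports raw Kd or a p-scale value is not settled in this thread.

```python
import math


def kd_nm_to_pkd(kd_nm):
    """Convert a Kd in nanomolar to pKd = -log10(Kd in molar)."""
    if kd_nm <= 0:
        raise ValueError("Kd must be positive")
    return 9.0 - math.log10(kd_nm)


def pkd_to_kd_nm(pkd):
    """Convert a pKd back to a Kd in nanomolar."""
    return 10.0 ** (9.0 - pkd)


# A 1 nM binder has pKd 9; a 1000 nM (1 uM) binder has pKd 6.
strong = kd_nm_to_pkd(1.0)
weak = kd_nm_to_pkd(1000.0)
```

The direction flips between the two scales: on raw Kd a lower value means tighter binding, while on the p-scale a higher value does.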

sailseem commented 3 years ago

Sorry to re-confirm this: when using a pretrained model in virtual screening, the binding scores are ranked from high to low, which suggests that the top score corresponds to the best affinity. Why did you choose to present the data that way?