simonfqy / PADME

This repository contains the source code for my Master's thesis research on predicting drug-target interactions using deep learning.
MIT License
42 stars 16 forks

How to use a trained model to predict new data? #5

Open Running-z opened 5 years ago

Running-z commented 5 years ago

I modified your source code by adding model.save() at the end of training to save the model file. I ran your drive_d.sh file and got the model file model.pickle. Now I have some molecular data as SMILES strings and a protein sequence, and I want to use the trained model to predict the force between these molecules and this protein sequence. How should I prepare my data, and which of your methods should I call? Is this prediction possible? You only provide some bash files, and I don't know how to do this. Please give me some pointers, thank you!

simonfqy commented 5 years ago

Hi, if you want to predict the forces between compound molecules and proteins, you need to train the model on force data. In my implementation, the Davis, Metz, KIBA and NCI60 datasets are all not forces.

PADME can predict the interactions between previously unseen compounds and proteins using an already-trained model. I have just uploaded the bash files drive4_nci60.sh and drive_nci60.sh. Those two files are similar: both use the trained model specified by --model_dir (you can change that model directory as you wish) and predict the interactions between compounds and proteins when you specify the --predict_only flag together with a --csv_out parameter giving the name of the output file.

Note that to predict, you must have a file listing all the compound-protein (drug-target) pairs. Imagine a file similar to restructured.csv in the metz_data/ and davis_data/ folders, from which the program reads the drug-target pairs but not the interaction values. The program then writes the predicted interaction strengths for those drug-target pairs to the output file whose path follows --csv_out.
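As a concrete illustration, such a pairs file can be assembled with a short script. This is only a sketch: the column headers below (smiles, proteinName, protein_dataset) are inferred from the restructured.csv fields discussed later in this thread, and the SMILES and target values are made-up examples, so check both against your copy of restructured.csv before using it.

```python
import csv

# Hypothetical drug-target pairs to score; SMILES strings and the
# JAK2 target name are illustrative placeholders.
pairs = [
    ("CC(=O)Oc1ccccc1C(=O)O", "JAK2", "davis_data"),
    ("CN1CCC[C@H]1c1cccnc1", "JAK2", "davis_data"),
]

# Column names assumed from the restructured.csv layout mentioned in
# this thread; verify against the real file.
with open("my_pairs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["smiles", "proteinName", "protein_dataset"])
    writer.writerows(pairs)
```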

Running-z commented 5 years ago

@simonfqy Ok, thank you, I am going to try to make predictions based on my data, thank you for your guidance.

Running-z commented 5 years ago

@simonfqy Hello, I tried to run your drive_nci60.sh today. Since I don't have the nci60 data, I changed the dataset in drive_nci60.sh to davis, but the program started training instead of predicting. After training finished there was no --csv_out file, so it was training, not predicting.

Then I saw that driver.py does not parse the --predict_only and --csv_out parameters, so I added these two parameters to the driver.py defaults.

But --predict_only and --csv_out still had no effect, and I still couldn't get prediction results. I then hard-coded predict_only and csv_out in the source, passed the saved model file to --model_dir, and got another error:

Traceback (most recent call last):
  File "driver.py", line 674, in <module>
    tf.app.run(main=run_analysis, argv=[sys.argv[0]] + unparsed)
  File "/home/zh/anaconda3/envs/deep2.0.0/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "driver.py", line 266, in run_analysis
    aggregated_tasks=aggregated_tasks)
  File "/project/git2/PADME/dcCustom/molnet/run_benchmark_models.py", line 198, in model_regression
    model.predict(train_dataset, transformers=transformers, csv_out=prediction_file, tasks=tasks)
  File "/project/git2/PADME/dcCustom/models/tensorgraph/tensor_graph.py", line 643, in predict
    self.restore()
  File "/project/git2/PADME/dcCustom/models/tensorgraph/tensor_graph.py", line 1050, in restore
    raise ValueError('No checkpoint found')

But the checkpoint file actually exists.

So, where am I wrong? What should I do to actually load my trained model and predict on my data?

simonfqy commented 5 years ago

Hi, I am very sorry for forgetting to update the files in the GitHub repo; that has caused you trouble. I have now updated the files, including the driver.py script, so you can see that --predict_only and --csv_out are now implemented. Those parameters are passed on to the functions that driver.py calls. When you're predicting, you must have a model directory containing the already-trained model. For example, you can refer to the drive4_d.sh file, which specifies the --model_dir parameter; the model files are automatically stored in this directory during training. You should also specify --model_dir when predicting on new data with that trained model. Note that the drive4_d.sh script actually performs cold-target cross-validation, as described in my paper. You can use drive4_d_warm.sh to conduct warm model training and validation.
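Putting the flags above together, a minimal prediction run might be sketched as follows. The flag names (--model_dir, --predict_only, --csv_out, --model, --prot_desc_path) all appear in this thread; the dataset, model, and path values are placeholders you would replace with your own.

```python
import subprocess

# Sketch of a prediction invocation; flag names come from this thread,
# but the dataset, model, and path values are placeholders.
cmd = [
    "python", "driver.py",
    "--dataset", "davis",                 # which dataset loader to use
    "--model", "graphconvreg",            # model type used during training
    "--model_dir", "./model_dir/davis",   # directory holding the trained checkpoint
    "--prot_desc_path", "davis_data/prot_desc.csv",
    "--predict_only",                     # skip training, only predict
    "--csv_out", "./predictions.csv",     # where predictions are written
]
# subprocess.run(cmd, check=True)  # uncomment to run against your checkout
```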

Hope this helps. Please don't hesitate to ask me, should you have any more inquiries.

Running-z commented 5 years ago

@simonfqy Thank you for your reply again. I pulled your updated code, used the davis data, changed --model to graphconvreg, and modified np_epoch to 1, but I got the following error after training:

Traceback (most recent call last):
  File "driver.py", line 696, in <module>
    tf.app.run(main=run_analysis, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "driver.py", line 378, in run_analysis
    train_score = train_scores_list[h]
IndexError: list index out of range

But I did get the model_dir, which contains the checkpoint file, so I will use it to predict my data first. My protein data is for the JAK2 target, but if I change --dataset in drive4_m_warm.sh to JAK2, I get the following error:

Traceback (most recent call last):
  File "driver.py", line 696, in <module>
    tf.app.run(main=run_analysis, argv=[sys.argv[0]] + unparsed)
  File "/home/zh/anaconda3/envs/deep2.0.0/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "driver.py", line 152, in run_analysis
    if not split in [None] + CheckSplit[dataset]:
KeyError: 'JAK2'

So my dataset is not among the datasets you defined, and I can't train on it or predict with it. Do I need to write a dataset loader myself?

Next, I deleted all the files in davis_data and replaced them with my own prot_desc.csv and restructured.csv files, but the data could not be loaded normally. (Screenshots of my data omitted.)

Finally, I changed my data to the format of the davis data. (Screenshot omitted.)

Then my data could be loaded, but when I predict on it I get the following error:

Traceback (most recent call last):
  File "driver.py", line 696, in <module>
    tf.app.run(main=run_analysis, argv=[sys.argv[0]] + unparsed)
  File "/home/zh/anaconda3/envs/deep2.0.0/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "driver.py", line 278, in run_analysis
    prediction_file=csv_out)
  File "/project/git2/PADME/dcCustom/molnet/run_benchmark_models.py", line 194, in model_regression
    model.predict(train_dataset, transformers=transformers, csv_out=prediction_file, tasks=tasks)
  File "/project/git2/PADME/dcCustom/models/tensorgraph/tensor_graph.py", line 643, in predict
    predictions = self.predict_on_generator(generator, transformers, outputs)
  File "/project/git2/PADME/dcCustom/models/tensorgraph/tensor_graph.py", line 572, in predict_on_generator
    return self._predict(generator, transformers, outputs, False)
  File "/project/git2/PADME/dcCustom/models/tensorgraph/tensor_graph.py", line 534, in _predict
    feed_results = self.session.run(tensors, feed_dict=feed_dict)
  File "/home/zh/anaconda3/envs/deep2.0.0/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/home/zh/anaconda3/envs/deep2.0.0/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1104, in _run
    % (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (128, 1022) for Tensor 'Feature_38/Placeholder:0', which has shape '(?, 8421)'

Is this error because my model did not finish training correctly?

So I still can't correctly predict on my drug and protein data. The key question is: can the protein names in my prot_desc.csv be customized? And can the proteinName and protein_dataset fields in my restructured.csv be customized?

Sorry, I may have too many questions, but these are problems I encountered in actual use. I still hope you can give me guidance. Thank you very much.

simonfqy commented 5 years ago

Hi @Running-z. I am a bit confused by your first question. Are you using the current version of the repo? You can refer to the drive_d.sh file for a run using graphconvreg. Actually, all the .sh files with a 4 in the file name correspond to ECFP models, while all those without numbers are graphconvreg models. If this paragraph does not answer your question, please post the content of your .sh file.

I am not quite sure whether you followed the correct way to generate the PSC descriptors. I didn't do it myself; my colleague Evgenia generated them with a Python 2 script calling propy package functions, which I did not include in the repo. Make sure that you have sorted the columns in lexicographical order (A, AA, ...).

One important detail you've missed is that there is an additional entry (column) with binary values indicating phosphorylation; clearly you've left it out. We added that entry manually, with 1 indicating phosphorylated and 0 otherwise, resulting in a PSC descriptor with 8421 entries. This is because the Davis dataset contains some proteins that differ only in phosphorylation. If your dataset does not indicate phosphorylation explicitly, just fill that entry with 0.
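The column handling described above can be sketched as a small helper. Only the lexicographic sorting and the extra binary phosphorylation column come from this thread; the function name, the dict input, and the tiny three-column example are hypothetical (a real PSC vector has 8420 entries before the extra column).

```python
def build_descriptor_row(psc_values, phosphorylated=False):
    """Sort PSC columns lexicographically and append the binary
    phosphorylation entry, as described in this thread.

    psc_values: dict mapping descriptor name -> value (hypothetical input).
    Returns (column_names, row_values).
    """
    ordered_names = sorted(psc_values)        # lexicographic: A, AA, AAA, ...
    row = [psc_values[name] for name in ordered_names]
    row.append(1 if phosphorylated else 0)    # extra entry -> 8421 total for Davis
    return ordered_names + ["phosphorylated"], row

# Tiny illustrative example with made-up descriptor names and values:
names, row = build_descriptor_row({"AA": 0.2, "A": 0.5, "AC": 0.1},
                                  phosphorylated=True)
```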

Running-z commented 5 years ago

@simonfqy Thank you for your answer. After checking my data repeatedly, I found that this was indeed the problem: after I added the phosphorylation column, I could predict normally. Thank you. I also have a suggestion: when predicting, we often need predictions for all of the data, not just part of it, so perhaps you could modify your code so that split can be specified as None, as in deepchem. Of course, this is just a suggestion.

simonfqy commented 5 years ago

@Running-z Thanks. If you remove the --cross_validation, --cold_drug, --cold_target or --split_warm parameters from the .sh script, as I did for drive_nci60.sh, you will get predictions for all drug-target pairs, with no splits.

Note that those split schemes are only for training and validation. If you are predicting for new drug-target pairs for which the true interaction strengths are unknown, you should remove those parameters.

Running-z commented 5 years ago

@simonfqy Ok, thank you very much. Have you ever thought about unifying these data-processing functions into one? For example, restructured.csv could be passed in the same way as --prot_desc_path davis_data/prot_desc.csv, so you wouldn't need to write a separate data-processing function for each dataset; I would just pass the parameters corresponding to my data.

simonfqy commented 5 years ago

@Running-z I think that's a good point. I will do this in a couple of days. It is not as straightforward as it seems, because the existing implementation follows the approach in DeepChem: to add your requested feature I must unify functions like load_davis(), load_kiba(), etc.

Running-z commented 5 years ago

@simonfqy Okay thank you

simonfqy commented 5 years ago

@Running-z I believe I have been very patient and responsible regarding your questions and requests. If your research results in a publication, it would be very nice if you could put me in the Acknowledgement section. Thank you very much!

Bigrock-dd commented 4 years ago

@simonfqy Okay thank you

Sorry, could you help me solve a similar question? Thanks!!!

simonfqy commented 4 years ago

@Bigrock-dd I had some problems with my supervisor. I did not even get the full funding promised in my offer letter, and this research project has not been published in a peer-reviewed journal. More importantly, I only did this project to get an MSc degree; I don't take an interest in it (in fact, this is not even the project I wanted: my co-supervisor did not let me do the project I wanted for administrative reasons, and if I had done that other project I would have published a paper already), or in chemistry as a whole. I have now graduated, am working in industry, and NEVER want to work in CS- or chemistry-related academia. I hate academia now; my MSc experience was absolutely toxic. So the answer is: sorry, although this modification would be nice, I don't think it would give me much benefit considering the amount of work required. My life has shifted its focus now. However, if you want it, you can try it yourself or ask some friends or senior students to help you. You're always welcome to open a Pull Request, and I can review it. Also, my experience is my experience only; it should not affect your choices in any way.

ecom-research commented 4 years ago

@simonfqy ... Sad to hear that. I think, for a Master thesis, you did a great job in writing the paper and this repo. If I had a student like you, I would be happy. Probably, your supervisor had some limitations. Just wanted to let you know, you did a great job! Take care and be happy in whatever you do in life! :)

simonfqy commented 4 years ago

@ecom-research Thank you for your kind words! And I sincerely wish you all the best, especially in the current period of uncertainty.

simonfqy commented 1 year ago

Eventually I launched a legal claim against my supervisor and successfully recovered some of my owed funding. The story can be found at https://www.antonfeng.me/martin-esters-breach-of-law-and-lessons-i-learned, which contains both the lessons I learned and the original demand letter I sent to the professor.

I strongly recommend any graduate students or prospective graduate students to read it.

I also have a Medium version of the article, which omits the appendix containing the demand letter; otherwise it is mostly the same.