greenelab / deep-review

A collaboratively written review paper on deep learning, genomics, and precision medicine
https://greenelab.github.io/deep-review/

CGBVS-DNN: Prediction of Compound-protein Interactions Based on Deep Learning #117

Open kumardeep27 opened 7 years ago

kumardeep27 commented 7 years ago

http://doi.org/10.1002/minf.201600045

agitter commented 7 years ago

Abstract:

Computational prediction of compound-protein interactions (CPIs) is of great importance for drug design as the first step in in-silico screening. We previously proposed chemical genomics-based virtual screening (CGBVS), which predicts CPIs by using a support vector machine (SVM). However, the CGBVS has problems when training using more than a million datasets of CPIs since SVMs require an exponential increase in the calculation time and computer memory. To solve this problem, we propose the CGBVS-DNN, in which we use deep neural networks, a kind of deep learning technique, instead of the SVM. Deep learning does not require learning all input data at once because the network can be trained with small mini-batches. Experimental results show that the CGBVS-DNN outperformed the original CGBVS with a quarter million CPIs. Results of cross-validation show that the accuracy of the CGBVS-DNN reaches up to 98.2 % (σ<0.01) with 4 million CPIs.

agitter commented 7 years ago

I spent a few minutes looking at this and see some potential red flags. The construction of the training dataset is substantially different than related work (e.g. #55). I'm interested to see what you think @kumardeep27.

kumardeep27 commented 7 years ago

This is an advanced version, based on deep neural networks (DNNs), of the authors' previous SVM-based chemical genomics-based virtual screening (CGBVS) for predicting compound-protein interactions (CPIs). The framework handles millions of training examples with improved performance compared to the SVM-based version, and CGBVS computes CPIs without requiring 3D structures: the input is descriptors of both the ligand and the protein. The main aim of the paper is a framework capable of handling millions of examples (4 million). DNNs suit big data because they do not require all input data at once; training on mini-batches keeps memory requirements modest.

Dataset: more than 2 million positive interactions for the GPCR family were obtained from GVK-BIO and ChEMBL. Negative data were generated artificially by randomly recombining the proteins and compounds from the positive data into pairs absent from the positive set (Figure 1). The combined data were split 80:20 into training and testing sets with equal class representation in each. The feature space is a 1974-dimensional vector: 894 chemical descriptors scaled to [-1, 1] (Dragon) plus 1080 protein descriptors (PROFEAT).
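The negative-set construction described above (random recombination of positive-set proteins and compounds, rejecting known positives) can be sketched as follows; this is a minimal illustration, not the paper's code, and `sample_negatives` and the toy IDs are hypothetical names:

```python
import random

def sample_negatives(positives, n_neg, seed=0):
    """Sample compound-protein pairs absent from the positive set.

    `positives` is a set of (compound_id, protein_id) tuples; negatives are
    drawn by recombining the same compounds and proteins at random,
    rejecting any pair already known to interact.
    """
    rng = random.Random(seed)
    compounds = sorted({c for c, _ in positives})
    proteins = sorted({p for _, p in positives})
    negatives = set()
    while len(negatives) < n_neg:
        pair = (rng.choice(compounds), rng.choice(proteins))
        if pair not in positives:
            negatives.add(pair)
    return negatives

# Toy example: 3 positives over 3 compounds and 2 proteins.
positives = {("c1", "p1"), ("c2", "p2"), ("c3", "p1")}
negatives = sample_negatives(positives, n_neg=3)
```

Note that such "negatives" are only assumed non-interacting; whether that assumption holds is exactly the concern raised later in this thread.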

The deep belief network (DBN) framework is used: 1974 input units are followed by hidden layers pre-trained as restricted Boltzmann machines (RBMs) in an unsupervised fashion. The final hidden layer feeds a logistic-regression output layer giving a binary prediction, and the whole network is then fine-tuned end to end. To accommodate the large dataset, the authors chose CPUs over GPUs and modified the Theano code to run efficiently on CPUs; this CPU optimization of Theano is one of the main feats of the work. They applied thread-level and instruction-level parallelism, manual assembly, loop vectorization, and software prefetching.
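To make the unsupervised pre-training step concrete, here is a minimal NumPy sketch of one contrastive-divergence (CD-1) update for a Bernoulli RBM, the building block used to pre-train each hidden layer of a DBN. The function name and tiny dimensions are illustrative assumptions, not the authors' Theano implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b, c, batch, lr=0.1, rng=None):
    """One contrastive-divergence (CD-1) step for a Bernoulli RBM.

    W: (n_visible, n_hidden) weights; b, c: visible/hidden biases.
    Updates the parameters in place from a single mini-batch.
    """
    rng = rng or np.random.default_rng(0)
    # Positive phase: hidden probabilities given the data.
    h_prob = sigmoid(batch @ W + c)
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Negative phase: one Gibbs step back to the visible units and up again.
    v_prob = sigmoid(h_sample @ W.T + b)
    h_prob2 = sigmoid(v_prob @ W + c)
    n = batch.shape[0]
    W += lr * (batch.T @ h_prob - v_prob.T @ h_prob2) / n
    b += lr * (batch - v_prob).mean(axis=0)
    c += lr * (h_prob - h_prob2).mean(axis=0)
    return W, b, c

# Toy run: 8 visible units, 4 hidden units, one mini-batch of 10 samples.
rng = np.random.default_rng(0)
n_vis, n_hid = 8, 4
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b, c = np.zeros(n_vis), np.zeros(n_hid)
batch = (rng.random((10, n_vis)) < 0.5).astype(float)
W, b, c = cd1_update(W, b, c, batch, rng=rng)
```

In the paper's setup, one RBM per hidden layer would be trained this way in sequence, after which the stack plus the logistic-regression output layer is fine-tuned with supervised backpropagation.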

Four types of evaluation were done; performance is reported as accuracy on the 20% test split, with 5-fold cross-validation:

1. Different DNN architectures on a small-scale dataset: 2-9 layers with 1000, 2000, or 3000 units per layer were tested. Layers of 2000/3000 units performed best overall, and a 3-layer architecture with 2000 units per layer was finally chosen.
2. Pre-training vs. no pre-training: fine-tuning without pre-training achieved the better accuracy of 91.4% on a dataset of a quarter million CPIs.
3. Medium-scale vs. large-scale datasets: on the large datasets, the CPU-optimized Theano ran faster than the GPU hardware in both pre-training and fine-tuning.
4. Hyperparameter comparison (e.g., mini-batch size and learning rate): shuffling the mini-batch order within an epoch did not significantly change training accuracy but improved test performance. Batch sizes of 5, 10, 20, 40, and 80 were compared; a batch size of 10 was chosen for its good accuracy and short training time, and a learning rate of 0.2 gave the best accuracy.
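The in-order vs. shuffled mini-batch comparison in point 4 comes down to how sample indices are drawn each epoch; a minimal sketch, with the function name being a hypothetical choice of mine:

```python
import random

def minibatches(n_samples, batch_size, shuffle=True, seed=0):
    """Yield index lists for one epoch of mini-batch training.

    With shuffle=True the sample order is re-permuted (the shuffled
    setting compared in the paper); with shuffle=False the batches
    are taken in the original dataset order.
    """
    idx = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(idx)
    for start in range(0, n_samples, batch_size):
        yield idx[start:start + batch_size]

# 25 samples with the paper's chosen batch size of 10 -> batches of 10, 10, 5.
batches = list(minibatches(25, batch_size=10))
```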

agitter commented 7 years ago

@kumardeep27 Is your understanding that the GVK-BIO data they use is private? The reference in their paper is to the website http://www.gvkbio.com/. I'm trying to better understand how this training dataset differs from those used in the ligand-based approaches we have discussed.

Also, did you see whether they discuss how well the assumptions they make when generating negative instances hold in practice? They study G-protein coupled receptors, so this set of proteins is presumably much more similar than the full set of proteins in ChEMBL and could have some similarity in active compounds.

The previous paper from this group (http://doi.org/10.1038/msb.2011.5) does a good job of distinguishing different virtual screening approaches in Figure 1. We have mostly discussed ligand-based systems so far that train only on compound features. In contrast, this method models compound–protein interactions and featurizes both compounds and target proteins.

kumardeep27 commented 7 years ago

I could not find a link to download the dataset from the GVK-BIO website; I have emailed the corresponding author about it. In the absence of real negative data (which is lacking in most studies, or goes unreported), negative datasets are generated by random combination. Here they generated the negative data by a similar approach, and within the same dataset, which makes the classification more rigorous. Yes, you are right that GPCR proteins are likely more similar to each other than the full protein set. This work includes information on the binding partners (i.e., proteins) in addition to the chemical space (i.e., compounds), which makes the approach more rational, since the correct combinations of binding partners can be identified more reliably.

taneishi commented 7 years ago

I'm an implementor of this method. As you wrote, the data I used in this paper is private. I have published example scripts and data at my repository (https://github.com/ktaneishi/DBN-Kyoto) for benchmark use.

agitter commented 7 years ago

Thanks for commenting @ktaneishi. I still want to look at your paper more closely, but in the meantime perhaps you can help answer a question I had.

Since you are studying the GPCR family, did you check how similar the known positive compound-protein interactions are for different GPCR proteins? Potentially if proteins A and B have similar positive interactions within the set of compounds tested on both, when we observe <A, compound 1> as positive and <B, compound 1> as untested, it may be plausible that <B, compound 1> would be positive if we were to test it.

I've seen this negative set generation strategy used before, e.g. in protein-protein interaction prediction, but in that domain there are typically few common interaction partners for a randomly selected pair of proteins. I'm curious whether that holds in this compound-protein interaction domain as well.
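One way to quantify this concern is the Jaccard overlap between the known-ligand sets of pairs of receptors: high overlap suggests that a randomly generated "negative" pair may in fact be an untested positive. A minimal sketch, with entirely hypothetical receptor and compound IDs:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of known active compounds."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical ligand sets per receptor (illustrative only).
ligands = {
    "GPCR_A": {"c1", "c2", "c3"},
    "GPCR_B": {"c2", "c3", "c4"},
    "GPCR_C": {"c5"},
}
pairs = [("GPCR_A", "GPCR_B"), ("GPCR_A", "GPCR_C")]
overlap = {p: jaccard(ligands[p[0]], ligands[p[1]]) for p in pairs}
```

Under this toy data, GPCR_A and GPCR_B share half their ligand universe, so a randomly sampled "negative" pairing one of them with the other's ligand would be suspect, while GPCR_A and GPCR_C share nothing.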

taneishi commented 7 years ago

Thanks for your reply @agitter. To your question: it is important to consider the similarity of known ligands when generating the negative set, as you pointed out. For example, some GPCRs have many ligands and some have few or none, which causes a serious bias when building the model. I used an unbiased generated negative set in these experiments, the same as in the former work (http://doi.org/10.1038/msb.2011.5).

agitter commented 7 years ago

Thanks again @ktaneishi I'll look to the MSB paper for details.