masashitsubaki / CPI_prediction

This is code for compound-protein interaction (CPI) prediction based on a graph neural network (GNN) for compounds and a convolutional neural network (CNN) for proteins.
Apache License 2.0

Compound-protein interaction (CPI) prediction using a GNN for compounds and a CNN for proteins

_Important: this repository will not be further developed and maintained because we have shown and believe that graph neural networks or graph convolutional networks are incorrect and useless for modeling molecules (see our paper in NeurIPS 2020). Please consider switching to our new and simple machine learning model called quantum deep field._

This code is a PyTorch implementation of our paper "Compound-protein Interaction Prediction with End-to-end Learning of Neural Networks for Graphs and Sequences (Bioinformatics, 2018)". In this repository, we provide two CPI datasets, human and C. elegans, created in "Improving compound–protein interaction prediction by building up highly credible negative samples (Bioinformatics, 2015)." Note that the ratio of positive to negative samples is 1:1.

In our problem setting of CPI prediction, the input is a pair consisting of a compound in SMILES format and the amino acid sequence of a protein; the output is a binary label (interact or not). The SMILES string is converted with RDKit into 2D graph-structured data of the compound (i.e., atom types and their adjacency matrix).

[Figure: overview of our CPI prediction by GNN-CNN]

The details of the GNN and CNN are described in our paper. Note that this implementation is simpler than the model proposed in our original paper (e.g., it omits the edge vectors and their updates described in Eqs. (5) and (6)).
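The SMILES-to-graph conversion mentioned above can be sketched as follows. This is a minimal illustration, not the repository's preprocessing code: the helper name `smiles_to_graph` is hypothetical, and details such as hydrogen or aromaticity handling are assumptions left at RDKit's defaults.

```python
from rdkit import Chem


def smiles_to_graph(smiles):
    """Return (atom symbols, adjacency matrix) for a SMILES string.

    Hypothetical helper for illustration; hydrogens stay implicit,
    which is RDKit's default when parsing SMILES.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    atoms = [atom.GetSymbol() for atom in mol.GetAtoms()]
    # NumPy integer array with 1 where two atoms share a bond.
    adjacency = Chem.GetAdjacencyMatrix(mol)
    return atoms, adjacency


atoms, adjacency = smiles_to_graph("CCO")  # ethanol
print(atoms)
print(adjacency)
```

The atom list and the adjacency matrix share the same atom ordering, so row i of the matrix lists the bonds of atoms[i].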

In addition, the above CPI prediction uses our proposed GNN, which is based on learning representations of r-radius subgraphs (i.e., fingerprints) in molecules. We also provide an implementation of the GNN for predicting various molecular properties such as drug efficacy and photovoltaic efficiency in https://github.com/masashitsubaki/GNN_molecules.
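To make the idea of r-radius subgraphs concrete, here is a minimal sketch that assigns each atom an integer ID describing its neighborhood up to a given radius. It is an assumption-laden simplification: the function name `extract_subgraph_ids` is hypothetical, bond types are ignored, and the relabeling follows a generic Weisfeiler-Lehman style scheme rather than the exact procedure in the paper.

```python
def extract_subgraph_ids(atoms, adjacency, radius, vocab=None):
    """Assign each atom an ID for its r-radius subgraph (illustrative only).

    atoms: list of atom type labels, e.g. ["C", "C", "O"]
    adjacency: square 0/1 matrix (list of lists) in the same atom order
    vocab: optional dict shared across molecules so IDs stay consistent
    """
    vocab = {} if vocab is None else vocab

    def to_id(key):
        # Reuse the ID of a known pattern, otherwise assign the next free one.
        return vocab.setdefault(key, len(vocab))

    # Radius 0: the subgraph around an atom is just its atom type.
    labels = [to_id(("atom", a)) for a in atoms]
    neighbors = [[j for j, bonded in enumerate(row) if bonded] for row in adjacency]

    # Each step widens the neighborhood by one bond: the new label combines
    # the atom's current label with the sorted labels of its neighbors.
    for _ in range(radius):
        labels = [
            to_id((labels[i], tuple(sorted(labels[j] for j in neighbors[i]))))
            for i in range(len(atoms))
        ]
    return labels, vocab


chain = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # three atoms bonded in a line
ethanol_ids, _ = extract_subgraph_ids(["C", "C", "O"], chain, radius=1)
propane_ids, _ = extract_subgraph_ids(["C", "C", "C"], chain, radius=1)
print(ethanol_ids, propane_ids)
```

In the propane example the two end carbons receive the same ID because their radius-1 neighborhoods are identical, while the middle carbon gets a different one. IDs of this kind can then index a learnable embedding table.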

Characteristics

Requirements

Usage

We provide two major scripts:

(i) Create the tensor data of CPIs with the following command:

cd code
bash preprocess_data.sh

The preprocessed data are saved in the dataset/input directory.

(ii) Using the preprocessed data, train the model with the following command:

bash run_training.sh

The training and test results and the model are saved in the output directory (after training, see output/result and output/model).

(iii) You can change the hyperparameters in preprocess_data.sh and run_training.sh to train models with different settings.

Result

[Figure: learning curves on the test datasets of human and C. elegans (x-axis: epoch, y-axis: AUC)]

These results can be reproduced by running commands (i) and (ii) above.
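For readers unfamiliar with the AUC metric used for the learning curves, the following pure-Python sketch computes ROC AUC from its standard definition: the probability that a randomly chosen positive pair is scored higher than a randomly chosen negative pair, with ties counted as one half. The function name `roc_auc` and the toy scores are illustrative assumptions; this is not necessarily how the repository computes its reported values.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example (made-up scores, not results from the datasets).
auc = roc_auc([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])
print(auc)
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a rank-based implementation is preferable for large test sets.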

Training of our GNN-CNN using your CPI dataset

The directories dataset/human/original and dataset/celegans/original each contain the original data file "data.txt", formatted as follows:

CC[C@@]...OC)O MSPLNQ...KAS 0
C1C...O1 MSTSSL...FLL 1
CCCC(=O)...CC=C1 MAGAGP...QET 0
...
...
...
CC...C MKGNST...FVS 0
C(C...O)N MSPSPT...LCS 1

Each line has the form "SMILES sequence interaction", with the three fields separated by spaces. An interaction of 1 means that the SMILES-sequence pair interacts, and 0 means that it does not. If you prepare a dataset in the same format as "data.txt" in a new directory (e.g., dataset/yourdata/original), you can train our GNN-CNN on it by running commands (i) and (ii) above.
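Before running preprocessing on your own data, it can help to check that every line matches this three-field format. The sketch below is one way to do so; the function name `parse_cpi_lines` is hypothetical, and the two example lines are toy strings invented for illustration, not entries from the provided datasets.

```python
from collections import Counter


def parse_cpi_lines(lines):
    """Parse 'SMILES sequence interaction' lines into (str, str, int) tuples."""
    samples = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 3 or fields[2] not in ("0", "1"):
            raise ValueError(
                f"line {number}: expected 'SMILES sequence interaction' with label 0 or 1"
            )
        smiles, sequence, interaction = fields
        samples.append((smiles, sequence, int(interaction)))
    return samples


# Toy lines for illustration only (not real dataset entries).
example = [
    "CCO MSPLNQKAS 0",
    "C1CCOC1 MSTSSLFLL 1",
]
samples = parse_cpi_lines(example)
counts = Counter(label for _, _, label in samples)
print(samples[0], dict(counts))
```

To validate a real file, pass an open file object instead of the list, for example `parse_cpi_lines(open("dataset/yourdata/original/data.txt"))`. The label counts give a quick check of the positive-to-negative ratio.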

TODO

How to cite

@article{tsubaki2018compound,
  title={Compound-protein Interaction Prediction with End-to-end Learning of Neural Networks for Graphs and Sequences},
  author={Tsubaki, Masashi and Tomii, Kentaro and Sese, Jun},
  journal={Bioinformatics},
  year={2018}
}