Target Pred Py is a simple machine learning model, written in Python, that predicts the binding behavior of small-molecule drugs.
Similar work has been conducted by SwissTargetPrediction, Predict NPS, and SuperPred.
Target Pred Py is a simple model that can be easily expanded and improved on. It currently computes FP6 fingerprints and feeds them into a random forest classifier with a configurable number of trees. The sklearn random forest classifier holds all the decision trees in memory at the same time, and with the size of this data set (~200MB for just the features to SMILES with ChEMBL 25) the memory requirements grow rapidly with the number of trees. It takes an AWS r5.4xlarge (with 128GB RAM) to train the model with 150 trees in the forest, and it would require roughly double that memory to serialize the model and save it for later use. Refactoring to a different random forest library, or writing our own, would help here.
Currently the model with only FP6 fingerprints for features and only 150 trees in the random forest achieves 78% for top-1 precision, recall, and F1 score.
Although increasing the number of trees from 10 (77% accuracy) to 150 (78% accuracy) provides minimal improvement, it provides a measurable difference in top-5 accuracy.
Top-5 accuracy increases in a linear fashion from 89% at 10 trees to 96% with 150.
Adding more features from molecular descriptors or using an ensemble model would likely boost accuracy without much engineering effort.
Different models such as logistic regression (like SwissTargetPrediction), SVMs, and neural networks should also be tried; experiments are ongoing in the notebooks folder.
A simple neural network gets 79% top-1 accuracy and 97% top-5 accuracy, and the chemprop neural network gets 99% accuracy in preliminary testing.
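The top-1 and top-5 figures above count a prediction as correct when the true label appears among the k highest-scoring classes. The following is a minimal sketch of that metric in plain Python, not the project's evaluation code; the function name `top_k_accuracy` and the toy labels and scores are made up for illustration.

```python
def top_k_accuracy(true_labels, class_probabilities, k=5):
    """Fraction of samples whose true label is among the k most probable classes.

    true_labels: list of labels, one per sample.
    class_probabilities: list of dicts mapping label -> predicted probability.
    """
    if len(true_labels) != len(class_probabilities):
        raise ValueError("need one probability dict per true label")
    if not true_labels:
        raise ValueError("no samples to score")

    hits = 0
    for label, probs in zip(true_labels, class_probabilities):
        # Sort candidate labels from most to least probable and keep the first k.
        ranked = sorted(probs, key=probs.get, reverse=True)[:k]
        if label in ranked:
            hits += 1
    return hits / len(true_labels)


# Toy data (hypothetical): three molecules with made-up target labels and scores.
y_true = ["COX", "MDM2", "DPP4"]
y_prob = [
    {"COX": 0.70, "CA2": 0.30, "MDM2": 0.00, "DPP4": 0.00},
    {"COX": 0.50, "MDM2": 0.40, "CA2": 0.10, "DPP4": 0.00},
    {"COX": 0.60, "CA2": 0.30, "MDM2": 0.10, "DPP4": 0.00},
]

top1 = top_k_accuracy(y_true, y_prob, k=1)
top2 = top_k_accuracy(y_true, y_prob, k=2)
print(f"top-1: {top1:.2f}, top-2: {top2:.2f}")
```

On this toy data only the first molecule is a top-1 hit, while the second becomes a hit once k is raised to 2, which is the same effect that separates the top-1 and top-5 numbers reported above.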
The primary model is in StructureToMOARFModel, which predicts a mechanism of action from the structure of a drug-like molecule.
This model is trained by creating a data set of chemical structures (encoded as SMILES) mapped to mechanisms of action.
The SMILES data is used to generate a feature vector for each molecule with chemical fingerprinting algorithms, and this is fed into a random forest machine learning algorithm.
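The training step described above can be sketched with scikit-learn. This is an illustration under stated assumptions, not the code in StructureToMOARFModel: random bit vectors stand in for the real fingerprints (which would be computed from SMILES with a cheminformatics library such as RDKit), and the three target labels are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-in for fingerprint bit vectors: 20 molecules per class, 64 bits each.
# Real fingerprints would be derived from SMILES; these are random toy data.
n_per_class, n_bits = 20, 64
classes = ["target_A", "target_B", "target_C"]  # hypothetical labels
X_parts, y = [], []
for i, name in enumerate(classes):
    bits = rng.integers(0, 2, size=(n_per_class, n_bits))
    # Switch on a class-specific block of bits so the toy problem is learnable.
    bits[:, i * 8:(i + 1) * 8] = 1
    X_parts.append(bits)
    y.extend([name] * n_per_class)
X = np.vstack(X_parts)

# n_estimators is the number of trees; every tree is held in memory at once.
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X, y)

# Rank all classes for one molecule by predicted probability, highest first.
proba = model.predict_proba(X[:1])[0]
order = np.argsort(proba)[::-1]
ranked = [(model.classes_[j], float(proba[j])) for j in order]
for target, p in ranked:
    print(f"{target:10s} {p:.3f}")
```

Raising `n_estimators` is the knob that trades memory for accuracy; the ranked list printed at the end has the same shape as a per-target probability table.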
We use Anaconda as the base Python. Install it from https://www.anaconda.com/products/individual.
Then to get up and running, create the environment:
conda create --name target-pred-py
conda activate target-pred-py
conda install -c pytorch -n target-pred-py pytorch cudatoolkit=10.2 # or cpuonly if you don't have CUDA
conda install -c conda-forge -n target-pred-py rdkit textdistance
conda install -n target-pred-py scikit-learn
Then make the dataset, build the features, and train the model:
export PYTHONPATH=$PWD
cd src/data/
python make_dataset.py
cd ../features/
python build_features.py
cd ../models/
python train_structuretomoa.py --rf
Now you can predict on new molecules:
python predict_structuretomoa.py --smiles "CC(=O)OC1=CC=CC=C1C(=O)O" --rf # SMILES string for aspirin (https://pubchem.ncbi.nlm.nih.gov/compound/2244)
Example output:
CC(=O)OC1=CC=CC=C1C(=O)O predicted to act on:
   target                                             probability
0  Cyclooxygenase-2, Cyclooxygenase-1                 0.696304
1  Carbonic anhydrase II, Carbonic anhydrase VA, ...  0.303696
2  p53-binding protein Mdm-2                          0.000000
3  Focal adhesion kinase 1                            0.000000
4  Dipeptidyl peptidase IV, Dipeptidyl peptidase ...  0.000000
Target Pred Py uses and includes data from ChEMBL (http://www.ebi.ac.uk/chembl), version chembl_25.
Project structure based on the cookiecutter data science project template.