Citing the paper:
@article{khurana2018deepsol,
title={DeepSol: a deep learning framework for sequence-based protein solubility prediction},
author={Khurana, Sameer and Rawi, Reda and Kunji, Khalid and Chuang, Gwo-Yu and Bensmail, Halima and Mall, Raghvendra},
journal={Bioinformatics}
}
Protein solubility can be a decisive factor in both research and production efficiency. Novel in silico, accurate, sequence-based protein solubility predictors are highly sought.
This step will install all the dependencies required for running DeepSol in an Anaconda virtual environment locally. You do not need sudo permissions for this step.
Install Anaconda
bash Anaconda3-2019.03-Linux-x86_64.sh
and follow the instructions to install.
Creating the environment
git clone https://github.com/sameerkhurana10/DSOL_rv0.2.git
cd DSOL_rv0.2
export PATH=<your_anaconda_folder>/bin:$PATH
conda env create -f environment.yml
source activate dsol
If running on a machine with a GPU, additionally run:
conda install cudnn pygpu libgpuarray
R requirements
R
install.packages('Interpol')
install.packages('bio3d')
install.packages('doMC')
Quit the R REPL: quit()
SCRATCH (version SCRATCH-1D release 1.1) (http://scratch.proteomics.ics.uci.edu, Downloads: http://download.igb.uci.edu/#sspro)
wget http://download.igb.uci.edu/SCRATCH-1D_1.1.tar.gz
tar -xvzf SCRATCH-1D_1.1.tar.gz
cd SCRATCH-1D_1.1
perl install.pl
cd ..
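A quick sanity check after installing SCRATCH can save a failed feature-extraction run later. The sketch below only verifies that the predictor script exists at the relative path used by the PaRSnIP command in this README; the helper name `find_scratch_predictor` is hypothetical and not part of DeepSol or SCRATCH.

```python
from pathlib import Path

# Relative location of the predictor script, as used by the PaRSnIP command
# in this README. The helper below is a hypothetical convenience function.
PREDICTOR_RELATIVE_PATH = Path("bin") / "run_SCRATCH-1D_predictors.sh"


def find_scratch_predictor(scratch_root):
    """Return the path to the SCRATCH predictor script, or raise if missing."""
    script = Path(scratch_root) / PREDICTOR_RELATIVE_PATH
    if not script.is_file():
        raise FileNotFoundError(f"SCRATCH predictor script not found: {script}")
    return script
```

For example, `find_scratch_predictor("SCRATCH-1D_1.1")` would return the script path if the installation above completed in the current folder.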
All operations related to DeepSol models are to be performed from the folder DSOL_rv0.2
To run DeepSol on your own protein sequences you need the following two things:
Using data/Seq_solo.fasta as an example:
R --vanilla < scripts/PaRSnIP.R data/Seq_solo.fasta <path-to-your-scratch-installation>/bin/run_SCRATCH-1D_predictors.sh new_test 32
Here 32 is the number of processors and new_test is the output files' prefix.
Following this step, two files are created in the data folder:
new_test_src: contains raw protein sequences
new_test_src_bio: contains biological features corresponding to the raw protein sequences
Note: data/Seq_multi.fasta can be used instead of data/Seq_solo.fasta. Seq_multi.fasta has multiple protein sequences.
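Before running the feature-extraction step on your own file, it can help to confirm that the input parses as FASTA and to count how many sequences it holds. The following is a minimal sketch of a FASTA reader, not code from DeepSol or PaRSnIP; the function names `read_fasta` and `count_sequences` are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_sequences(path):
    """Count the FASTA records in the file at `path`."""
    with open(path) as handle:
        return sum(1 for _ in read_fasta(handle))
```

With this sketch, `count_sequences("data/Seq_multi.fasta")` would report how many sequences the multi-sequence example contains.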
./run.sh --model deepsol1 --stage 1 --mode preprocess --device cpu --test_file new_test data/newtest.data
Note: If you get an MKL error, add export MKL_THREADING_LAYER=GNU to run.sh.
This step preprocesses the data files from step 1 and stores the result at data/newtest.data in a format acceptable to DeepSol models.
Note: You can also use deepsol2 or deepsol3 in place of deepsol1. See the paper for more details.
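If you drive run.sh from a Python script, an alternative to editing run.sh is to set the variable in the environment passed to the child process, since child processes inherit environment variables. This is a sketch under the assumption that run.sh does not override MKL_THREADING_LAYER itself; the helper name `env_with_gnu_threading` is hypothetical.

```python
import os


def env_with_gnu_threading(base_env=None):
    """Return a copy of the environment with MKL_THREADING_LAYER=GNU set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MKL_THREADING_LAYER"] = "GNU"
    return env


# Example invocation (not executed here), using the preprocess command above:
# import subprocess
# subprocess.run(
#     ["./run.sh", "--model", "deepsol1", "--stage", "1", "--mode", "preprocess",
#      "--device", "cpu", "--test_file", "new_test", "data/newtest.data"],
#     env=env_with_gnu_threading(),
#     check=True,
# )
```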
./run.sh --model deepsol1 --stage 2 --mode decode --device cpu data/newtest.data
The result will be saved in results/reports/. Note: you can also use deepsol2 or deepsol3 in place of deepsol1.
The recipe is contained in the script run.sh. To see the options, run ./run.sh and you will see the following:
main options (for others, see top of script file)
--model (deepsol1/deepsol2/deepsol3) # model architecture to use
--mode (preprocess/train/decode/cv) # data preparation or decode or cross-validate using an existing model
--stage (1/2) # point to run the script from
--conf_file # model parameter file
--keras_backend # backend for keras
--cuda_root # the path cuda installation
--device (cuda/cpu) # device to use for running the recipe
--test_file # name of the new test file
There are two stages in the script.
Stage 1 prepares the data files (protein.data and protein_with_bio.data).
Stage 2 covers training with --mode train and decoding with the best DeepSol models using --mode decode. Information about --mode cv is given in the "parameter variance check" section.
We provide support for GPU usage through the option --device cuda. More details are in the GPU section.
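When scripting several runs, assembling the run.sh arguments programmatically helps catch typos in option values early. The sketch below builds the argument list from the options listed above and rejects values outside the documented choices. The function `build_run_command` is a hypothetical helper, not part of DeepSol, and it only covers the options that appear in the example commands of this README.

```python
# Allowed values, copied from the run.sh option listing in this README.
MODELS = ("deepsol1", "deepsol2", "deepsol3")
MODES = ("preprocess", "train", "decode", "cv")
STAGES = (1, 2)
DEVICES = ("cuda", "cpu")


def build_run_command(model, stage, mode, device, data_file,
                      test_file=None, cuda_root=None):
    """Return the run.sh invocation as a list of arguments."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if device not in DEVICES:
        raise ValueError(f"unknown device: {device}")

    cmd = ["./run.sh", "--model", model, "--stage", str(stage), "--mode", mode]
    if cuda_root is not None:
        cmd += ["--cuda_root", cuda_root]
    cmd += ["--device", device]
    if test_file is not None:
        cmd += ["--test_file", test_file]
    cmd.append(data_file)
    return cmd
```

The returned list can be passed directly to `subprocess.run` from the DSOL_rv0.2 folder, or joined with spaces to print the equivalent shell command.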
Train DeepSol models using the pre-compiled training and validation data and the optimal hyper-parameter settings in the parameters.json file:
./run.sh --model deepsol1 --stage 2 --mode train --device cpu data/protein.data
./run.sh --model deepsol2 --stage 2 --mode train --device cpu data/protein_with_bio.data
The result will be a model named deepsol1 or deepsol2 stored in results/models/.
Note that the second command uses --model deepsol2; you can use deepsol3 in its place. Ignore any UserWarning in the output.
Test existing DeepSol models with pre-compiled test data:
./run.sh --model deepsol1 --stage 2 --mode decode --device cpu data/protein.data
./run.sh --model deepsol2 --stage 2 --mode decode --device cpu data/protein_with_bio.data
The result will be saved in results/reports/.
Note that the second command uses --model deepsol2; you can use deepsol3 in its place.
Ensure that CUDA is installed. We support CUDA 8.0 and cuDNN 5.1. With any other version of cuDNN, you might run into issues.
Install CUDA 8.0 and cuDNN 5.1 from https://developer.nvidia.com/
The code was tested on GeForce GTX 1080 Nvidia GPUs (https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080/) with GPU driver version 384.59.
The code was also tested on an Nvidia Tesla K20Xm (https://www.techpowerup.com/gpudb/1884/tesla-k20xm) with driver version 375.66.
Train DeepSol models using the pre-compiled training and validation data and the optimal hyper-parameter settings in the parameters.json file:
./run.sh --model deepsol1 --stage 2 --mode train --cuda_root <path-to-your-cuda-installation> --device cuda data/protein.data
./run.sh --model deepsol2 --stage 2 --mode train --cuda_root <path-to-your-cuda-installation> --device cuda data/protein_with_bio.data
The result will be a model named deepsol1 or deepsol2 stored in results/models/.
Note that the second command uses --model deepsol2; you can use deepsol3 in its place. Ignore any UserWarning in the output.
Also, --cuda_root should be the path to your CUDA installation. By default it is /usr/local/cuda.
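A wrong --cuda_root is an easy mistake to make, so a small pre-flight check can be useful. The sketch below relies on the assumption that a CUDA toolkit directory contains the nvcc compiler at bin/nvcc, which is the usual toolkit layout but is not something run.sh is known to check. The function name `looks_like_cuda_root` is hypothetical.

```python
from pathlib import Path

# Default CUDA location stated in this README.
DEFAULT_CUDA_ROOT = "/usr/local/cuda"


def looks_like_cuda_root(cuda_root=DEFAULT_CUDA_ROOT):
    """Heuristic check: a CUDA toolkit directory normally contains bin/nvcc."""
    return (Path(cuda_root) / "bin" / "nvcc").is_file()
```

If `looks_like_cuda_root()` returns False, pass the actual installation path through --cuda_root instead of relying on the default.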
Test existing DeepSol models with pre-compiled test data:
./run.sh --model deepsol1 --stage 2 --mode decode --cuda_root <path-to-your-cuda-installation> --device cuda data/protein.data
./run.sh --model deepsol2 --stage 2 --mode decode --cuda_root <path-to-your-cuda-installation> --device cuda data/protein_with_bio.data
The result will be saved in results/reports/.
Note that the second command uses --model deepsol2; you can use deepsol3 in its place.
In this section we calculate the variance in performance of the DeepSol models across 10 cross-validation folds on the dataset used in our paper.
For CPU:
./run.sh --model deepsol1 --stage 2 --mode cv --device cpu data/protein.data
./run.sh --model deepsol2 --stage 2 --mode cv --device cpu data/protein_with_bio.data
For GPU:
./run.sh --model deepsol1 --stage 2 --mode cv --cuda_root <path-to-your-cuda-installation> --device cuda data/protein.data
./run.sh --model deepsol2 --stage 2 --mode cv --cuda_root <path-to-your-cuda-installation> --device cuda data/protein_with_bio.data
The result will be saved in results/reports/.
Note that the second command of each pair uses --model deepsol2; you can use deepsol3 in its place.
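To make the notion of variance across folds concrete, the sketch below computes the mean, sample variance, and sample standard deviation of a list of per-fold scores. It is an illustration only: the format of the files written to results/reports/ is not described here, so extracting the per-fold scores is left to the reader, and the function name `summarize_folds` is hypothetical.

```python
import statistics


def summarize_folds(scores):
    """Return mean, sample variance and sample standard deviation of fold scores."""
    if len(scores) < 2:
        raise ValueError("need at least two fold scores to compute a variance")
    return {
        "mean": statistics.mean(scores),
        "variance": statistics.variance(scores),  # sample variance (n - 1)
        "stdev": statistics.stdev(scores),        # square root of the above
    }
```

The sample variance (dividing by n - 1) is used because the 10 folds are treated as a sample of possible data splits; use `statistics.pvariance` instead if the population variance is preferred.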
A) On most Linux-based systems. We tested on Ubuntu 14.04 and 14.10, RedHat 7.4 Maipo, and Arch (both CPU and GPU).
A) To remove the dsol environment, run conda remove --name dsol --all
error while loading shared libraries: libmpfr.so.4: while installing SCRATCH on Arch?
A) Run ln -s /usr/lib/libmpfr.so.6.0.0 /usr/lib/libmpfr.so.4. SCRATCH looks for libmpfr.so.4 but Arch has a newer version, so we symlink the old location to the new library.
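Before creating the symlink, it may be worth confirming that it is actually needed on your system. The sketch below reports whether libmpfr.so.4 is absent while some other libmpfr.so.* file is present in a library directory. It assumes the library lives in /usr/lib as in the answer above; the function name `mpfr_symlink_needed` is hypothetical.

```python
from pathlib import Path


def mpfr_symlink_needed(lib_dir="/usr/lib"):
    """Return True if libmpfr.so.4 is missing but another libmpfr version exists."""
    lib_dir = Path(lib_dir)
    if (lib_dir / "libmpfr.so.4").exists():
        return False  # the name SCRATCH looks for is already there
    # Any other versioned libmpfr file could serve as the symlink target.
    return any(lib_dir.glob("libmpfr.so.*"))
```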