greenelab / deep-review

A collaboratively written review paper on deep learning, genomics, and precision medicine
https://greenelab.github.io/deep-review/

MUST-CNN: A Multilayer Shift-and-Stitch Deep Convolutional Architecture for Sequence-based Protein Structure Prediction #58

Open agitter opened 8 years ago

agitter commented 8 years ago

https://arxiv.org/abs/1605.03004

kumardeep27 commented 8 years ago

MUST-CNN is a 1D classification algorithm that takes a protein sequence as input and makes a prediction for every amino acid. The authors implemented it in the Torch7 framework. The model uses multiple convolutional layers and a multitask approach to label each amino acid for several tasks in a single pass. The paper describes a multilayer shift-and-stitch convolutional neural network architecture for predicting protein properties at the amino acid level from protein sequences. The idea is borrowed from per-position labelling with deep CNNs in image classification and is applied here for the first time to protein sequences.

Briefly, the input is the amino acid sequence together with PSI-BLAST PSSM features: an amino acid embedding is combined with the PSSM matrices and fed to the deep CNN. The algorithm shifts the amino acid sequence according to the pooling in each layer and stitches the results after the convolution layers to obtain a deep embedding for each amino acid. The authors process the whole sequence at once, whereas previous methods used windowing. The deep embeddings are fed to multitask linear layers, with a separate layer for each task.

CNN: The input is shifted and padded, then passed through a convolution, a nonlinearity (ReLU) and max pooling. The output is passed through a dropout layer, which applies a random mask to the outputs and acts as regularization. Dropout is removed at test time, when all weights are used.

MUST: Max pooling reduces dimensionality by taking the maximum over each pooling window. To preserve the per-position identity of the original input during this reduction, the multilayer shift-and-stitch method is used. It is applied in every layer (3 layers) and produces dense per-position predictions for a given sequence.
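The shift-and-stitch idea can be illustrated on a single max-pooling layer. This is a minimal sketch in plain Python, not the authors' Torch7 code: the function names, the `-inf` tail padding and the example values are assumptions made for illustration. It pools each shifted copy of the input with a stride equal to the pool size and interleaves the results, so the output has one value per input position.

```python
NEG_INF = float("-inf")


def max_pool(xs, pool):
    """Non-overlapping max pooling (stride == pool); the tail is padded with -inf."""
    padded = xs + [NEG_INF] * (-len(xs) % pool)
    return [max(padded[i:i + pool]) for i in range(0, len(padded), pool)]


def shift_and_stitch(xs, pool):
    """Pool every shifted copy of xs, then interleave the outputs per position."""
    out = [NEG_INF] * len(xs)
    for shift in range(pool):
        pooled = max_pool(xs[shift:], pool)
        for i, value in enumerate(pooled):
            # Output i of the copy shifted by `shift` belongs to this position.
            out[shift + i * pool] = value
    return out


def dense_max(xs, pool):
    """Reference: stride-1 max over a window starting at each position."""
    return [max(xs[j:j + pool]) for j in range(len(xs))]


signal = [3, 1, 4, 1, 5, 9, 2, 6]
stitched = shift_and_stitch(signal, 2)
print(stitched)
```

Plain pooling with a pool size of 2 would return 4 values for these 8 inputs. The stitched result has 8 values and matches `dense_max`, which is the sense in which the output is "dense".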

Data: Two main datasets of protein properties were used.

- 4prot: from a previous study (Qi et al., 2012), split into train (80%), validation (20%) and test (20%).
- CullPDB: from Zhou and Troyanskaya (2014), split into train (80%) and validation (20%), with the CB513 dataset used for testing.

Features: Two types of features were used: a) individual amino acid features and b) PSI-BLAST-generated PSSM matrices, where a higher PSSM score means an amino acid is more likely to replace the current amino acid in other species.

Tasks/outputs: Four classification tasks were reported:

- dssp: 8-class secondary structure prediction
- ssp: 3-class secondary structure prediction (a condensed version of the 8 dssp classes)
- sar: relative solvent accessibility
- saa: absolute solvent accessibility
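To make the relation between the 8-class and 3-class secondary structure tasks concrete, here is a small sketch of one widely used reduction of DSSP states (helix H/G/I, strand E/B, everything else coil). Whether the ssp labels in the paper use exactly this grouping is an assumption; the mapping table and function name are illustrative only.

```python
# One common 8-to-3 reduction of DSSP states (assumed, not taken from the paper).
DSSP8_TO_SS3 = {
    "H": "H", "G": "H", "I": "H",  # helices
    "E": "E", "B": "E",            # strands and bridges
    "T": "C", "S": "C", "-": "C",  # turn, bend and unassigned become coil
}


def reduce_dssp(labels):
    """Map a string of 8-state DSSP labels to 3-state labels, position by position."""
    return "".join(DSSP8_TO_SS3[state] for state in labels)


print(reduce_dssp("HGIEBTS-"))
```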

Training: For the small model, hyperparameters were selected by Bayesian optimization with the Spearmint package. For the large model, they were chosen by a combination of grid search and manual tuning. The overall architecture of MUST-CNN is given in Table 2.

The multitask model took less time than the fine-tuned models (four models, one per task).

Using max pooling with the shift-and-stitch approach improved average accuracy by 0.5%. On the 4prot dataset, the small model outperformed the previous state-of-the-art method of Qi et al. on all tasks. Fine-tuning further increased performance on each of the 4 tasks independently, leaving room for task-specific MLP sub-classifiers. The large model likewise beats Qi et al. Small models are 2.5 times faster than large models at prediction time. The authors claim to be the first in the protein prediction domain to report precise training and testing times for a model. Table 4 gives the detailed results on the 4prot data in terms of F1 score, recall and precision. Table 5 shows performance on CullPDB, which is 1% better than previous methods. On CB513 performance is comparable, which the authors attribute to the removal of non-homologous protein sequences from the training dataset.
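For reference, precision, recall and F1 for one class of a per-position labelling can be computed as below. This is a generic sketch: the function name and the toy label strings are made up, and how the paper averages these scores across classes is not stated here, so no averaging is shown.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Per-class precision, recall and F1 over aligned per-position labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: true and predicted 3-state labels for an 8-residue sequence.
true_ss = "HHHEECCC"
pred_ss = "HHEEECCH"
precision, recall, f1 = precision_recall_f1(true_ss, pred_ss, "H")
print(precision, recall, f1)
```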

Shift-and-stitch makes it faster to compute CNN scores for all windows of a sequence in one pass. The method is better than other sequence-based approaches (GSN, LSTM and CNF). MUST-CNN is faster in execution and comparable in performance to previous methods. For the 2 main datasets, the main network architecture stays the same apart from the amount of dropout regularization, which suggests the model is robust given a good starting set of hyperparameters. In short, this is a simpler, faster, non-window-based deep CNN that uses the multilayer shift-and-stitch approach to make per-position protein structure predictions from sequence input. The 1D CNNs offer parameter sharing, pooling and reduced computation, and are highly parallelizable and fast. The only limitation is low-resolution output, which the multilayer shift-and-stitch approach handles nicely. The shift-and-stitch CNN is implemented end-to-end, meaning in all layers and not just the last layer. Shift-and-stitch was initially used in image classification to improve output.
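The difference between window-based and whole-sequence processing can be seen on one convolutional layer. In this sketch (hypothetical function names, an odd-length kernel and zero padding are assumed), extracting a window around each residue and scoring it gives the same numbers as sliding the filter over the whole padded sequence once. For a single layer this only shows that the outputs agree. The speed benefit described above would come from deeper networks, where overlapping windows repeat work on shared positions.

```python
def conv_whole_sequence(xs, kernel):
    """Slide an odd-length filter over the zero-padded sequence in one pass."""
    half = len(kernel) // 2
    padded = [0] * half + xs + [0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(xs))
    ]


def conv_per_window(xs, kernel):
    """Extract a window centred on each position and score it separately."""
    half = len(kernel) // 2
    scores = []
    for center in range(len(xs)):
        window = [
            xs[p] if 0 <= p < len(xs) else 0
            for p in range(center - half, center + half + 1)
        ]
        scores.append(sum(k * v for k, v in zip(kernel, window)))
    return scores


features = [1, 2, 3, 4, 5]
edge_filter = [1, 0, -1]
print(conv_whole_sequence(features, edge_filter))
print(conv_per_window(features, edge_filter))
```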