Motivation: Protein contacts contain key information for the understanding of protein structure and function, and thus contact prediction from sequence is an important problem. Recently, exciting progress has been made on this problem, but the predicted contacts for proteins without many sequence homologs are still of low quality and not very useful for de novo structure prediction. Method: This paper presents a new deep learning method for contact prediction that predicts contacts by integrating both evolutionary coupling (EC) information and sequence conservation information through an ultra-deep neural network consisting of two deep residual neural networks. The first residual network conducts a series of 1-dimensional convolutional transformations of sequential features; the second residual network conducts a series of 2-dimensional convolutional transformations of pairwise information, including the output of the first residual network, EC information, and pairwise potential. This neural network allows us to model the very complex relationship between sequence and contact map as well as long-range interdependency between contacts and thus obtain high-quality contact prediction. Results: Our method greatly outperforms existing contact prediction methods and leads to much more accurate contact-assisted protein folding. For example, on the 105 CASP11 test proteins, the L/10 long-range accuracy obtained by our method is 83.3%, while that by CCMpred and MetaPSICOV (the CASP11 winner) is 43.4% and 60.2%, respectively. On the 398 membrane proteins, the L/10 long-range accuracy obtained by our method is 79.6%, while that by CCMpred and MetaPSICOV is 51.8% and 61.2%, respectively. Ab initio folding guided by our predicted contacts can yield correct folds (i.e., TMscore > 0.6) for 224 of the 579 test proteins, while folding guided by MetaPSICOV- and CCMpred-predicted contacts can do so for only 79 and 62 of them, respectively. Further, our contact-assisted models also have much better quality (especially for membrane proteins) than template-based models.
The goal is to predict a 2D protein contact map given a protein's amino acid sequence and features computed from that sequence
Protein contact: two amino acid residues are in contact if their beta carbons are less than 8 angstroms apart, so for a given protein the ground truth is a symmetric, binary 2D residue-by-residue matrix
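As a concrete illustration, a minimal numpy sketch of that ground-truth matrix (the `cb_coords` input is a hypothetical (L, 3) array of beta-carbon coordinates; by the usual convention the alpha carbon stands in for glycine, which has no beta carbon):

```python
import numpy as np

def contact_map(cb_coords, cutoff=8.0):
    """Ground-truth contacts: residue pairs whose beta carbons are < cutoff angstroms apart.

    cb_coords: (L, 3) array of beta-carbon coordinates (hypothetical input).
    Returns a symmetric, binary (L, L) matrix.
    """
    diffs = cb_coords[:, None, :] - cb_coords[None, :, :]  # (L, L, 3) pairwise offsets
    dists = np.linalg.norm(diffs, axis=-1)                 # (L, L) Euclidean distances
    return (dists < cutoff).astype(np.uint8)
```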
Related to other protein secondary structure prediction problems (#51, #92, and others we haven't added to our list yet)
It is well known that co-evolutionary information is useful for predicting residue contacts; if two residues are in contact, a mutation in one tends to be compensated by a correlated mutation in the other, so contacting pairs co-evolve rather than evolving independently (more or less)
Evolutionary coupling requires a multiple sequence alignment over a large group of homologs to work well and performs poorly when few homologs are available
The unnamed method here still performs better when there are more effective sequence homologs for the target protein, but its performance does not drop off as badly for proteins with fewer homologs
They evaluate on several standard datasets, including CASP11; CASP12 recently closed so I would expect this method was benchmarked there as well
Computational aspects
The core ideas are using deep residual networks and combining 1D and 2D features
Input: 1D information from the protein amino acid sequence (including derived features such as predicted solvent accessibility) and 2D information such as mutual information and evolutionary coupling of residue pairs
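For intuition about one such pairwise feature, here is a tiny sketch of the mutual information between two alignment columns (plain Python/numpy; the paper's EC features, e.g. CCMpred output, are stronger co-evolution signals computed differently):

```python
import numpy as np
from collections import Counter

def column_mi(col_i, col_j):
    """Mutual information between two MSA columns (sequences of residue characters).

    A simple co-variation signal; real pipelines also correct for phylogeny
    and use dedicated EC methods as separate features.
    """
    n = len(col_i)
    p_i, p_j = Counter(col_i), Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in p_ij.items():
        p_ab = count / n
        mi += p_ab * np.log(p_ab / ((p_i[a] / n) * (p_j[b] / n)))
    return mi
```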
The basic unit of their residual network does batch normalization -> ReLU activation -> convolution -> ReLU -> convolution, and then the block's input is added to that output (sketch below)
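A minimal PyTorch-style sketch of that block (the paper's implementation is Theano-based; channel counts and kernel size here are placeholders):

```python
import torch
import torch.nn as nn

class ResBlock2D(nn.Module):
    """Residual block matching the description: BN -> ReLU -> conv -> ReLU -> conv,
    with the block's input added to its output."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2  # keep spatial size unchanged
        self.bn = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):
        h = self.conv1(torch.relu(self.bn(x)))
        h = self.conv2(torch.relu(h))
        return x + h  # residual (skip) connection
```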
The flow of the network: 1D transformations of the 1D features; conversion of the 1D network's output into new 2D features, combined with the existing 2D features; many 2D transformations of the 2D features; and sigmoid outputs for every residue pair
The 1D to 2D mapping creates a feature vector for residues i and j by concatenating the 1D residual network's outputs for residues i, (i+j)/2, and j and the external 2D features (mutual information, etc.)
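A sketch of that 1D-to-2D conversion (hypothetical shapes: `seq_feats` is the (L, c) output of the 1D network, `pair_feats` is an (L, L, c2) stack of precomputed pairwise features):

```python
import torch

def one_d_to_two_d(seq_feats, pair_feats):
    """Build the (L, L, 3*c + c2) pairwise input to the 2D network.

    For pair (i, j), concatenate the 1D features of residues i, (i+j)//2, and j,
    then append the external 2D features (mutual information, EC, etc.).
    """
    L = seq_feats.shape[0]
    i_idx = torch.arange(L).unsqueeze(1).expand(L, L)  # i_idx[a, b] = a
    j_idx = torch.arange(L).unsqueeze(0).expand(L, L)  # j_idx[a, b] = b
    m_idx = (i_idx + j_idx) // 2                       # midpoint residue index
    return torch.cat(
        [seq_feats[i_idx], seq_feats[m_idx], seq_feats[j_idx], pair_feats], dim=-1
    )
```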
Neural network details: Theano-based; 1D windows of width 17; 2D windows of 3x3 or 5x5; 6 1D convolutional layers but 60 2D convolutional layers worked well; positive instances up-weighted to account for class imbalance; L2 regularization on parameters; trained with stochastic gradient descent
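On the class-imbalance point, a hedged sketch of what up-weighting positives could look like (the specific weight is illustrative, not a number from the paper):

```python
import torch
import torch.nn.functional as F

def weighted_bce(logits, labels, pos_weight=5.0):
    """Binary cross-entropy with up-weighted positives.

    Contacts are a small minority of residue pairs; pos_weight=5.0 is an
    illustrative value, not the paper's.
    """
    weights = 1.0 + (pos_weight - 1.0) * labels  # pos_weight on contacts, 1 elsewhere
    return F.binary_cross_entropy_with_logits(logits, labels, weight=weights)
```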
An important difference from previous work is that the residue pair predictions are not made independently; they don't call it multitask learning, but that might be what they mean
They cite previous related work that uses neural networks; it would be worth reading those papers to contrast them with this residual network
Create their own training set of 6767 proteins from PDB that have < 25% sequence identity with all of the test proteins; split those into 7 training and validation sets
I didn't find any code available
Why include it in the review
The performance gains seem quite impressive; I wish I knew the domain better to put this in context, but some of their headlines are huge
They can correctly fold 224 of 579 proteins, while previous methods, MetaPSICOV (the CASP11 winner) and CCMpred, can do so for only 79 and 62, respectively
When evaluating the accuracy of the top L long-range contacts, where L is the sequence length, they are substantially better than the competitors
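A sketch of that evaluation metric (assuming the usual CASP convention that long-range means sequence separation >= 24):

```python
import numpy as np

def top_k_longrange_precision(pred, truth, k):
    """Fraction of the k highest-scoring long-range pairs that are true contacts.

    pred, truth: (L, L) matrices of predicted probabilities and 0/1 labels.
    Long-range is taken as sequence separation >= 24 (the usual CASP convention).
    """
    L = pred.shape[0]
    i, j = np.triu_indices(L, k=24)         # pairs with j - i >= 24
    top = np.argsort(pred[i, j])[::-1][:k]  # indices of the k best scores
    return truth[i[top], j[top]].mean()
```

Calling `top_k_longrange_precision(pred, truth, L)` gives the top-L accuracy; passing `L // 10` gives the L/10 numbers quoted in the abstract.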
They also outperform a template-based method in almost all cases, which is especially useful for proteins like membrane proteins that don't have many relevant templates available
There had been general discussion (#88) about whether deep learning has been transformative or incremental in biology, and this paper suggests the residual network provided huge boosts; if we agree with that interpretation, it would be worth studying what was special about this problem and design that worked so well (features + network architecture + training data) and if there are any broad conclusions to draw
This is the only bio paper I've read so far that uses a deep network
As in some other papers, they were limited by hardware; they were unable to assess more than 100 convolutional layers because their GPU had 12 GB of RAM, and they are working on distributing the algorithm across multiple GPUs
Published: http://doi.org/10.1371/journal.pcbi.1005324 Preprint: http://doi.org/10.1101/073239