Learning Structural Motif Representations For Efficient Protein Structure Search

https://doi.org/10.1101/137828

Understanding the relationship between protein structure and function is a fundamental problem in protein science. Given a protein of unknown function, fast identification of similar protein structures from the Protein Data Bank (PDB) is a critical step for inferring its biological function. Such structural neighbors can provide evolutionary insights into protein conformation, interfaces and binding sites that are not detectable from sequence similarity. However, the computational cost of performing pairwise structural alignment against all structures in the PDB is prohibitively expensive. Alignment-free approaches have been introduced to enable fast but coarse comparisons by representing each protein as a vector of structure features or fingerprints and only computing similarity between vectors. As a notable example, FragBag represents each protein by a “bag of fragments”, which is a vector of frequencies of contiguous short backbone fragments from a predetermined library. Here we present a new approach to learning effective structural motif representations using deep learning. We develop DeepFold, a deep convolutional neural network model to extract structural motif features of a protein structure. Similar to FragBag, DeepFold represents each protein structure or fold using a vector of learned structural motif features. We demonstrate that DeepFold substantially outperforms FragBag on protein structural search on a non-redundant protein structure database and a set of newly released structures. Remarkably, DeepFold not only extracts meaningful backbone segments but also finds important long-range interacting motifs for structural comparison. We expect that DeepFold will provide new insights into the evolution and hierarchical organization of protein structural motifs. The source code for generating DeepFold representation can be downloaded at https://github.com/largelymfs/DeepFold.
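For anyone skimming the thread, the alignment-free idea in the abstract can be sketched roughly as follows. This is only a toy illustration, not the authors' code: it assumes each backbone fragment has already been mapped to an index in a fragment library, and it assumes cosine similarity as the vector comparison (the abstract does not say which measure is used). The names `bag_of_fragments`, `cosine_similarity`, and `rank_neighbors` are hypothetical. As I read the abstract, DeepFold would swap the hand-built frequency vector for a learned one and keep the same vector-comparison search.

```python
import math
from collections import Counter


def bag_of_fragments(fragment_ids, library_size):
    """Count how often each library fragment occurs in one protein.

    fragment_ids: list of library indices, one per backbone fragment
    (assumed to be precomputed; the matching step is not shown here).
    """
    counts = Counter(fragment_ids)
    return [counts.get(i, 0) for i in range(library_size)]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_neighbors(query_vec, database):
    """Return database names sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in database.items()]
    # Sort by decreasing similarity; break ties by name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Toy data: a made-up library of 4 fragments and two made-up proteins.
LIBRARY_SIZE = 4
database = {
    "protein_a": bag_of_fragments([0, 0, 1, 2], LIBRARY_SIZE),
    "protein_b": bag_of_fragments([3, 3, 3, 2], LIBRARY_SIZE),
}
query = bag_of_fragments([0, 1, 1, 2], LIBRARY_SIZE)
ranking = rank_neighbors(query, database)
```

The point of the sketch is that search reduces to one cheap vector comparison per database entry, with no pairwise structural alignment.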

@j3xugit We don't discuss this task in the protein structure sub-section. Should we add this paper?

This paper studies protein structure homolog search, i.e., finding similar protein structures for a given query structure. Structure homolog search is almost a solved problem, but this manuscript provides a much more efficient solution. The section I have written mostly focuses on protein structure prediction, which is still an unsolved problem and also much more challenging than structure homolog search. So whether we should discuss this paper in the protein structure subsection really depends on what we want to cover.

In reply to Anthony Gitter's message of Mon, May 15, 2017 at 5:30 AM.

Professor Toyota Technological Institute at Chicago 6045 S. Kenwood Ave. Chicago, IL 60637 fax: 773 834 2557, Google Voice: 773 359 3721 http://ttic.uchicago.edu/~jinbo/

I didn't read it very closely. If you think it is a good example of deep learning making improvements on an important problem, even if those improvements are on the efficiency side instead of the performance side, we could include it. Early in the section, where we state we'll focus on secondary structure and contact maps, we could add a line noting that deep learning is improving other structure-related tasks such as homolog search (and maybe others?) and cite this. It's up to you.

If we want to talk about this paper, in the same spirit we will have to discuss a few others.

Another minor concern is that this paper has not been published yet. I am not sure if it has been accepted for publication or not.

In reply to Anthony Gitter's message of Mon, May 15, 2017 at 9:17 AM.

If it expands the scope too much, we can omit it. Because we are days away from submitting, this is not a good time for major restructuring. I prefer to tell a focused story about only the specific areas where deep learning really could make an impact or already has.

I'm not concerned about unpublished preprints. We need to read them carefully, but we've used them extensively. In this case, I'd argue that you are at least as qualified as whoever will ultimately review it if not more so.

greenelab / deep-review

Learning Structural Motif Representations For Efficient Protein Structure Search #445