Convolutional LSTM Networks for Subcellular Localization of Proteins

traversc commented 7 years ago

_Edit: DOI link https://doi.org/10.1007/978-3-319-21233-3_6_

Hi Dr. Greene et al.,

Here is another paper that I found to have a quite interesting premise: https://arxiv.org/pdf/1503.01919.pdf (apologies if this paper was already mentioned)

The idea is that LSTMs would be able to connect distant parts of a protein (or DNA) sequence because, unlike a plain RNN, an LSTM has long-term as well as short-term memory "channels" (the cell state and the hidden state). Now that I have learned more about the LSTM architecture, I think this makes a lot of sense, since protein folding or 3D structure may bring distant parts of a gene/protein sequence together. LSTMs may be able to connect these distant parts in a way that a CNN alone would not be able to.
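To make the "long-term channel" idea concrete, here is a toy sketch of a single-unit LSTM step in plain Python. It is not the architecture from the paper: the weights are hand-picked rather than trained, and the names (`lstm_step`, `run`, `WEIGHTS`) are made up for illustration. It only shows that a forget gate close to 1 lets the cell state carry a signal from the first position across 200 later positions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a single-unit LSTM cell.

    w maps a gate name to a tuple (weight on x, weight on h_prev, bias).
    """
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))        # how much new information to write
    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # candidate value to write
    c = f * c_prev + i * g           # long-term "channel" (cell state)
    h = o * math.tanh(c)             # short-term "channel" (hidden state)
    return h, c


def run(seq, w):
    h, c = 0.0, 0.0
    for x in seq:
        h, c = lstm_step(x, h, c, w)
    return h, c


# Hand-picked (not trained) weights, chosen only for this illustration.
WEIGHTS = {
    "input": (10.0, 0.0, -5.0),     # gate opens only when x is near 1
    "forget": (0.0, 0.0, 10.0),     # close to 1: keep the cell state
    "output": (0.0, 0.0, 10.0),     # close to 1: always expose the state
    "candidate": (10.0, 0.0, 0.0),  # write a positive value when x is near 1
}

with_signal = [1.0] + [0.0] * 200  # signal only at the first position
without_signal = [0.0] * 201

h_with, _ = run(with_signal, WEIGHTS)
h_without, _ = run(without_signal, WEIGHTS)
print(h_with, h_without)
```

The final hidden state stays well above zero for the sequence with the early signal and is exactly zero without it, even though 200 steps separate the signal from the output.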

Although CNNs and DAs were mentioned in the "Overall manuscript structure", I think it's important to write about RNNs/LSTMs since they also seem to be a good "fit" for sequence data.

On an unrelated note, I saw this interesting blog/website that succinctly summarizes many forms of architecture in the "Neural Network Zoo" (it also summarizes some more classical ML algorithms). http://www.asimovinstitute.org/neural-network-zoo/

It reminds me of the phrase "endless forms most beautiful" from Sean B. Carroll's book on evo/devo.

agitter commented 7 years ago

Thanks, I edited the first post to add the DOI.

agitter commented 7 years ago

@traversc Does this paper compare to any other baseline approaches? How do they assess the performance of the LSTM model?

traversc commented 7 years ago

The paper compared 5 models:

- R-LSTM - a "regular" LSTM described in their paper
- A-LSTM - the LSTM model with an "attention mechanism"; they use this approach to determine which parts of the sequence are important, not necessarily to obtain the best performance
- R-LSTM ensemble - an ensemble model of 10 LSTMs. The paper didn't specify how the individual models differed
- MultiLoc and SherLoc - ensemble SVM-based methods from another group (however, some versions incorporate additional information such as Gene Ontology, rather than just the sequence)
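Since the paper doesn't say how the 10 LSTMs were combined, the following is only one common way to build such an ensemble: average the class probabilities from each model and pick the highest. The function names, class names, and probability values are all hypothetical.

```python
def average_probabilities(per_model_probs):
    """Average class-probability vectors from several models for one protein."""
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [
        sum(probs[k] for probs in per_model_probs) / n_models
        for k in range(n_classes)
    ]


def predict(per_model_probs, class_names):
    """Return the class with the highest averaged probability."""
    avg = average_probabilities(per_model_probs)
    best = max(range(len(avg)), key=avg.__getitem__)
    return class_names[best]


# Hypothetical compartments and model outputs, for illustration only.
CLASSES = ["nucleus", "ER", "Golgi"]
model_outputs = [
    [0.5, 0.4, 0.1],  # this model alone would say "nucleus"
    [0.2, 0.7, 0.1],
    [0.3, 0.6, 0.1],
]

avg = average_probabilities(model_outputs)
prediction = predict(model_outputs, CLASSES)
print(prediction)
```

Here one model disagrees with the other two, and averaging lets the majority view win.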

To evaluate model performance, they held out 20% of the data as a test set (the dataset has 6000 protein sequences and 11 possible protein compartments: ER, Golgi, etc.) and calculated accuracy. Their results are summarized in Table 1:

| Method | Accuracy |
| --- | --- |
| `Sequence only` | |
| R-LSTM | 0.879 |
| A-LSTM | 0.854 |
| R-LSTM ensemble | 0.902 |
| MultiLoc | 0.767 |
| `Sequence + metadata` | |
| MultiLoc + PhyloLoc | 0.842 |
| MultiLoc + PhyloLoc + GOLoc | 0.871 |
| MultiLoc2 | 0.887 |
| SherLoc2 | 0.930 |
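For anyone unfamiliar with the evaluation setup, the hold-out procedure described above can be sketched roughly like this. It is a generic illustration, not the authors' code: the integer IDs stand in for the 6000 sequences, the labels in the accuracy example are made up, and I'm assuming a simple random split (the paper's exact splitting procedure isn't covered in this thread).

```python
import random


def holdout_split(items, test_fraction=0.2, seed=0):
    """Shuffle the items and set aside a fraction of them as a test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Placeholder IDs standing in for the 6000 protein sequences.
protein_ids = list(range(6000))
train_ids, test_ids = holdout_split(protein_ids)
print(len(train_ids), len(test_ids))

# Made-up labels, just to show how accuracy is computed.
example_accuracy = accuracy(
    ["ER", "Golgi", "ER", "nucleus"],
    ["ER", "ER", "ER", "nucleus"],
)
print(example_accuracy)
```

With 6000 sequences and a 20% test fraction, this leaves 4800 for training and 1200 for testing, and each accuracy in the table is the fraction of test proteins assigned to the correct compartment.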

greenelab / deep-review #103