greenelab / deep-review

A collaboratively written review paper on deep learning, genomics, and precision medicine
https://greenelab.github.io/deep-review/

Convolutional LSTM Networks for Subcellular Localization of Proteins #103

Open traversc opened 7 years ago

traversc commented 7 years ago

_Edit: DOI link https://doi.org/10.1007/978-3-319-21233-3_6_

Hi Dr. Greene et al.,

Here is another paper with an interesting premise: https://arxiv.org/pdf/1503.01919.pdf (apologies if it has already been mentioned).

The idea is that LSTMs can connect distant parts of a protein (or DNA) sequence because, unlike a plain RNN, an LSTM has long-term as well as short-term memory "channels". Now that I have learned more about the LSTM architecture, I think this makes a lot of sense, since protein folding or 3D structure may bring distant parts of a gene/protein sequence together. LSTMs may be able to connect these distant parts in a way that a CNN alone could not.
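To make the two memory "channels" concrete, here is a minimal sketch of a single scalar LSTM step in plain Python. It is not the architecture from the paper; the function names (`lstm_step`, `run_sequence`) and the hand-picked gate weights are assumptions chosen only to show how the cell state `c` can carry a value across many positions while the hidden state `h` is recomputed at every step.

```python
import math


def sigmoid(x):
    """Logistic function used by the LSTM gates."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell.

    w maps a gate name to a tuple (w_x, w_h, b). These weights are
    hypothetical illustration values, not trained parameters.
    """
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much new information to write
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # candidate value to write

    c = f * c_prev + i * g           # long-term "channel" (cell state)
    h = o * math.tanh(c)             # short-term "channel" (hidden state)
    return h, c


def run_sequence(xs, w, h0=0.0, c0=0.0):
    """Feed a sequence of scalar inputs through the cell, one position at a time."""
    h, c = h0, c0
    for x in xs:
        h, c = lstm_step(x, h, c, w)
    return h, c


# Assumed weights: forget gate nearly open, input gate nearly closed,
# so whatever is already stored in c persists over a long stretch.
REMEMBER = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 10.0),
    "candidate": (1.0, 0.0, 0.0),
}

if __name__ == "__main__":
    h, c = run_sequence([0.0] * 100, REMEMBER, h0=0.0, c0=1.0)
    print("cell state after 100 steps:", c)
```

With the forget gate close to 1 and the input gate close to 0, the stored value decays only slightly over 100 positions, which is the property that would let such a model relate residues that are far apart in the sequence.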

Although CNNs and DAs were mentioned in the "Overall manuscript structure", I think it's important to write about RNNs/LSTMs as well, since they also seem to be a good "fit" for sequence data.

On an unrelated note, I came across the "Neural Network Zoo", a blog post that succinctly summarizes many neural network architectures (along with some more classical ML algorithms): http://www.asimovinstitute.org/neural-network-zoo/

It reminds me of the phrase "endless forms most beautiful" from Sean B. Carroll's book on evo-devo.

agitter commented 7 years ago

Thanks, I edited the first post to add the DOI

agitter commented 7 years ago

@traversc Does this paper compare to any other baseline approaches? How do they assess the performance of the LSTM model?

traversc commented 7 years ago

The paper compared 5 models:

To evaluate model performance, they hold out 20% of the data as a test set (the dataset has 6000 protein sequences and 11 possible protein compartments, such as the E.R. and Golgi) and calculate accuracy. Their results are summarized in Table 1:

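| Input | Method | Accuracy |
| --- | --- | --- |
| Sequence only | R-LSTM | 0.879 |
| Sequence only | A-LSTM | 0.854 |
| Sequence only | R-LSTM ensemble | 0.902 |
| Sequence only | MultiLoc | 0.767 |
| Sequence + metadata | MultiLoc + PhyloLoc | 0.842 |
| Sequence + metadata | MultiLoc + PhyloLoc + GOLoc | 0.871 |
| Sequence + metadata | MultiLoc2 | 0.887 |
| Sequence + metadata | SherLoc2 | 0.930 |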
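
For readers who want to see what this kind of evaluation involves, here is a rough sketch of a 20% hold-out split followed by an accuracy calculation. It is not the authors' code: `holdout_split` and `accuracy` are hypothetical helpers, the random shuffle and fixed seed are assumptions, and the paper may have split the data differently (for example with stratification or homology reduction).

```python
import random


def holdout_split(items, test_fraction=0.2, seed=0):
    """Shuffle items and split them into (train, test).

    The shuffle and the fixed seed are assumptions made for
    reproducibility; the paper's exact procedure is not stated here.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels exactly."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


if __name__ == "__main__":
    # 6000 placeholder sequence IDs, matching the dataset size mentioned above.
    train_ids, test_ids = holdout_split(range(6000))
    print(len(train_ids), len(test_ids))

    # Toy labels only, to show the calculation; not results from the paper.
    print(accuracy(["ER", "Golgi", "Nucleus", "ER"],
                   ["ER", "Golgi", "Cytoplasm", "ER"]))
```

Note that with 11 compartments, plain accuracy weights every test sequence equally, so frequent compartments dominate the score.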