greenelab / deep-review

A collaboratively written review paper on deep learning, genomics, and precision medicine
https://greenelab.github.io/deep-review/

Phylogenetic reconstruction using deep learning and other ideas #94

Closed traversc closed 6 years ago

traversc commented 8 years ago

Hi all, my name is Travers Ching from Dr. Lana Garmire's lab. I have been following your discussions but have been a bit shy to contribute. I found many of the discussions interesting, and wanted to share some additional ideas and discussion.

A paper titled "A new approach to the automatic identification of organism evolution using neural networks" (http://www.sciencedirect.com/science/article/pii/S0303264716300223) discusses automated phylogenetic reconstruction using neural networks. In the paper, the authors use a three-layer MLP. Although I think the paper doesn't quite hit the mark, it is a really interesting idea that I believe fits the "deep learning" framework very well.

An obvious question from this paper is, compared to standard phylogenetic methods, can deep learning approaches more efficiently and more accurately reconstruct phylogenies (such as on simulated datasets)? I think the answer could be yes.

Secondly, I'm wondering if there is space in the review to discuss the various methods and techniques themselves?

For example, in my experience, the choice of optimizer can lead to significantly different results, even after many iterations of optimization (e.g.: https://s14.postimg.org/nqb7flu9d/optimizer.png). Clearly, standard stochastic gradient descent is not optimal, but momentum/Nesterov methods seem to follow a different philosophy than, e.g., RMSprop/Adadelta. Who's to say which optimizer is best?
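
To make the difference concrete, here is a minimal NumPy sketch (my own toy example, not from any of the papers above) of the three update rules on an ill-conditioned quadratic; the optimizer names and hyperparameters follow common conventions, but the learning rates are arbitrary choices for illustration:

```python
import numpy as np

def loss_grad(w):
    # Gradient of the ill-conditioned quadratic f(w) = 0.5 * (w1^2 + 50 * w2^2).
    return np.array([1.0, 50.0]) * w

def run(update, steps=200):
    # Apply one optimizer's update rule repeatedly from a fixed start point.
    w, state = np.array([1.0, 1.0]), {}
    for _ in range(steps):
        w = update(w, loss_grad(w), state)
    return w

def sgd(w, g, state, lr=0.01):
    # Plain stochastic gradient descent: step against the raw gradient.
    return w - lr * g

def momentum(w, g, state, lr=0.01, mu=0.9):
    # Classical momentum: accumulate a velocity that smooths the trajectory.
    v = mu * state.get("v", np.zeros_like(w)) - lr * g
    state["v"] = v
    return w + v

def rmsprop(w, g, state, lr=0.01, rho=0.9, eps=1e-8):
    # RMSprop: scale each coordinate by a running RMS of its gradients.
    s = rho * state.get("s", np.zeros_like(w)) + (1 - rho) * g ** 2
    state["s"] = s
    return w - lr * g / (np.sqrt(s) + eps)

for name, opt in [("sgd", sgd), ("momentum", momentum), ("rmsprop", rmsprop)]:
    print(name, np.linalg.norm(run(opt)))
```

On this particular loss surface momentum reaches the optimum far faster than plain SGD, while RMSprop equalizes the per-coordinate step sizes; which behavior is preferable depends on the problem, which is exactly the point.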

The type of regularization used (dropout, early stopping, weight decay) can also clearly affect performance (e.g.: http://machinelearning.wustl.edu/mlpapers/paper_files/icml2013_wan13.pdf), but I don't think there is any consensus on when to use each type. Also, because these are all free optimization parameters (including the architecture of the neural network itself), they can lead to dangerous p-hacking; something to look out for. There are so many things that can be tuned in a neural network.
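
As a rough sketch of what two of these regularizers actually do (a standard inverted-dropout formulation and an L2 weight-decay term folded into one SGD step; the function names and hyperparameter values here are illustrative, not from any cited paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p=0.5, train=True):
    # Inverted dropout: zero each activation with probability p at train
    # time and rescale survivors by 1/(1-p), so the expected activation
    # matches test time; at test time it is a no-op.
    if not train:
        return a
    mask = (rng.random(a.shape) >= p) / (1 - p)
    return a * mask

def sgd_step_weight_decay(w, grad, lr=0.1, wd=1e-2):
    # L2 weight decay adds wd * w to the gradient, shrinking weights
    # toward zero on every update.
    return w - lr * (grad + wd * w)

a = np.ones((4, 3))
print(dropout(a))               # some activations zeroed, survivors scaled by 2
print(dropout(a, train=False))  # identity at test time
```

Early stopping, by contrast, is not an update rule at all but a halting criterion on a held-out validation loss, which is part of why the three are hard to compare on equal footing.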

Lastly, I feel it is important to motivate an intuition about which techniques are likely to work on which types of biological data. For example, my impression is that convolutional neural networks will work very well when the input is nucleotide sequence data, since the data is positional and strongly locally connected. An example is the DANN paper that was discussed earlier (http://www.ncbi.nlm.nih.gov/pubmed/25338716). However, is this match between CNNs and nucleotide data something we can prove, or at least have a reference to cite? My experience with neural networks is that simply adding additional layers to an MLP does not improve generalization performance. In fact, it has been shown empirically that adding too many layers can decrease performance, even with ample computational power (http://arxiv.org/pdf/1512.03385.pdf). I believe the promise of deep learning lies in specialized architectures such as convolutional or recurrent nets.

cgreene commented 8 years ago

Hi @traversc and thanks for jumping in! I have a couple of feature requests for the issue that you filed. When you're referring to a paper that has an issue, like DANN, can you use its issue number (#5) so that anyone looking at that issue will see the conversation here? When referring to a paper that doesn't yet have an issue, can you quickly create one (just pasting the title into the issue name and the DOI into the body is acceptable, though if you can throw an abstract in, that's even better)? This will help us track which papers have been discussed when we get around to writing the paper.

On the topic of phylogenetic reconstruction, I'm happy to include it - particularly where you see very exciting opportunities - if it fits into the guiding question. To me it seems most likely to fit into the "study disease" area. There is not yet an issue for that section. I think @agitter may create one soon - or you could, if you want to help organize and summarize the discussions on that topic.

agitter commented 8 years ago

My preference would be to briefly bring up the important issues about optimizers, regularization, and using neural networks in practice but then refer to other strong guides or reviews to address these. For example, the Deep Learning textbook has chapters on regularization and optimization.

I'm not sure what you mean by proving the match between CNN and sequence data. The review #47 makes this point though. In general, I think that drawing connections between certain types of biological data and neural network architectures that are particularly well-suited for them is a good topic.

We can use #97 to discuss the study of disease and more basic biological applications.

cgreene commented 8 years ago

I agree with @agitter: diving into issues that are tightly coupled to practice is probably best left to other strong guides or reviews.