greenelab / deep-review

A collaboratively written review paper on deep learning, genomics, and precision medicine
https://greenelab.github.io/deep-review/

Low Data Drug Discovery with One-shot Learning #141

Open agitter opened 7 years ago

agitter commented 7 years ago

https://doi.org/10.1021/acscentsci.6b00367 (preprint https://arxiv.org/abs/1611.03199)

Recent advances in machine learning have made significant contributions to drug discovery. Deep neural networks in particular have been demonstrated to provide significant boosts in predictive power when inferring the properties and activities of small-molecule compounds. However, the applicability of these techniques has been limited by the requirement for large amounts of training data. In this work, we demonstrate how one-shot learning can be used to significantly lower the amounts of data required to make meaningful predictions in drug discovery applications. We introduce a new architecture, the residual LSTM embedding, that, when combined with graph convolutional neural networks, significantly improves the ability to learn meaningful distance metrics over small-molecules. We open source all models introduced in this work as part of DeepChem, an open-source framework for deep-learning in drug discovery.

This looks very exciting, in part because of their open source software https://github.com/deepchem/deepchem

XieConnect commented 7 years ago

Right. I was just about to post this one earlier. It seems also relevant to the "wide data, few samples" issue, as it reduces sample size requirement.

agitter commented 7 years ago

This paper can be featured in the Treat section as well as the data limitations, code sharing, and transfer learning sub-sections of the Discussion.

Many of the virtual screening methods (see #45) require large training datasets with thousands or millions of instances, where an instance is a chemical and its activity in an assay of interest. In practice, a typical chemical screen may have substantially less training data to work with. The authors propose one-shot learning to overcome the sparse training data, taking advantage of side information in the form of prior screening data for other assays.

The main idea is that the network will use the related screens to learn a mapping from chemical compounds (often featurized with discrete features or a molecular graph) into a continuous space and a similarity measure between chemicals in the continuous space. Nearest neighbor-like approaches in that continuous space can then be applied to make predictions for a new assay (aka task) with very limited task-specific training examples. Though the methods differ substantially, at a high level the mapping to a continuous space reminds me of the unsupervised #104.
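As a rough sketch of that prediction step (not the paper's implementation; the toy embeddings, the cosine similarity, and the softmax weighting below are illustrative assumptions in the style of matching networks):

```python
import numpy as np

def predict_from_support(query_emb, support_embs, support_labels):
    """Softmax-weighted nearest-neighbor prediction in the learned
    embedding space: the query borrows labels from similar support
    compounds."""
    # cosine similarity between the query and each support compound
    sims = support_embs @ query_emb / (
        np.linalg.norm(support_embs, axis=1) * np.linalg.norm(query_emb))
    weights = np.exp(sims) / np.exp(sims).sum()  # attention weights
    # predicted probability of the positive class
    return float(weights @ support_labels)

# toy 3-D embeddings for a 4-compound support set (2 active, 2 inactive)
support = np.array([[1.0, 0.1, 0.0],
                    [0.9, 0.2, 0.1],
                    [0.0, 1.0, 0.2],
                    [0.1, 0.9, 0.1]])
labels = np.array([1.0, 1.0, 0.0, 0.0])
query = np.array([0.95, 0.15, 0.05])  # resembles the active compounds
p_active = predict_from_support(query, support, labels)
print(p_active)  # > 0.5, i.e. predicted active
```

In the paper the embeddings come from a trained graph-convolutional network rather than being fixed vectors, but the inference step is this kind of similarity-weighted vote over the support set.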

Getting more technical, they compare a Siamese network and two LSTM-based approaches for executing this strategy. All of them build upon graph convolutions related to their earlier #53. The Dual Residual LSTM has the best performance of the three.
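For intuition on the Siamese variant: both compounds pass through one shared embedding function, and the network scores their distance. A minimal sketch, with a random linear-plus-tanh map standing in for the graph-convolutional encoder (the fingerprint size, embedding size, and cosine score are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the graph-convolutional encoder:
# one shared weight matrix maps an 8-bit fingerprint to a 3-D embedding.
W = rng.normal(size=(3, 8))

def embed(fp):
    return np.tanh(W @ fp)

def siamese_similarity(fp_a, fp_b):
    # Both branches use the same `embed`; the tied weights are what
    # make the architecture "Siamese".
    a, b = embed(fp_a), embed(fp_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

fp1 = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)
fp2 = np.array([0, 1, 0, 0, 1, 1, 0, 1], dtype=float)
print(siamese_similarity(fp1, fp1))  # identical inputs score 1.0
print(siamese_similarity(fp1, fp2))  # different inputs score lower
```

The LSTM-based variants in the paper refine these embeddings iteratively conditioned on the whole support set, rather than embedding each compound independently as the Siamese network does.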

They evaluate the model in a very challenging setting where at most 10 positive and 10 negative instances are provided for the assay of interest, along with the side information for related assays. The datasets include Tox21 and MUV, which have been used previously, along with the SIDER dataset on drug side effects, which is especially relevant for the Treat discussion. With so few training samples, a random forest baseline has very little predictive power except on MUV (which has special structure). The residual LSTM is able to train reasonably well on limited data.
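The evaluation protocol can be sketched as repeated "episodes" that hide all but 10 actives and 10 inactives from the model. This is a schematic of the setting described above, not the authors' benchmark code; the compound names and counts are made up:

```python
import random

def sample_episode(actives, inactives, n_pos=10, n_neg=10, seed=None):
    """Build one few-shot episode: a 10+10 support set drawn from the
    assay of interest, with all remaining compounds held out as queries."""
    rng = random.Random(seed)
    pos = rng.sample(actives, n_pos)
    neg = rng.sample(inactives, n_neg)
    support = [(c, 1) for c in pos] + [(c, 0) for c in neg]
    queries = ([(c, 1) for c in actives if c not in pos] +
               [(c, 0) for c in inactives if c not in neg])
    return support, queries

# hypothetical assay with 50 actives and 500 inactives
actives = [f"active_{i}" for i in range(50)]
inactives = [f"inactive_{i}" for i in range(500)]
support, queries = sample_episode(actives, inactives, seed=0)
print(len(support), len(queries))  # 20 530
```

Averaging a metric such as ROC AUC over many such episodes gives a stable estimate of few-shot performance despite the tiny per-episode support set.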

One-shot learning will be a great contrast to multitask methods in this domain. They show that training on Tox21 tasks and evaluating on SIDER does not work well, so there is a limit to the transferability.

The code (https://github.com/deepchem/deepchem) adds a lot of value and provides public data with which to test their models and reproduce their results.

It will be very interesting to see what happens in the "intermediate" data range where one has more than 20 assay-specific examples but not tens of thousands. They are likely working on this (#148) but it is not featured here.

agitter commented 7 years ago

Updated with the DOI of the published version. Also adding this link to the accompanying press release http://news.stanford.edu/press-releases/2017/04/03/deep-learning-aldrug-development/

agitter commented 7 years ago

A nice perspective article on this work http://doi.org/10.1021/acscentsci.7b00153