A variant of this network won a 2012 Kaggle competition sponsored by Merck. The task was to predict the activity of chemical compounds against 15 different targets. The interview with the winning team summarizes why this was exciting. Salakhutdinov, Hinton, and students wanted to make the point that deep learning could perform well in new domains without domain knowledge or feature engineering. (Whether or not we agree with that could be an interesting review topic.) This paper directly or indirectly inspired related work in virtual screening (e.g. #55).
Input and output similar to #55: molecular descriptors as input, activity for a particular target as output.
Can be applied to a variety of targets including cellular assays, not just individual proteins. In contrast, the structure-based method of #56 is only applicable for protein targets.
They do not use the Kaggle/Merck data here but rather a comparable formulation with 19 PubChem assays. Some of these assays are related (e.g. multiple assays for Sentrin-specific protease inhibitors). The Merck data are no longer available for research now that the competition has closed.
Computational aspects
Their literature review shows Bayesian neural networks had already been used in this domain since the late 90s but were computationally demanding.
Introduce multi-task networks for this problem (citing related work from 2006).
Earlier neural networks had few hidden layers and hidden units. They argue that with new (at the time) regularization techniques, it is possible and important to assess wide and deep networks.
They use about 4000 chemical descriptors from Dragon as input, which are then z-score normalized. They do not use the fingerprints used in #55 but mention that they could be helpful.
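The z-score normalization of descriptors can be sketched as follows (a minimal illustration with a small random matrix standing in for the ~4000 Dragon descriptors; sizes and values are made up):

```python
import numpy as np

# Hypothetical descriptor matrix: rows are compounds, columns are
# Dragon-style descriptors (the paper uses ~4000; 5 here for brevity).
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(100, 5))

# z-score normalization: subtract each descriptor's mean and divide
# by its standard deviation, so every column has mean 0 and std 1.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma
```

In practice the mean and standard deviation would be computed on the training set only and reused to transform held-out compounds.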
Perform multi-task binary classification but note this could be posed as a regression or ranking problem.
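The multi-task setup amounts to a shared representation with one sigmoid output per assay. A forward-pass sketch, assuming a single ReLU hidden layer and toy sizes (the actual architectures, layer counts, and widths are tuned per the paper; this is not their exact configuration):

```python
import numpy as np

rng = np.random.default_rng(1)

n_descriptors, n_hidden, n_tasks = 20, 8, 3  # toy sizes; paper uses ~4000 inputs

# Hidden layer shared across all assays (learned jointly in training)
W_hidden = rng.normal(scale=0.1, size=(n_descriptors, n_hidden))
b_hidden = np.zeros(n_hidden)

# One sigmoid output unit per assay (task)
W_out = rng.normal(scale=0.1, size=(n_hidden, n_tasks))
b_out = np.zeros(n_tasks)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    h = np.maximum(0.0, X @ W_hidden + b_hidden)  # shared ReLU hidden layer
    return sigmoid(h @ W_out + b_out)             # per-task active/inactive probability

probs = forward(rng.normal(size=(5, n_descriptors)))
print(probs.shape)  # (5, 3): one probability per compound per assay
```

Because the output heads are independent sigmoids rather than a softmax, the same network could instead emit raw scores for a regression or ranking formulation, as the authors note.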
Rigorous hyper-parameter selection with Bayesian optimization.
For 14 of 19 assays the test AUC is significantly better than that of baseline methods (logistic regression, random forest, and gradient boosted decision trees; though I don't see the logistic regression results).
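For reference, the AUC used in these comparisons is the probability that a randomly chosen active compound is ranked above a randomly chosen inactive one. A minimal sketch using the Mann-Whitney formulation (made-up labels and scores, not the paper's data):

```python
import numpy as np

def auc(y_true, scores):
    # AUC = fraction of (active, inactive) pairs where the active
    # compound gets the higher score; ties count half.
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.3, 0.2, 0.6, 0.7])
print(round(auc(y, scores), 3))  # 0.889: 8 of 9 pairs ranked correctly
```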
Compare multi-task performance versus combining data from related assays.
It is better to include all features than to perform feature selection in advance: once about half of the features are dropped, AUC falls sharply.
They test 1, 2, and 3 hidden layer networks but find the effect is inconsistent across assays. In contrast, on the Merck data > 1 hidden layer was important, perhaps because there were more compounds available for some assays.
http://arxiv.org/abs/1406.1231
Related to virtual screening #45.