Is Multitask Deep Learning Practical for Pharma?

alxndrkalinin commented 7 years ago

https://doi.org/10.1021/acs.jcim.7b00146

> Multitask deep learning has emerged as a powerful tool for computational drug discovery. However, despite a number of preliminary studies, multitask deep networks have yet to be widely deployed in the pharmaceutical and biotech industries. This lack of acceptance stems from both software difficulties and from lack of understanding of the robustness of multitask deep networks. Our work aims to resolve both of these barriers to adoption. We introduce a high-quality open-source implementation of multitask deep networks as part of the DeepChem open-source platform. Our implementation enables simple python scripts to construct, fit, and evaluate sophisticated deep models. We use our implementation to analyze the performance of multitask deep networks and related deep models on four collections of pharmaceutical data (three of which have not previously been analyzed in the literature). We split these datasets into train/valid/test using time and neighbor-split to test multitask deep-learning performance under challenging conditions. Our results demonstrate that multitask deep networks are surprisingly robust and can offer strong improvement over random forests. Our analysis and open-source implementation in DeepChem provide an argument that multitask deep-networks are ready for widespread use in commercial drug discovery.

@rbharath's take on testing multitask deep learning performance in DeepChem.

agitter commented 7 years ago

I'm still working through the main results, but this immediately caught my eye:

> To encourage adoption of multitask deep-learning methods, we open source all modeling code and datasets for the Kaggle, Factors, Kinase, and UV dataset collections as part of the DeepChem example suite. We hope that this example code and data will facilitate broader adoption of multitask deep-learning techniques for commercial drug discovery.

That's in reference to four datasets from the Merck authors. In our data sharing discussion, we wrote:

> Private companies may establish a competitive advantage by releasing data sufficient for improved methods to be developed.

This would be a perfect example of that! (cc @cgreene)

cgreene commented 7 years ago

Nice! I think the term in the industry is "pre-competitive". It'll be very nice to have something to note there, as opposed to me saying something random into the breeze. :smile:

agitter commented 7 years ago

I don't think the performance here requires changing the tone of our Ligand-based prediction of bioactivity section. They introduce two multitask architectures -- progressive and bypass -- that are new to drug discovery. Those are benchmarked along with singletask fully connected networks, multitask networks, and random forest on four Merck datasets. They assess performance on the continuous labels using R^2.
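For readers comparing numbers across papers, here is a minimal sketch of the R^2 metric mentioned above. "R^2" can mean either the coefficient of determination or the squared Pearson correlation, and this thread does not say which one the paper reports, so both are shown. The function names are hypothetical and are not taken from DeepChem.

```python
def r2_determination(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (can be negative)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def r2_pearson(y_true, y_pred):
    """Squared Pearson correlation between labels and predictions (in [0, 1])."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    var_true = sum((t - mean_true) ** 2 for t in y_true)
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    return cov * cov / (var_true * var_pred)
```

The two definitions agree for perfect predictions but can diverge sharply otherwise: predictions that are perfectly correlated with the labels but off by a scale factor (say, labels `[1, 2, 3, 4]` and predictions `[2, 4, 6, 8]`) score 1.0 under the Pearson form and a negative value under the coefficient of determination. That is worth keeping in mind when comparing against other benchmarks.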

In general, neural networks outperform random forest, though not on every task, and multitask networks outperform singletask networks, though not always. They also look at task-task similarity in the relevant datasets, which influences the success of multitask models.

As I noted above, I think the collaboration with Merck is one of the more exciting components. Three of the four Merck datasets (all but the Kaggle dataset) are going to be released publicly for the first time via their DeepChem repository. That will open them to others for further benchmarking.

mrwns commented 7 years ago

@agitter I agree!

These new architectures were also briefly mentioned in the MoleculeNet paper, https://arxiv.org/pdf/1703.00564.pdf (which we already cite). The MoleculeNet paper refers to this paper as "Manuscript in preparation". We should definitely cite the current paper too.

agitter commented 6 years ago

We could discuss this more extensively in the drug discovery section, but we do cite it now in the Discussion so I'm closing the issue.

greenelab / deep-review

Is Multitask Deep Learning Practical for Pharma? #576