There are a couple of references and a bit more context I would add to the De novo design section. What is the protocol for participation? Should I do this via a pull request?
Normally I would say yes, go ahead with a pull request, but we plan to submit in just a couple days and many authors have already approved the manuscript. Can you please share more about your plans before making the edits? We should keep the changes fairly minor at this late stage.
These are very minor edits: mainly typos, clarifying the scope of de novo design, and adding a few references. (Disclaimer: I am one of the authors of the Segler2017 paper that you have already cited.)
Here is what I would propose; I marked my changes as code:
Two emerging areas that we anticipate will be increasingly important in deep learning for drug discovery are de novo drug design and protein structure-based models.
Whereas the goal of virtual screening is to find active molecules by predicting the biochemical activity of given collections of hundreds of thousands to millions of compounds, _de novo_ drug design aims to directly _generate_ active compounds [@doi:10.1021/acs.jmedchem.5b01849].
Thus, de novo design explores, in principle without explicit enumeration, the much larger space of at least 10^60 organic molecules with drug-like properties that could be chemically synthesized [@doi:10.1002/wcms.1104].
Neural network models that learn to generate realistic, synthesizable molecules could provide large molecule sets for virtual screening or even create and refine focused molecules for de novo design. This problem is related to the generation of syntactically and semantically correct text.
As neural models that directly output (molecular) graphs remain under-explored, generative neural networks for drug design typically represent chemicals with the simplified molecular-input line-entry system (SMILES), a standard string-based representation with characters that represent atoms, bonds, and rings [@tag:Segler2017_drug_design]. Gómez-Bombarelli et al. designed a SMILES-to-SMILES autoencoder to learn a continuous latent feature space for chemicals [@tag:Gomezb2016_automatic]. TODO: connect to related EHR paper. In this learned continuous space it was possible to train some types of supervised learning algorithms and to interpolate between continuous representations of chemicals in a manner that is not possible with discrete (e.g. bit vector or string) features. Even more interesting is the prospect of performing gradient-based or Bayesian optimization of molecular properties within this latent space. A drawback is that not all SMILES strings produced by the autoencoder's decoder correspond to valid chemical structures. The Grammar Variational Autoencoder, which takes the SMILES grammar into account, has recently been proposed to address this issue [@arxiv:1703.01925].
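As an aside, the validity issue mentioned above is straightforward to check in practice. Here is a minimal, illustrative sketch assuming RDKit; the example strings are made-up placeholders, not actual decoder outputs:

```python
# Minimal sketch: filter decoded SMILES for chemical validity with RDKit.
# The example strings are illustrative placeholders, not real model outputs.
from rdkit import Chem

decoded = [
    "CCO",         # ethanol, valid
    "c1ccccc1O",   # phenol, valid
    "C1CC(C)C",    # unclosed ring -> invalid
]

valid = []
for smiles in decoded:
    mol = Chem.MolFromSmiles(smiles)  # returns None if the string cannot be parsed
    if mol is not None:
        valid.append(Chem.MolToSmiles(mol))  # keep the canonical form

print(f"{len(valid)}/{len(decoded)} generated strings are valid")
```

The validity percentages quoted below refer to a parse check along these lines, although the exact tooling used in each paper may differ.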
Another approach is to train character-based RNNs on large collections of molecules, for example from ChEMBL [@doi:10.1093/nar/gkr777], to first obtain a generic generative model for drug-like compounds [@tag:Segler2017_drug_design]. These generative models successfully learn the grammar of compound representations, with 94% [@tag:Olivecrona2017_drug_design] or nearly 98% [@tag:Segler2017_drug_design] of generated SMILES corresponding to valid and predominantly reasonable molecular structures. The initial RNN is then fine-tuned to generate molecules that are likely to be active against a specific target by either continuing training on a small set of positive examples [@tag:Segler2017_drug_design] or adopting reinforcement learning strategies [@tag:Olivecrona2017_drug_design] [@arxiv:1611.02796]. Both fine-tuning strategies could rediscover known, held-out active molecules.
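For readers less familiar with the approach, here is a bare-bones sketch of the character-level SMILES language model idea; this is not the architecture from either cited paper, and the vocabulary, data, and hyperparameters are placeholders:

```python
# Bare-bones character-level SMILES language model in PyTorch.
# Vocabulary, hyperparameters, and the single training example are placeholders;
# the cited papers train on large corpora such as ChEMBL with tuned architectures.
import torch
import torch.nn as nn

chars = ["^", "$", "C", "c", "O", "N", "1", "(", ")", "="]  # toy vocabulary; "^"/"$" mark start/end
stoi = {ch: i for i, ch in enumerate(chars)}

class SmilesRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        emb = self.embed(x)
        out, hidden = self.rnn(emb, hidden)
        return self.out(out), hidden  # logits over the next character

model = SmilesRNN(len(chars))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a single toy molecule (ethanol, "CCO").
seq = torch.tensor([[stoi[c] for c in "^CCO$"]])
logits, _ = model(seq[:, :-1])                    # predict each next character
loss = loss_fn(logits.reshape(-1, len(chars)), seq[:, 1:].reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Sampling then proceeds character by character from the start token until the end token is produced; fine-tuning on a small set of actives, or with a reinforcement-learning reward, biases the samples toward a particular target as described above.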
Thanks, I like those changes. To simplify the merge, can you please wait until my next pull request for this section? Then we can stack your changes on top of mine. I'm hoping to push it later today.
Sure!
Thanks for waiting @mrwns. I merged #438 so now you can edit the text as you suggested above and file a new pull request. For arXiv papers, please format the references as [@arxiv:1611.02796] (arxiv is case-sensitive and all lower case).
Alright, I'll get going now :)
https://arxiv.org/abs/1704.07555
I'm hoping to have time this week to add a new paragraph to Treat about drug design.