
# Bidirectional Molecule Generation with Recurrent Neural Networks

Please note that up-to-date versions of this code are found here.

This is the supporting code for: Grisoni F., Moret M., Lingwood R., Schneider G., "Bidirectional Molecule Generation with Recurrent Neural Networks". Journal of Chemical Information and Modeling (2020). Available here.

You can use this repository for the generation of SMILES with bidirectional recurrent neural networks (RNNs). In addition to the methods' code, several pre-trained models for each approach are included.

The following methods are implemented: ForwardRNN, FBRNN, BIMODAL, and NADE (see the publication for details on each approach).

## Table of Contents

  1. Prerequisites
  2. Using the Code
    1. Sampling from a pre-trained model
    2. Training a model on your data
    3. Fine-tuning a model on your data
  3. Authors
  4. License
  5. How to cite

## Prerequisites

This repository can be cloned with the following command:

```bash
git clone https://github.com/ETHmodlab/BIMODAL
```

To install the necessary packages to run the code, we recommend using conda. Once conda is installed, you can create the virtual environment:

```bash
cd path/to/repository/
conda env create -f brnn.yml
```

To activate the dedicated environment:

```bash
conda activate brnn
```

Your code should now be ready to use!

## Using the Code

### Sampling from a pre-trained model

In this repository, we provide 22 pre-trained models you can use for sampling (stored in evaluation/). These models were trained for 10 epochs on a set of 271,914 bioactive molecules from ChEMBL22 (Kd/Ki/IC50/EC50 < 1 μM).

To sample SMILES, you can create a new file in model/ and use the Sampler class. For example, to sample from the pre-trained BIMODAL model with 512 units:

```python
from sample import Sampler

experiment_name = 'BIMODAL_fixed_512'
s = Sampler(experiment_name)
s.sample(N=100, stor_dir='../evaluation', T=0.7, fold=[1], epoch=[9],
         valid=True, novel=True, unique=True, write_csv=True)
```
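
For example, you could save this snippet as a new file in model/ (e.g. model/my_sampler.py, a file name chosen here only for illustration) and run it from within model/, so that the relative stor_dir path resolves to the repository's evaluation/ folder.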

Parameters:

* *experiment_name*: Name of the parameter file (.ini) of the pre-trained model
* *N*: Number of SMILES to sample
* *stor_dir*: Directory where the sampled SMILES are stored
* *T*: Sampling temperature
* *fold*: Fold(s) of the trained model to sample from
* *epoch*: Epoch(s) of the trained model to sample from
* *valid*: If True, only valid SMILES are kept
* *novel*: If True, only SMILES not contained in the training data are kept
* *unique*: If True, duplicate SMILES are removed
* *write_csv*: If True, the sampled SMILES are written to a .csv file

Notes:

* *fold* and *epoch* have to correspond to a model stored for the given *experiment_name* (for the provided pre-trained models, fold=[1] and epoch=[9], as in the example above).

### Training a New Model

Alternatively, if you want to pre-train a model on your own data, you will need to carry out three steps: (i) data preprocessing, (ii) training, and (iii) evaluation. Please be aware that you will need access to a GPU to pre-train your own model, as this is a computationally intensive step.

### Preprocessing

Data can be processed by using preprocessing/main_preprocessor.py:

```python
from main_preprocessor import preprocess_data

preprocess_data(filename_in='../data/chembl_smiles', model_type='BIMODAL',
                starting_point='fixed', augmentation=1)
```

Parameters:

* *filename_in*: Name (and path) of the file containing the SMILES to be processed
* *model_type*: Model the data is prepared for (ForwardRNN, FBRNN, BIMODAL, NADE)
* *starting_point*: Type of starting point (fixed, random)
* *augmentation*: Data augmentation factor (number of SMILES representations generated per molecule)

Notes:

* The preprocessed data file is stored in data/ and is the file to reference in the *data* field of the training parameter file (see below).
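
If you want to preprocess your own compounds, the sketch below shows one possible way to prepare an input file. It assumes that the preprocessor accepts a plain text file with one SMILES string per line and that the file name (here my_smiles, a hypothetical name) is passed without its extension, as in the example above; check preprocessing/main_preprocessor.py for the exact format it expects.

```python
from main_preprocessor import preprocess_data

# Two example molecules (caffeine and aspirin); replace with your own SMILES.
my_smiles = [
    'CN1C=NC2=C1C(=O)N(C)C(=O)N2C',
    'CC(=O)OC1=CC=CC=C1C(=O)O',
]

# Write one SMILES per line into data/ (assumed input format).
with open('../data/my_smiles.csv', 'w') as f:
    f.write('\n'.join(my_smiles) + '\n')

# Preprocess for the BIMODAL model with a fixed starting point.
preprocess_data(filename_in='../data/my_smiles', model_type='BIMODAL',
                starting_point='fixed', augmentation=1)
```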

### Training

Training requires a parameter file (.ini) with a given set of parameters. You can find examples for all models in experiments/, and further details about the parameters below:

| Section | Parameter | Description | Comments |
| --- | --- | --- | --- |
| Model | model | Type | ForwardRNN, FBRNN, BIMODAL, NADE |
| | hidden_units | Number of hidden units | Suggested value: 256 for ForwardRNN, FBRNN and NADE; 128 for BIMODAL |
| | generation | To be defined only for NADE (other models defined through preprocessing) | fixed, random |
| Data | data | Name of data file | Has to be located in data/ |
| | encoding_size | Number of different SMILES tokens | 55 |
| | molecular_size | Length of string with padding | See preprocessing |
| | missing_token | To add in the parameter file only for NADE | M |
| Training | epochs | Number of epochs | Suggested value: 10 |
| | learning_rate | Learning rate | Suggested value: 0.001 |
| | n_folds | Folds in cross-validation | More than 1 for cross-validation, 1 to use only one fold of the data for validation (see below) |
| | batch_size | Batch size | Suggested value: 128 |
| Evaluation | samples | Number of generated SMILES after each epoch | |
| | temp | Sampling temperature | Suggested value: 0.7 |
| | starting_token | Starting token for sampling | G for all models except NADE, which requires a sequence consisting of missing values (see publication) |
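
As an illustration, a training parameter file following the table above might look roughly like the sketch below. The exact section and key spelling, as well as values that depend on your data (e.g. molecular_size, which is determined during preprocessing), should be taken from the template files in experiments/.

```ini
; Illustrative values only; copy a template from experiments/ for real runs.
[MODEL]
model = BIMODAL
hidden_units = 128

[DATA]
data = chembl_smiles
encoding_size = 55
molecular_size = 152

[TRAINING]
epochs = 10
learning_rate = 0.001
n_folds = 5
batch_size = 128

[EVALUATION]
samples = 100
temp = 0.7
starting_token = G
```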

Note:

Options for training:

- Cross-validation: each of the *n_folds* folds is used once for validation

```python
from trainer import Trainer

t = Trainer(experiment_name='BIMODAL_fixed_512')
t.cross_validation(stor_dir='../evaluation/', restart=False)
```

- Single run: 1/*n_folds* of the data used for validation

```python
from trainer import Trainer

t = Trainer(experiment_name='BIMODAL_fixed_512')
t.single_run(stor_dir='../evaluation/', restart=False)
```


Parameters:   
* *experiment_name*: Name of the parameter file (.ini)
* *stor_dir*: Directory where outputs can be found
* *restart*: If True, automatic restart from saved models (e.g. to be used if your training was interrupted before completion)

### Evaluation

You can evaluate the outputs of your experiment with [evaluation/main_evaluator.py](evaluation/main_evaluator.py), with the following possibilities:

```python
from evaluation import Evaluator

stor_dir = '../evaluation/'
e = Evaluator(experiment_name='BIMODAL_fixed_512')

# Plot training and validation loss within one figure
e.eval_training_validation(stor_dir=stor_dir)

# Plot percentage of novel, valid and unique SMILES
e.eval_molecule(stor_dir=stor_dir)
```


Parameters:
* *experiment_name*: Name of the parameter file (.ini)
* *stor_dir*: Directory where outputs can be found

Note:
- with the *stor_dir* used above, the loss plot can be found in '../evaluation/{experiment_name}/statistic/all_statistic.png'
- the plot of novel, valid and unique SMILES can be found in '../evaluation/{experiment_name}/molecules/novel_valid_unique_molecules.png'

## Fine-tuning a model<a name="Finetuning"></a>

Fine-tuning requires a pre-trained model and a parameter file (.ini).
Examples of the parameter files (BIMODAL and ForwardRNN) are provided in [experiments/](experiments/).

You can start the fine-tuning procedure with [model/main_fine_tuner.py](model/main_fine_tuner.py). The required parameters are listed below:

|Section        |Parameter      | Description           |Comments |
| --- | --- | --- | --- |   
|Model      |model          | Type              | ForwardRNN, FBRNN, BIMODAL, NADE  |
|       |hidden_units   | Number of hidden units    |   Suggested value: 256 for ForwardRNN, FBRNN and NADE;  128 for BIMODAL|
|       |generation | Only NADE (other models defined through preprocessing)            | fixed, random |
|Data       |data       | Name of data file     | Has to be located in data/ |
|       |encoding_size  | Number of different SMILES tokens     | 55 |
|       |molecular_size | Length of string with padding | See preprocessing |
|       |missing_token  | To add in the parameter file only for NADE            | M |
|Training   |epochs     | Number of epochs      |  Suggested value: 10 |
|       |learning_rate  | Learning rate         |  Suggested value: 0.001|
|       |batch_size | Batch size            |  Suggested value: 128  |
|Evaluation | samples   | Number of generated SMILES after each epoch |  |
|       | temp      | Sampling temperature      | Suggested value: 0.7 |
|       | starting_token    | Starting token for sampling   | G for all models except NADE, which requires a sequence consisting of missing values (see publication)    |
|Fine-Tuning        |start_model            | Name of pre-trained model to be used for fine-tuning              |   |
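
Compared to the training parameter file, the table above has no n_folds entry but adds a fine-tuning section that names the pre-trained model to start from. As a rough, illustrative fragment (exact section and key spelling as in the templates in experiments/):

```ini
; Added on top of the usual Model/Data/Training/Evaluation sections.
[FINETUNING]
start_model = BIMODAL_fixed_512
```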

To fine-tune a model, you can run:

```python
# See model/main_fine_tuner.py for the complete example.
from fine_tuner import FineTuner

t = FineTuner(experiment_name='BIMODAL_random_512_FineTuning_template')
t.fine_tuning(stor_dir='../evaluation/', restart=False)
```


Parameters:
* *experiment_name*: Name of the parameter file (.ini)
* *stor_dir*: Directory where outputs can be found
* *restart*: If True, automatic restart from saved models (e.g. to be used if your training was interrupted before completion)

Note:
-  The batch size should not exceed the number of SMILES that you have in your fine-tuning file (taking into account the data augmentation).

## Authors<a name="Authors"></a>

* Robin Lingwood (https://github.com/robinlingwood)
* Francesca Grisoni (https://github.com/grisoniFr)
* Michael Moret (https://github.com/michael1788)

See also the list of [contributors](https://github.com/ETHmodlab/Bidirectional_RNNs/contributors) who participated in this project.

## License<a name="License"></a>

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.

## How to Cite <a name="cite"></a>

If you use this code (or parts thereof), please cite it as:

```
@article{grisoni2020,
  title     = {Bidirectional Molecule Generation with Recurrent Neural Networks},
  author    = {Grisoni, Francesca and Moret, Michael and Lingwood, Robin and Schneider, Gisbert},
  journal   = {Journal of Chemical Information and Modeling},
  volume    = {Article ASAP},
  number    = {},
  pages     = {},
  year      = {2020},
  doi       = {10.1021/acs.jcim.9b00943},
  url       = {https://pubs.acs.org/doi/10.1021/acs.jcim.9b00943},
  publisher = {ACS Publications}
}
```