mtzgroup / ChemPixCH

Recognising hand-drawn molecules with neural networks
Apache License 2.0
29 stars 5 forks source link
papers-with-code

Hand Drawn Hydrocarbon Recognition

This repo contains the code and data needed to generate all the data for the paper

"Recognizing Hand-drawn Hydrocarbon Structures with Neural Networks: A Practical Case Study of Deep Learning and Synthetic Data Generation in a Chemistry Context" Weir, H., Thompson, K., Woodward, A., Choi, B., Braun, A., Martinez, T.J.

The Im2Smiles neural network and code to build the datasets for training are included.

The Im2Smiles code is modified from the Im2Latex network by Guillaume Genthial.

Installation

  1. Clone ChemPixCH repo

    git clone https://github.com/mtzgroup/ChemPixCH.git
  2. Create conda environment and install packages

    cd ChemPixCh
    conda env create -f environment.yaml
  3. Activate environment

    source activate im2smiles

    or

    conda activate im2smiles
  4. Check environment is installed correctly:

    cd im2smiles
    make train-small

Building Datasets

The datasets can be generated in the data directory.

The datasets can be built by running
$ python build.py
in the following directories.

Figure 2:
Training, Validation and Test set: data/clean-RDKit/

Figure 7a:
Training, Validation and Test set: data/synthetic/pipeline_stages/

Figure 7b:
Training, Validation and Test set: data/synthetic/size_tests/

Figure 8a:
Training set: data/synthetic/size_tests/
Validation set: data/hand-drawn/hand-drawn-val/

Figure 8b:
Training set: data/hand-drawn/hand-drawn-training/training-sets/
Validation set: data/hand-drawn/hand-drawn-training/validation-sets/

Figure 8c:
Pre-Training:
Training set: data/synthetic/size_tests/500K/
Validation Sets: data/hand-drawn/hand-drawn-training/validation-sets/ and data/synthetic/size_tests/500K/

Fine-tuning:
Training set:data/hand-drawn/hand-drawn-training/training-sets/
Validation set: data/hand-drawn/hand-drawn-training/validation-sets/

Hand-drawn test set: data/hand-drawn/test-set/

Training im2smiles network

Getting started

Move to the im2smiles directory: $ cd im2smiles

If you haven't already, check the environment is installed correctly by training on the small dataset:
$ make train-small

Training on synthetic and hand-drawn data

Figure 2:
Figure2a $ make train-clean-rdkit-10K
$ make train-clean-rdkit-50K
$ make train-clean-rdkit-100K
$ make train-clean-rdkit-200K
$ make train-clean-rdkit-500K

Figure 7a:
$ make train-SD-stage-rdkitp
$ make train-rd-stage-rdkitp-aug
$ make train-rd-stage-rdkitp-aug-bkg
$ make train-rd-stage-rdkitp-aug-bkg-deg

Figure 7b:
$ make train-SD-sizes-50K
$ make train-SD-sizes-100K
$ make train-SD-sizes-200K
$ make train-SD-sizes-500K

Figure 8a:
$ make train-HDval-50K
$ make train-HDval-100K
$ make train-HDval-200K
$ make train-HDval-500K

Figure 8b:
$ make train-HDtrain-0_100
$ make train-HDtrain-10_90
$ make train-HDtrain-50_50
$ make train-HDtrain-90_10
$ make train-HDtrain-100_0

Figure 8c:
Restart weights of HDval-500K and SD-500K training, followed by
$ make train-HDtrain-90_10

Maintainers

Hayley Weir