jinhojsk515 / spmm

Multimodal learning for chemical domain, with SMILES and properties.
Apache License 2.0

SPMM: Structure-Property Multi-Modal learning for molecules

The official GitHub repository for SPMM, a multi-modal molecular pre-trained model for synergistic comprehension of molecular structure and properties. Details can be found in the paper "Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model" (Nature Communications, 2024).


Molecular structures are given as SMILES strings, and we use 53 simple chemical properties to build a property vector (PV) for each molecule.
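
As a rough illustration of what a fixed-order property vector looks like, here is a minimal sketch. It is not the repository's code: the three property names are an illustrative subset (the actual 53 properties are not listed in this section), `build_pv` and `properties_from_smiles` are hypothetical helpers, and the use of RDKit as the property calculator is an assumption.

```python
from typing import Dict, List

# Illustrative subset only; the actual 53 properties are not listed here.
PROPERTY_ORDER: List[str] = ["MolWt", "MolLogP", "TPSA"]


def build_pv(values: Dict[str, float], order: List[str] = PROPERTY_ORDER) -> List[float]:
    """Arrange named property values into a vector with a fixed ordering."""
    missing = [name for name in order if name not in values]
    if missing:
        raise KeyError(f"missing properties: {missing}")
    return [float(values[name]) for name in order]


def properties_from_smiles(smiles: str) -> Dict[str, float]:
    """Compute the illustrative properties with RDKit (an assumption, imported lazily)."""
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles!r}")
    return {name: getattr(Descriptors, name)(mol) for name in PROPERTY_ORDER}


# Placeholder numbers (not real chemistry), so the example runs without RDKit.
pv = build_pv({"TPSA": 3.0, "MolWt": 1.0, "MolLogP": 2.0})
print(pv)
```

The point of the fixed ordering is that position `i` of every PV always refers to the same property, regardless of the order in which the values were computed.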

The model checkpoint and data are too large to be included in this repo; they can be found here.

Files

Requirements

Run pip install -r requirements.txt to install the required packages.

Code running

Arguments can be passed on the command line or edited manually as defaults in the script being run.
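
The sketch below shows one common way such a script can expose its settings: defaults live in the code, and command-line flags override them. It is an assumption about the general pattern, not the repository's actual parser; the flag names are taken from the commands in this README, while `build_parser` and `str2bool` are hypothetical helpers.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse 'True'/'False'-style strings.

    Plain type=bool would treat any non-empty string, including 'False', as True.
    """
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Hypothetical SPMM-style argument parser")
    # The defaults below play the role of values "edited manually in the code".
    parser.add_argument("--checkpoint", default="./Pretrain/checkpoint_SPMM.ckpt")
    parser.add_argument("--input_file", default="./data/pubchem_1k_unseen.txt")
    parser.add_argument("--k", type=int, default=2)
    parser.add_argument("--n_generate", type=int, default=1000)
    parser.add_argument("--stochastic", type=str2bool, default=True)
    return parser


# Flags given on the command line override the defaults; omitted flags keep them.
args = build_parser().parse_args(["--k", "3", "--stochastic", "False"])
print(args.k, args.stochastic, args.checkpoint)
```

If a boolean flag seems to have no effect when set to `False`, checking how the script converts the string to a boolean is a reasonable first step.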

  1. Pre-training

    python SPMM_pretrain.py --data_path './data/pretrain.txt'
  2. PV-to-SMILES generation

    • batched: The model takes the PVs of the molecules in input_file and generates molecules with those PVs using k-beam search. The generated molecules are written to generated_molecules.txt.
      python d_pv2smiles_batched.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --input_file './data/pubchem_1k_unseen.txt' --k 2
    • single: The model takes one query PV and generates n_generate molecules with that PV using k-beam search. The generated molecules are written to generated_molecules.txt. The input PV must be built in the code; see the four included examples.
      python d_pv2smiles_single.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --n_generate 1000 --stochastic True --k 2
  3. SMILES-to-PV generation

    The model takes the query molecules in input_file and generates their PVs.

    python d_smiles2pv.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --input_file './data/pubchem_1k_unseen.txt'
  4. MoleculeNet + DILI prediction task

    d_regression.py, d_classification.py, and d_classification_multilabel.py perform regression, binary classification, and multi-label classification tasks, respectively.

    python d_regression.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --name 'bace'
    python d_classification.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --name 'bbbp'
    python d_classification_multilabel.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --name 'clintox'
  5. Forward/retro-reaction prediction tasks

    d_rxn_prediction.py performs both forward and retro (reverse) reaction prediction tasks on the USPTO-480k and USPTO-50k datasets.

    e.g. forward reaction prediction, no beam search

    python d_rxn_prediction.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --mode 'forward' --n_beam 1 

    e.g. retro reaction prediction, beam search with k=3

    python d_rxn_prediction.py --checkpoint './Pretrain/checkpoint_SPMM.ckpt' --mode 'retro' --n_beam 3 
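
For readers unfamiliar with the beam search used by the generation and reaction prediction commands above, here is a generic, minimal sketch of the idea. It is not the repository's decoder: `beam_search` and `toy_model` are hypothetical, and the toy next-token table stands in for a real model that would condition on a PV or on reactant SMILES. With a beam width of 1 the procedure reduces to greedy decoding, which corresponds to "no beam search".

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical model interface: token prefix -> next-token probabilities.
NextTokenFn = Callable[[Tuple[str, ...]], Dict[str, float]]


def beam_search(next_token_probs: NextTokenFn, n_beam: int, max_len: int,
                eos: str = "<eos>") -> List[Tuple[Tuple[str, ...], float]]:
    """Return up to n_beam (sequence, log-probability) pairs, best first."""
    beams = [((), 0.0)]  # start from the empty prefix with log-probability 0
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, p in next_token_probs(prefix).items():
                if p > 0.0:
                    candidates.append((prefix + (token,), score + math.log(p)))
        # Keep only the n_beam highest-scoring extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:n_beam]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len are still returned
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:n_beam]


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Toy distribution in which the greedy first choice is not the best sequence."""
    table = {
        (): {"C": 0.6, "N": 0.4},
        ("C",): {"C": 0.55, "O": 0.45},
        ("N",): {"O": 0.9, "C": 0.1},
    }
    return table.get(prefix, {"<eos>": 1.0})


greedy = beam_search(toy_model, n_beam=1, max_len=5)
beam2 = beam_search(toy_model, n_beam=2, max_len=5)
print(greedy[0][0])  # ('C', 'C', '<eos>')
print(beam2[0][0])   # ('N', 'O', '<eos>')
```

In this toy case greedy decoding commits to "C" first and ends with total probability 0.33, while a beam of width 2 keeps "N" alive and finds the sequence with probability 0.36. Wider beams trade extra computation for a better chance of finding high-probability outputs.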

Acknowledgement