Molformer

Introduction

This is the repository for our Molformer.

Installation

# Install packages
pip install torch scikit-learn mendeleev
pip install rdkit-pypi
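
To confirm that the packages are importable after installation, a small check like the following can help. It is only a sketch: the helper name `missing_modules` is hypothetical and not part of this repository. Note that two import names differ from the pip package names (scikit-learn is imported as `sklearn`, rdkit-pypi as `rdkit`).

```python
from importlib.util import find_spec

# Top-level import names of the required packages.
# scikit-learn is imported as "sklearn" and rdkit-pypi as "rdkit".
REQUIRED_MODULES = ["torch", "sklearn", "mendeleev", "rdkit"]


def missing_modules(names):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("missing:", missing if missing else "none")
```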

Dataset

We test our model in three domains: quantum chemistry, physiology, and biophysics. We also provide information on the materials science datasets used in the preceding 3D-Transformer. The raw datasets can be downloaded from the following links.

Quantum Chemistry

Physiology

Biophysics

Material Science

Models

models/tr_spe: 3D-Transformer with Sinusoidal Position Encoding (SPE)
models/tr_cpe: 3D-Transformer with Convolutional Position Encoding (CPE)
models/tr_msa: 3D-Transformer with Multi-scale Self-attention (MSA)
models/tr_afps: 3D-Transformer with Attentive Farthest Point Sampling (AFPS)
models/tr_full: 3D-Transformer with CPE + MSA + AFPS

Quick Tour

Model Usage

After processing the dataset, the next step is to build the model. Suppose there are N types of atoms and n downstream tasks. To predict a single property, set n = 1. For multi-scale self-attention, a dist_bar is needed to define the scales of the local regions, such as dist_bar=[1, 3, 5]. You can also specify the number of attention heads, the number of encoder layers, the dimension size, the dropout rate, and so on. Here we adopt only the defaults.

>>> import torch 
>>> from model.tr_spe import build_model

# initialize the model 
>>> model = build_model(N, n).cuda()

# take a 4-atom molecule for example
>>> x = torch.tensor([[1, 1, 6, 8]]).cuda()
>>> pos = torch.tensor([[[7.356203877, 9.058198382, 3.255188164],
                         [5.990730587, 3.951633382, 9.784664946],
                         [1.048332315, 3.912215133, 9.827313903],
                         [2.492201352, 9.097616820, 3.297837121]]]).cuda()
>>> mask = (x != 0).unsqueeze(1)
>>> out = model(x.long(), mask, pos)
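
For the multi-scale self-attention variant, also pass dist_bar when building the model, and feed the pairwise distance matrix instead of the raw coordinates:

>>> import torch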
>>> from model.tr_msa import build_model

# initialize the model 
>>> model = build_model(N, n, dist_bar).cuda()

# take a 4-atom molecule for example
>>> x = torch.tensor([[1, 1, 6, 8]]).cuda()
>>> pos = torch.tensor([[[7.356203877, 9.058198382, 3.255188164],
                         [5.990730587, 3.951633382, 9.784664946],
                         [1.048332315, 3.912215133, 9.827313903],
                         [2.492201352, 9.097616820, 3.297837121]]]).cuda()
>>> mask = (x != 0).unsqueeze(1)
>>> dist = torch.cdist(pos, pos).float()
>>> out = model(x.long(), mask, dist)
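
How dist_bar is applied inside the attention layers is not shown here. As a rough illustration, assuming each scale restricts an atom to the neighbours within that distance, the local regions for the 4-atom example can be sketched in plain Python (no GPU or PyTorch needed). The function name `multi_scale_masks` is hypothetical and not part of this repository.

```python
import math


def multi_scale_masks(pos, dist_bar):
    """For each radius in dist_bar, mark the atom pairs closer than that radius.

    Assumption for illustration only: a "local region" at a given scale is
    the set of atoms within that distance of the query atom.
    """
    n = len(pos)
    # Pairwise Euclidean distances (what torch.cdist(pos, pos) computes per molecule).
    dist = [[math.dist(pos[i], pos[j]) for j in range(n)] for i in range(n)]
    # One n x n boolean mask per scale.
    return [[[dist[i][j] < bar for j in range(n)] for i in range(n)]
            for bar in dist_bar]


# Coordinates of the 4-atom molecule used above.
pos = [
    (7.356203877, 9.058198382, 3.255188164),
    (5.990730587, 3.951633382, 9.784664946),
    (1.048332315, 3.912215133, 9.827313903),
    (2.492201352, 9.097616820, 3.297837121),
]
masks = multi_scale_masks(pos, dist_bar=[1, 3, 5])

# Number of atoms in each atom's local region, per scale.
for bar, mask in zip([1, 3, 5], masks):
    print(bar, [sum(row) for row in mask])
```

Each atom is always inside its own region, since its distance to itself is zero. A larger radius can only enlarge a region, so the masks are nested from the smallest scale to the largest.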

Motif Extraction

We rely on RDKit to extract motifs from small molecules. Given the SMILES representation of a molecule, substructures can be defined manually using SMARTS.

>>> from rdkit import Chem
>>> smiles = 'CC(=O)O'  # example input (acetic acid); substitute your own SMILES
>>> mol = Chem.MolFromSmiles(smiles)
>>> pattern = Chem.MolFromSmarts('C(=O)')
>>> mol.HasSubstructMatch(pattern) # check whether the molecule has the motif 'C(=O)'
>>> mol.GetSubstructMatches(pattern) # get atoms that belong to the motif 'C(=O)'
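
GetSubstructMatches returns one tuple of atom indices per match. A common next step is to turn those matches into a per-atom flag. The sketch below does this in plain Python; the helper `motif_membership` is hypothetical, and the `matches` value is illustrative input rather than the output for any particular molecule.

```python
def motif_membership(matches, num_atoms):
    """Mark each atom index that appears in at least one substructure match."""
    in_motif = [False] * num_atoms
    for match in matches:
        for atom_idx in match:
            in_motif[atom_idx] = True
    return in_motif


# Illustrative input in the shape GetSubstructMatches returns:
# a tuple with one tuple of atom indices per match.
matches = ((1, 2), (4, 5))
flags = motif_membership(matches, num_atoms=6)
print(flags)
```

With RDKit, `num_atoms` would typically come from `mol.GetNumAtoms()` and `matches` from `mol.GetSubstructMatches(pattern)`.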