juanniwu / t-SMILES


t-SMILES: A Scalable Fragment-based Molecular Representation Framework

When using advanced NLP methodologies to solve chemical problems, two fundamental questions arise: 1) What are 'chemical words'? and 2) How can they be encoded as 'chemical sentences'?

This study introduces a scalable, fragment-based, multiscale molecular representation algorithm called t-SMILES (tree-based SMILES) to address the second question. It describes molecules using SMILES-type strings obtained by performing a breadth-first search on a full binary tree formed from a fragmented molecular graph.

For more details, please refer to the papers.

TSSA, TSDY, TSID: https://www.nature.com/articles/s41467-024-49388-6

TSIS (TSIS, TSISD, TSISO, TSISR): https://arxiv.org/abs/2402.02164

Systematic evaluations using JTVAE, BRICS, MMPA, and Scaffold show that:

  1. It can build a multi-code molecular description system in which the various descriptions complement each other, enhancing overall performance. Under this framework, classical SMILES can be unified as a special case of t-SMILES to achieve better-balanced performance using hybrid decomposition algorithms.

  2. It exhibits impressive performance on the low-resource datasets JNK3 and AID1706, whether the model is original, data-augmented, or pre-trained and fine-tuned.

  3. It significantly outperforms classical SMILES, DeepSMILES, SELFIES, and baseline models in goal-directed tasks.

  4. It outperforms previous fragment-based models and is competitive with classical SMILES and graph-based methods on ZINC, QM9, and ChEMBL.

To support the t-SMILES algorithm, we introduce a new character, '&', to act as a tree node when the node is not a real fragment in the full binary tree (FBT). Additionally, we introduce another new character, '^', to separate two adjacent substructure segments in a t-SMILES string, similar to the blank space that separates two words in an English sentence.
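The role of the two characters can be illustrated with a toy sketch. It is not the encoder shipped in this repository: the `Node` class, the helper names, and the exact string layout are assumptions made for illustration, and the real t-SMILES codes follow the rules given in the papers. The sketch pads a small fragment tree with '&' nodes until it is a full binary tree, walks it breadth-first, and joins the node labels with '^'.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional

PAD = "&"  # placeholder for a tree node that is not a real fragment
SEP = "^"  # separator between adjacent segments


@dataclass
class Node:
    """Hypothetical fragment-tree node; label is a fragment SMILES or PAD."""
    label: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def to_full_binary(node: Node) -> Node:
    """Return a copy in which every internal node has exactly two children."""
    if node.left is None and node.right is None:
        return Node(node.label)
    left = to_full_binary(node.left) if node.left is not None else Node(PAD)
    right = to_full_binary(node.right) if node.right is not None else Node(PAD)
    return Node(node.label, left, right)


def bfs_labels(root: Node) -> List[str]:
    """Collect node labels level by level, left to right."""
    labels, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        labels.append(node.label)
        if node.left is not None:
            queue.append(node.left)
        if node.right is not None:
            queue.append(node.right)
    return labels


def serialize(root: Node) -> str:
    """Toy serialization: pad to a full binary tree, then join BFS labels."""
    return SEP.join(bfs_labels(to_full_binary(root)))


# A three-fragment toy tree (the fragments are arbitrary examples).
tree = Node("c1ccccc1", left=Node("CC", right=Node("O")))
print(serialize(tree))
```

Splitting the resulting string on '^' gives back the sequence of segments, with '&' marking the positions that hold no real fragment.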

Four coding algorithms are presented in these studies:

  1. TSSA: t-SMILES with shared atom.

  2. TSDY: t-SMILES with dummy atom but without ID.

  3. TSID: t-SMILES with ID and dummy atom.

  4. TSIS: Simplified TSID, including TSIS, TSISD, TSISO, TSISR.

For example, the six t-SMILES codes of Celecoxib are:

TSID_M:

TSDY_M (replace [n*] with *):

TSSA_M:

TSIS_M:

TSISD_M:

TSISO_M:

Here we provide the source code of our method.

Dependencies

We recommend Anaconda to manage the version of Python and installed packages.

Please make sure the following packages are installed:

  1. Python (version >= 3.7)

  2. PyTorch (version == 1.7)

  3. RDKit (version >= 2020.03)

  4. NetworkX (version >= 2.4)

  5. NumPy (version >= 1.19)

  6. Pandas (version >= 1.2.2)

  7. Matplotlib (version >= 2.0)

  8. SciPy (version >= 1.4.1)

Datamol and rBRICS are not installed as packages: please download them from https://github.com/datamol-io/datamol and https://github.com/BiomedSciAI/r-BRICS and copy them into the MolUtils folder.
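The version requirements listed above can be checked from Python before running the code. The following is a minimal sketch, not part of this repository: the distribution names (`torch`, `rdkit`, and so on) are assumptions that may differ for some conda installs, and `importlib.metadata` requires Python 3.8 or later.

```python
import re
from importlib import metadata

# Requirements copied from the list above. The distribution names are
# assumptions; adjust them if your environment registers packages differently.
REQUIREMENTS = {
    "torch": ("==", "1.7"),
    "rdkit": (">=", "2020.03"),
    "networkx": (">=", "2.4"),
    "numpy": (">=", "1.19"),
    "pandas": (">=", "1.2.2"),
    "matplotlib": (">=", "2.0"),
    "scipy": (">=", "1.4.1"),
}


def parse(version):
    """Turn '1.7.1+cu110' into (1, 7, 1); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def satisfies(installed, op, required):
    """'==' compares only the components given in `required` (1.7.1 matches 1.7)."""
    have, want = parse(installed), parse(required)
    if op == "==":
        return have[:len(want)] == want
    return have >= want


def check(requirements=REQUIREMENTS):
    """Return a {package: status string} report for the current environment."""
    report = {}
    for name, (op, required) in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = satisfies(installed, op, required)
        status = "ok" if ok else "expected " + op + " " + required
        report[name] = installed + ": " + status
    return report


if __name__ == "__main__":
    for name, status in check().items():
        print(name, "->", status)
```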

Usage

  1. DataSet/Graph/CNJTMol.py

encode_single()

It contains a preprocessing function that generates t-SMILES from a dataset.

  2. DataSet/Graph/CNJMolAssembler.py

decode_single()

It reconstructs molecules from t-SMILES to generate classical SMILES.
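A simple way to sanity-check the two functions together is a round-trip test: encode each molecule, decode the result, and compare it with the input. The sketch below is an illustration only. The signatures of encode_single() and decode_single() are not documented here, so the harness takes the encoder and decoder as plain callables, and `round_trip_failures` and `canonicalize` are hypothetical names. With RDKit, `canonicalize` could be `lambda s: Chem.MolToSmiles(Chem.MolFromSmiles(s))`.

```python
from typing import Callable, Iterable, List, Tuple


def round_trip_failures(
    smiles_list: Iterable[str],
    encode: Callable[[str], str],
    decode: Callable[[str], str],
    canonicalize: Callable[[str], str] = lambda s: s,
) -> List[Tuple[str, str]]:
    """Return (input, outcome) pairs for molecules that do not survive
    encode -> decode. `encode` and `decode` stand in for wrappers around
    the repository functions; their real signatures are an assumption."""
    failures = []
    for smiles in smiles_list:
        try:
            rebuilt = decode(encode(smiles))
        except Exception as exc:  # keep going so one bad molecule does not stop the run
            failures.append((smiles, "error: " + str(exc)))
            continue
        if canonicalize(rebuilt) != canonicalize(smiles):
            failures.append((smiles, rebuilt))
    return failures


# Stand-in codec used only to show the call pattern (not t-SMILES).
def toy_encode(s):
    return s[::-1]


def toy_decode(s):
    return s[::-1]


print(round_trip_failures(["CCO", "c1ccccc1"], toy_encode, toy_decode))
```

An empty list means every molecule was reconstructed; any entries point to inputs worth inspecting by hand.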

In this study, GPT and RNN generative models are used for evaluation.

Acknowledgement

We thank the authors of the following Git repositories, which gave us a lot of inspiration:

  1. Datamol: https://github.com/datamol-io/datamol

  2. rBRICS: https://github.com/BiomedSciAI/r-BRICS

  3. MolGPT: https://github.com/devalab/molgpt

  4. MGM: https://github.com/nyu-dl/dl4chem-mgm

  5. JTVAE: https://github.com/wengong-jin/icml18-jtnn

  6. hgraph2graph: https://github.com/wengong-jin/hgraph2graph

  7. DeepSmiles: https://github.com/baoilleach/deepsmiles

  8. SELFIES: https://github.com/aspuru-guzik-group/selfies

  9. FragDGM: https://github.com/marcopodda/fragment-based-dgm

  10. CReM: https://github.com/DrrDom/crem

  11. AttentiveFP: https://github.com/OpenDrugAI/AttentiveFP

  12. Guacamol: https://github.com/BenevolentAI/guacamol_baselines

  13. MOSES: https://github.com/molecularsets/moses

  14. GPT2: https://github.com/samwisegamjeee/pytorch-transformers