GraphKM: machine and deep learning for KM prediction of wildtype and mutant enzymes

Introduction

GraphKM is a Python toolbox for predicting Michaelis constants (KM) of wildtype and mutant enzymes.

Requirements

These instructions assume that you use Miniconda or Anaconda. In a terminal, run:

conda create -n GraphKM python=3.8
conda activate GraphKM

Required packages:

paddlehelix==1.0.1
pgl==2.2.4
paddlepaddle-gpu==2.3.2
matplotlib
scikit-learn
rdkit
PubChemPy
xgboost==1.7.5
hyperopt==0.2.7
ESM

Note: install paddlepaddle-gpu==2.3.2 with the following command:

conda install paddlepaddle-gpu==2.3.2 cudatoolkit=11.2 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge

Please refer to the ESM GitHub repository for ESM installation instructions.
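
Once the environment is set up, a quick check that the pinned versions are actually installed can save a failed training run. The sketch below is not part of GraphKM. It compares the version pins from the list above against what `importlib.metadata` reports, and it assumes that the distribution names match the names in the list, which may not hold for packages installed through conda.

```python
from importlib import metadata

# Version pins copied from the requirements list above.
PINNED = {
    "paddlehelix": "1.0.1",
    "pgl": "2.2.4",
    "paddlepaddle-gpu": "2.3.2",
    "xgboost": "1.7.5",
    "hyperopt": "0.2.7",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pinned=PINNED, lookup=installed_version):
    """Map each package name to 'ok', 'missing', or 'found <version>'."""
    report = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found is None:
            report[name] = "missing"
        elif found == wanted:
            report[name] = "ok"
        else:
            report[name] = f"found {found}"
    return report


if __name__ == "__main__":
    for name, status in check_pins().items():
        print(f"{name}=={PINNED[name]}: {status}")
```

A package reported as "missing" may still be importable if it was installed under a different distribution name, so treat the report as a hint rather than a verdict.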

Input files

Before data preprocessing, a JSON file and a CSV file must be ready. The JSON file and the CSV file are generated by KM_data_clean/generate_esm_vector_gpu.py. Run the following command:

python generate_esm_vector_gpu.py -i my_data.json -o sequences_embeddings.csv

Train

Preprocess

python data_preprocess.py -i my_data.json -l KM -input_seq my_protein_sequences_embeddings.csv -o my_dataset.npz
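
Before starting a long training run, it can be worth confirming that the preprocessing step wrote a non-empty dataset. A `.npz` file is a zip archive with one `.npy` member per array, so the array names can be listed with the standard library alone. The sketch below is an illustration, not part of GraphKM, and assumes nothing about which arrays `data_preprocess.py` stores.

```python
import zipfile


def list_npz_arrays(path):
    """Return (array_name, uncompressed_size_in_bytes) for each array in a .npz file."""
    arrays = []
    with zipfile.ZipFile(path) as archive:
        for info in archive.infolist():
            if info.filename.endswith(".npy"):
                # Strip the ".npy" suffix to recover the array name.
                arrays.append((info.filename[:-4], info.file_size))
    return arrays
```

Calling `list_npz_arrays("my_dataset.npz")` on the file written by the command above returns one pair per stored array. An empty list suggests that preprocessing did not produce usable output.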

Training

Training requires a large amount of GPU memory if you use a GPU for acceleration. A GPU with 24 GB of memory is recommended.

python train.py -d path_to/my_dataset.npz --model_config path_to/gin_config.json -l KM --model_dir path_to/ --results_dir path_to/

python train_xgb.py -i path_to/my_data.json -l KM -input_seq path_to/my_protein_sequences_embeddings.csv -m path_to/best_model_gin_-1_lr0.0005.pdparams --model_config path_to/gin_config.json
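
The preprocessing command and the two training commands run in sequence, so they can be chained from one script. The following is a minimal sketch, not part of GraphKM. It only assembles the three command lines from the flags documented in this README and, by default, prints them instead of running them. The function names are hypothetical, and all file names passed in are placeholders supplied by the caller.

```python
import shlex
import subprocess


def training_commands(data_json, embeddings_csv, dataset_npz,
                      model_config, out_dir, checkpoint, label="KM"):
    """Build the preprocess, train, and train_xgb command lines in order."""
    return [
        ["python", "data_preprocess.py", "-i", data_json, "-l", label,
         "-input_seq", embeddings_csv, "-o", dataset_npz],
        ["python", "train.py", "-d", dataset_npz,
         "--model_config", model_config, "-l", label,
         "--model_dir", out_dir, "--results_dir", out_dir],
        ["python", "train_xgb.py", "-i", data_json, "-l", label,
         "-input_seq", embeddings_csv, "-m", checkpoint,
         "--model_config", model_config],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)
```

The checkpoint passed to `train_xgb.py` must be a file that `train.py` actually wrote, so in practice the third command is usually built after the second has finished and the checkpoint name is known.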

Training results

Method	MSE	RMSE	R²
GIN-based	0.639	0.799	0.614
GAT-based	0.709	0.842	0.572
GCN-based	0.671	0.819	0.595
GAT_GCN-based	0.627	0.792	0.622

Note: The trained models are available in the Figshare database with DOI: 10.6084/m9.figshare.25335049.
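
The three metric columns in the table above are related: the root-mean-square error is the square root of the MSE, and the coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses these standard definitions. It is not taken from the GraphKM code, and whether the reported values were computed in exactly this way is an assumption.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (mse, rmse, r2) for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    rmse = math.sqrt(mse)
    r2 = 1.0 - ss_res / ss_tot  # undefined when all targets are identical
    return mse, rmse, r2


# Consistency check on the reported table: sqrt(MSE) reproduces each
# root-mean-square error to three decimals.
reported = {"GIN": (0.639, 0.799), "GAT": (0.709, 0.842),
            "GCN": (0.671, 0.819), "GAT_GCN": (0.627, 0.792)}
for method, (mse, rmse) in reported.items():
    assert round(math.sqrt(mse), 3) == rmse, method
```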

Prediction

The input for prediction.py:

To predict KM values for different sequences paired with different substrate SMILES strings, use a CSV file as input. For the CSV format, refer to the example.csv file. Command-line example for prediction:
```
python prediction.py -c --csv_file example.csv -l KM -input_seq example.tsv -m path_to/best_model_gin_-1_lr0.0005.pdparams --model_config gin_config.json -xgb path_to/gin_xgboost_model.dat
```
To predict KM values for different sequences with a single substrate SMILES string, use a FASTA file as input.

Command-line example for prediction:
```
python prediction.py -l KM -f --fasta_file example.fasta -input_seq my_sequences_embeddings.tsv -S substrate.txt -m path_to/best_model_gin_-1_lr0.0005.pdparams --model_config path_to/gin_config.json -xgb path_to/gin_xgboost_model.dat
```
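
For the FASTA mode, the sequences to score have to be collected into one file first. The sketch below writes a dictionary of sequences in standard FASTA layout (a `>` header line followed by the sequence). It is an illustration only, with hypothetical function names. Writing each sequence on a single line is a conservative choice, since this README does not say whether `prediction.py` accepts wrapped sequences.

```python
def to_fasta(sequences):
    """Format {identifier: sequence} as FASTA text, one line per sequence."""
    lines = []
    for identifier, seq in sequences.items():
        cleaned = "".join(seq.split()).upper()  # drop whitespace, normalize case
        if not cleaned:
            raise ValueError(f"empty sequence for {identifier!r}")
        lines.append(f">{identifier}")
        lines.append(cleaned)
    return "\n".join(lines) + "\n"


def write_fasta(sequences, path):
    """Write the sequences to `path` and return the number of records."""
    with open(path, "w") as handle:
        handle.write(to_fasta(sequences))
    return len(sequences)
```

The resulting file can then be passed to `prediction.py` through `--fasta_file`, as in the command above.
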
Independent dataset

We manually collected an independent KM dataset (HXKm) from the literature. The HXKm dataset has been published in this journal.

Tip

Use the -h flag for more help:

python data_preprocess.py -h
python train.py -h
python train_xgb.py -h
python prediction.py -h

Citation

He, X., Yan, M. GraphKM: machine and deep learning for KM prediction of wildtype and mutant enzymes. BMC Bioinformatics 25, 135 (2024). https://doi.org/10.1186/s12859-024-05746-1
