The GraphKM toolbox is a Python package for predicting Michaelis constants (KM).
Assuming you use Miniconda or Anaconda, run the following in a terminal:
conda create -n GraphKM python=3.8
conda activate GraphKM
Required packages:
paddlehelix==1.0.1
pgl==2.2.4
paddlepaddle-gpu==2.3.2
matplotlib
scikit-learn
rdkit
PubChemPy
xgboost==1.7.5
hyperopt==0.2.7
ESM
Note: paddlepaddle-gpu==2.3.2 is installed with the following command:
conda install paddlepaddle-gpu==2.3.2 cudatoolkit=11.2 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge
For ESM installation, please refer to the ESM GitHub repository.
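To confirm that the environment matches the pinned versions listed above, a small check like the following may help. It is a sketch that relies only on the standard-library `importlib.metadata` module; `check_pins` is an illustrative name, not part of GraphKM, and the distribution names are assumed to match the names in the requirements list.

```python
from importlib import metadata

# Pinned versions copied from the requirements list above.
PINS = {
    "paddlehelix": "1.0.1",
    "pgl": "2.2.4",
    "paddlepaddle-gpu": "2.3.2",
    "xgboost": "1.7.5",
    "hyperopt": "0.2.7",
}


def check_pins(pins):
    """Return {name: (expected, installed)}; installed is None if missing."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (expected, installed)
    return report


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINS).items():
        status = "OK" if installed == expected else "MISMATCH"
        print(f"{name}: expected {expected}, found {installed} [{status}]")
```

Run it inside the activated GraphKM environment; any line marked MISMATCH points to a package that is missing or installed at a different version.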
Before data preprocessing, prepare a JSON file and a CSV file. Both files are generated by KM_data_clean/generate_esm_vector_gpu.py. Run the following commands:
python generate_esm_vector_gpu.py -i my_data.json -o sequences_embeddings.csv
python data_preprocess.py -i my_data.json -l KM -input_seq my_protein_sequences_embeddings.csv -o my_dataset.npz
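Before training, it can be useful to inspect what `data_preprocess.py` wrote. The array names inside `my_dataset.npz` are not documented here, so the sketch below makes no assumption about them and simply lists whatever it finds. `summarize_npz` is a hypothetical helper, not part of GraphKM.

```python
import numpy as np


def summarize_npz(path):
    """Map each array name in an .npz archive to its (shape, dtype)."""
    # allow_pickle=True is an assumption: preprocessed graph data is often
    # stored as object arrays. Only enable it for files you created yourself.
    with np.load(path, allow_pickle=True) as data:
        return {key: (data[key].shape, str(data[key].dtype)) for key in data.files}
```

For example, `summarize_npz("my_dataset.npz")` returns a dictionary that can be printed to check that the archive is non-empty and that the array sizes look plausible.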
Training requires a large amount of memory when a GPU is used for acceleration. A GPU with 24 GB of memory is recommended.
python train.py -d path_to/my_dataset.npz --model_config path_to/gin_config.json -l KM --model_dir path_to/ --results_dir path_to/
python train_xgb.py -i path_to/my_data.json -l KM -input_seq path_to/my_protein_sequences_embeddings.csv -m path_to/best_model_gin_-1_lr0.0005.pdparams --model_config path_to/gin_config.json
Method | MSE | RMSE | R² |
---|---|---|---|
GIN-based | 0.639 | 0.799 | 0.614 |
GAT-based | 0.709 | 0.842 | 0.572 |
GCN-based | 0.671 | 0.819 | 0.595 |
GAT_GCN-based | 0.627 | 0.792 | 0.622 |
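
For reference, the three metrics are related in the usual way: RMSE is the square root of MSE, and R² compares the squared error against the variance of the true values. The sketch below computes them with NumPy on arbitrary arrays. It is not the evaluation code used to produce the table, and `regression_metrics` is a hypothetical helper name.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE and R2 for two equal-length 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = float(np.sum((y_true - y_pred) ** 2))         # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    mse = ss_res / y_true.size
    return {"MSE": mse, "RMSE": float(np.sqrt(mse)), "R2": 1.0 - ss_res / ss_tot}
```

The table rows are consistent with this relation; for instance, the square root of 0.639 is approximately 0.799, matching the GIN-based row.
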
Note: The trained models are available in the Figshare database with DOI: 10.6084/m9.figshare.25335049.
Input for prediction.py:
To predict KM values for different sequences paired with different substrate SMILES codes, use a CSV file as input. For the CSV format, refer to the example.csv file. Example command line for prediction:
python prediction.py -c --csv_file example.csv -l KM -input_seq example.tsv -m path_to/best_model_gin_-1_lr0.0005.pdparams --model_config gin_config.json -xgb path_to/gin_xgboost_model.dat
To predict KM values for different sequences paired with a single substrate SMILES code, use a FASTA file as input.
Example command line for prediction:
python prediction.py -l KM -f --fasta_file example.fasta -input_seq my_sequences_embeddings.tsv -S substrate.txt -m path_to/best_model_gin_-1_lr0.0005.pdparams --model_config path_to/gin_config.json -xgb path_to/gin_xgboost_model.dat
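Before running a FASTA-based prediction, a quick sanity check of the input file can catch empty records or a missing header. The parser below is a minimal sketch of standard FASTA reading; `read_fasta` is a hypothetical helper and does not reflect how `prediction.py` parses its input.

```python
def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                # Sequence lines before the first header are ignored later.
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Calling `read_fasta("example.fasta")` and checking that every sequence is non-empty is a cheap way to validate the file before starting a prediction run.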
We manually collected an independent KM dataset (HXKm) from the literature. The HXKm dataset has been published in this journal.
Use the -h flag for more help:
python data_preprocess.py -h
python train.py -h
python train_xgb.py -h
python prediction.py -h
He, X., Yan, M. GraphKM: machine and deep learning for KM prediction of wildtype and mutant enzymes. BMC Bioinformatics 25, 135 (2024). https://doi.org/10.1186/s12859-024-05746-1