DeepSS2GO is a deep learning predictor that incorporates secondary structure features along with primary sequence and homology information. The algorithm combines the speed of sequence-based analysis with the accuracy of structure-based analysis: it works from primary sequences and avoids the time-consuming step of tertiary structure analysis. The results show that its prediction performance surpasses state-of-the-art algorithms. By using secondary structure information, it can predict key functions rather than only broad, general Gene Ontology terms. Additionally, DeepSS2GO predicts five times faster than advanced algorithms, making it highly applicable to massive sequencing data.
For Colab Pro users, DeepSS2GO_v2_colab_pro.ipynb is recommended. Upload the FASTA file in the first step, then run all cells with one click.
For Colab free users, DeepSS2GO_v2_colab_free.ipynb is recommended. With this notebook, you must first predict the secondary structure (8-class) of the primary amino acid sequence on another website, then upload both files together.
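Because the free notebook takes a separately predicted ss8 file, it can help to confirm that the two files agree before uploading them. The following is a minimal sketch, not part of DeepSS2GO: it assumes both files are FASTA-formatted with matching record names, and that ss8 uses the eight DSSP-style letters `HGIEBTSC` (some tools write coil as `L` or `-`, so adjust `SS8_ALPHABET` if needed). The function names are hypothetical.

```python
from typing import Dict, List

# Assumption: DSSP-style 8-state letters; adjust if your predictor uses "L" or "-" for coil.
SS8_ALPHABET = set("HGIEBTSC")


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {record_name: sequence}, joining wrapped lines."""
    records: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def check_aa_ss8_pair(aa_text: str, ss8_text: str) -> List[str]:
    """Return a list of problems; an empty list means the pair looks consistent."""
    aa, ss8 = parse_fasta(aa_text), parse_fasta(ss8_text)
    problems = []
    for name in aa:
        if name not in ss8:
            problems.append(f"{name}: no ss8 record")
        elif len(aa[name]) != len(ss8[name]):
            problems.append(
                f"{name}: length mismatch ({len(aa[name])} aa vs {len(ss8[name])} ss8)"
            )
        elif set(ss8[name]) - SS8_ALPHABET:
            bad = sorted(set(ss8[name]) - SS8_ALPHABET)
            problems.append(f"{name}: unexpected ss8 characters {bad}")
    for name in ss8:
        if name not in aa:
            problems.append(f"{name}: no aa record")
    return problems
```

Read both files as text and pass their contents to `check_aa_ss8_pair`; any returned message points at a record to fix before upload.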
Note that quick, full use requires Colab Pro for its higher RAM. Details can be found here:
# Colab Pro, GPU=L4, RAM=53GB, GPURAM=22.5GB, Disk=201GB
https://colab.research.google.com/github/orca233/DeepSS2GO_v2_colab/blob/main/DeepSS2GO_v2_colab_pro.ipynb
# Colab Free, GPU=T4, RAM=12.7GB, GPURAM=15GB, Disk=78GB
https://colab.research.google.com/github/orca233/DeepSS2GO_v2_colab/blob/main/DeepSS2GO_v2_colab_free.ipynb
# Details
https://github.com/orca233/DeepSS2GO_v2_colab/
An updated free Colab version, which uses NetSurfP3.0 to predict ss8 from aa, is available here. Give it a try:
https://colab.research.google.com/github/orca233/DeepSS2GO_v2_colab/blob/main/DeepSS2GO_v2_colab_free_NetSurfP3.ipynb
DeepSS2GO was developed in a Linux environment with:
# Name Version
blast 2.5.0
click 8.1.3
diamond 2.1.7
fair-esm 2.0.0
matplotlib 3.7.1
numpy 1.24.3
pandas 1.3.5
pip 23.1.2
python 3.8.16
scikit-learn 1.2.2
scipy 1.10.1
seaborn 0.12.2
torch 1.8.0+cu111
torchaudio 0.8.0
torchsummary 1.5.1
torchvision 0.9.0+cu111
tqdm 4.65.0
transformers 4.29.2
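To compare a local environment against the versions listed above, a short check along these lines may be useful. This is a sketch rather than part of the repository: `EXPECTED` holds only a subset of the list (packages installable with pip), and the helper names `installed_version` and `version_mismatches` are hypothetical.

```python
from importlib import metadata
from typing import Callable, Dict

# A subset of the versions listed above; extend as needed.
EXPECTED = {
    "click": "8.1.3",
    "fair-esm": "2.0.0",
    "numpy": "1.24.3",
    "pandas": "1.3.5",
    "scikit-learn": "1.2.2",
    "scipy": "1.10.1",
    "torch": "1.8.0+cu111",
    "transformers": "4.29.2",
}


def installed_version(name: str) -> str:
    """Return the installed version of a distribution, or 'missing'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "missing"


def version_mismatches(
    expected: Dict[str, str],
    lookup: Callable[[str], str] = installed_version,
) -> Dict[str, str]:
    """Map each package whose version differs to an 'expected -> found' string."""
    report = {}
    for name, wanted in expected.items():
        found = lookup(name)
        if found != wanted:
            report[name] = f"{wanted} -> {found}"
    return report
```

Calling `version_mismatches(EXPECTED)` returns an empty dictionary when every listed package matches. A mismatch is not necessarily a problem, but it is a reasonable first place to look if a step fails.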
Download and set up the following pre-trained models:
For aa to ss8 (step 2):
- ESM1b_t33_650M_UR50S
- Prot_T5_XL_UniRef50
- SPOT1DLM_checkpoints
For a simple run, predict BPO/CCO/MFO in one batch (step 3.1):
- s3_AlphaBeta_bpccmf
For higher precision, predict BPO/CCO/MFO separately (step 3.2):
- s3_AlphaBeta_TrainALL00_TestALL00_bp_aaK16F32768_ss8K32F32768
- s3_AlphaBeta_TrainALL00_TestALL00_cc_aaK16F32768_ss8K48F16384
- s3_AlphaBeta_TrainALL00_TestALL00_mf_aaK16F32768_ss8K32F32768
Download links:
# ESM-1b
# Save to: /home/USERNAME/.cache/torch/hub/checkpoints/esm1b_t33_650M_UR50S-contact-regression.pt & esm1b_t33_650M_UR50S.pt
https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt
# ProtTrans
# Save to: /home/USERNAME/.../Prot_T5_XL_UniRef50/
https://huggingface.co/Rostlab/prot_t5_xl_uniref50/tree/main
# Modify path_Prot_T5_XL_UniRef50 in step0_DataPreprocessingSetting.py according to previous path
path_Prot_T5_XL_UniRef50 = /home/USERNAME/.../Prot_T5_XL_UniRef50/
# SPOT1DLM_checkpoints.xz
# Unpack and save to: /home/fsong/work/py_proj/prot_algo/DeepSS2GO_v1/pub_data/SPOT1DLM_checkpoints
wget https://huggingface.co/orca233/DeepSS2GO/resolve/main/SPOT-LM-checkpoints.xz
# s3_AlphaBeta_bpccmf/
# s3_AlphaBeta_TrainALL00_TestALL00_bp_aaK16F32768_ss8K32F32768/
# s3_AlphaBeta_TrainALL00_TestALL00_cc_aaK16F32768_ss8K48F16384/
# s3_AlphaBeta_TrainALL00_TestALL00_mf_aaK16F32768_ss8K32F32768/
# Unpack and save to: ..../DeepSS2GO/PredictNew/s3_PredictNew_AlphaBeta/
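# Quote the URL and set the output name so the saved file does not end in "?download=true"
wget -O s3_AlphaBeta_bpccmf.tar.gz "https://huggingface.co/orca233/DeepSS2GO/resolve/main/s3_AlphaBeta_bpccmf.tar.gz?download=true"
wget -O s3_AlphaBeta_TrainALL00_TestALL00_bp_aaK16F32768_ss8K32F32768.tar.gz "https://huggingface.co/orca233/DeepSS2GO/resolve/main/s3_AlphaBeta_TrainALL00_TestALL00_bp_aaK16F32768_ss8K32F32768.tar.gz?download=true"
wget -O s3_AlphaBeta_TrainALL00_TestALL00_cc_aaK16F32768_ss8K48F16384.tar.gz "https://huggingface.co/orca233/DeepSS2GO/resolve/main/s3_AlphaBeta_TrainALL00_TestALL00_cc_aaK16F32768_ss8K48F16384.tar.gz?download=true"
wget -O s3_AlphaBeta_TrainALL00_TestALL00_mf_aaK16F32768_ss8K32F32768.tar.gz "https://huggingface.co/orca233/DeepSS2GO/resolve/main/s3_AlphaBeta_TrainALL00_TestALL00_mf_aaK16F32768_ss8K32F32768.tar.gz?download=true"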
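After unpacking, a quick check that the four model directories are in place can save a failed run later. The sketch below assumes each archive unpacks into a directory of the same name under the s3_PredictNew_AlphaBeta/ directory, as the comments above suggest. The function name `missing_model_dirs` is hypothetical, and the path you pass in is your own.

```python
from pathlib import Path
from typing import List

# Directory names taken from the model list above.
MODEL_DIRS = [
    "s3_AlphaBeta_bpccmf",
    "s3_AlphaBeta_TrainALL00_TestALL00_bp_aaK16F32768_ss8K32F32768",
    "s3_AlphaBeta_TrainALL00_TestALL00_cc_aaK16F32768_ss8K48F16384",
    "s3_AlphaBeta_TrainALL00_TestALL00_mf_aaK16F32768_ss8K32F32768",
]


def missing_model_dirs(predict_dir: str, names: List[str] = MODEL_DIRS) -> List[str]:
    """Return the model directory names that are not present under predict_dir."""
    base = Path(predict_dir)
    return [name for name in names if not (base / name).is_dir()]
```

An empty list means all four directories were found. Only `s3_AlphaBeta_bpccmf` is needed for step 3.1, so a partial result may be acceptable.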
In this section, the primary amino acid sequence (aa) is converted to secondary structure (ss8) by a modified SPOT-1D-LM algorithm (Ref).
/pub_data/data_new/new_aa.fa
Input fasta file format example:
>slam1
MVIFYFCGKTFMPARNRWMLLLPLLASAAYAEETPREPDLRSRPEFRLHEAEVKPIDREKVPGQVREKGKVLQIDGETLLKNPELLSRAMYSAVVSNNIAGIRVILPIYLQQAQQDKMLALYAQGILAQADGRVKEAISHYRELIAAQPDAPAVRMRLAAALFENRQNEAAADQFDRLKAENLPPQLMEQVELYRKALRERDAWKVNGGFSVTREHNINQAPKRQQYGKWTFPKQVDGTAVNYRLGAEKKWSLKNGWYTTAGGDVSGRVYPGNKKFNDMTAGVSGGIGFADRRKDAGLAVFHERRTYGNDAYSYTNGARLYFNRWQTPKWQTLSSAEWGRLKNTRRARSDNTHLQISNSLVFYRNARQYWMGGLDFYRERNPADRGDNFNRYGLRFAWGQEWGGSGLSSLLRLGAAKRHYEKPGFFSGFKGERRRDKELNTSLSLWHRALHFKGITPRLTLSHRETRSNDVFNEYEKNRAFVEFNKTF
>slam2
MLYFRYGFLVVWCAAGVSAAYGADAPAILDDKALLQVQRSVSDKWAESDWKVENDAPRVVDGDFLLAHPKMLEHSLRDALNGNQADLIASLADLYAKLPDYDAVLYGRARALLAKLAGRPAEAVARYRELHGENAADERILLDLAAAEFDDFRLKSAERHFAEAAKLDLPAPVLENVGRFRKKTEGLTGWRFSGGISPAVNRNANNAAPQYCRQNGGRQICSVSRAERAAGLNYEIEAEKLTPLADNHYLLFRSNIGGTSYYFSKKSAYDDGFGRAYLGWQYKNARQTAGILPFYQVQLSGSDGFDAKTKRVNNRRLPPYMLAHGVGVQLSHTYRPNPGWQFSVALEHYRQRYREQDRAEYNNGRQDGFYVSSAKRLGESATVFGGWQFVRFVPKRETVGGAVNNAAYRRNGVYAGWAQEWRQLGGLNSRVSASYARRNYKGIAAFSTEAQRNREWNVSLALSHDKLSYKGIVPALNYRFGRTESNVPYAKRRNSEVFVSADWRF
If a protein name contains a dot ('.') or a sequence spans multiple lines, run utils_modified_input_fasta.py to normalize the input new_aa.fa to the example format.
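For illustration, the kind of normalization described above can be sketched as follows. This is not the code of utils_modified_input_fasta.py; it is a hypothetical stand-in that joins wrapped sequence lines and replaces dots in header lines. Replacing a dot with an underscore is an assumption, since the replacement character used by the actual script is not stated here.

```python
def normalize_fasta(text: str, dot_replacement: str = "_") -> str:
    """Return FASTA text with one header line and one sequence line per record.

    Dots in header lines are replaced (assumption: with an underscore).
    """
    records = []  # each item is [header, sequence]
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].replace(".", dot_replacement)
            records.append([header, ""])
        elif records:
            records[-1][1] += line  # join wrapped sequence lines
    return "".join(f">{header}\n{seq}\n" for header, seq in records)
```

Read the original file as text, pass it through `normalize_fasta`, and write the result to a new file so the original input is preserved.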
Navigate to the directory:
..../DeepSS2GO/PredictNew/s1_DataPreprocessing_PredictNew/
Execute steps 1-8 in s1_DataPreprocessing_New/. The following final files will be generated in /pub_data/data_new/:
If you require general predictions, run only step 3.1. For higher precision, proceed with step 3.2 instead.
Navigate to the directory and run:
..../DeepSS2GO/PredictNew/s3_PredictNew_AlphaBeta/s3_AlphaBeta_bpccmf/
# Modify (step6_cpData_Diamond4New.sh) with your own path
path_base="/home/USERNAME/work/py_proj/prot_algo/DeepSS2GO/"
bash step6_cpData_Diamond4New.sh # Copy these four *pkl/fa files to the corresponding directories and run diamond
bash step7_PredictAlphaBeta_New.sh # Set the threshold accordingly
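To make the effect of the threshold concrete, the sketch below keeps only the GO terms whose score reaches a chosen cutoff. It is a generic illustration, not the logic of step7_PredictAlphaBeta_New.sh: the nested-dictionary input format, the scores, and the function name `filter_by_threshold` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def filter_by_threshold(
    scores: Dict[str, Dict[str, float]], threshold: float
) -> Dict[str, List[Tuple[str, float]]]:
    """Keep GO terms scoring at or above threshold, sorted from high to low."""
    kept = {}
    for protein, go_scores in scores.items():
        hits = [(go, s) for go, s in go_scores.items() if s >= threshold]
        kept[protein] = sorted(hits, key=lambda item: item[1], reverse=True)
    return kept


# Hypothetical scores for one protein (values are made up for illustration).
example = {"slam1": {"GO:0008150": 0.91, "GO:0009987": 0.42, "GO:0008152": 0.15}}
selected = filter_by_threshold(example, threshold=0.4)
```

A higher threshold returns fewer, more confident terms, while a lower one returns more terms at the cost of more false positives.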
Find the results in the directory /data/ as:
Taking BPO as an example:
Navigate to the directory and perform the same steps as step 3.1:
..../DeepSS2GO/PredictNew/s3_PredictNew_AlphaBeta/s3_AlphaBeta_TrainALL00_TestALL00_bp_aaK16F32768_ss8K32F32768/
Same for CCO and MFO.
The paper has been published in Briefings in Bioinformatics.
https://academic.oup.com/bib/article/25/3/bbae196/7663430
Please cite it as follows:
@article{song2024deepss2go,
title={DeepSS2GO: protein function prediction from secondary structure},
author={Song, Fu V and Su, Jiaqi and Huang, Sixing and Zhang, Neng and Li, Kaiyue and Ni, Ming and Liao, Maofu},
journal={Briefings in Bioinformatics},
volume={25},
number={3},
year={2024},
publisher={Oxford University Press},
doi = {10.1093/bib/bbae196},
url = {https://doi.org/10.1093/bib/bbae196},
}
Fu Song (songf@mail.sustech.edu.cn)
Ming Ni (niming@mgi-tech.com)
Maofu Liao (liaomf@sustech.edu.cn)