sj584 / EpiCluster

Apache License 2.0

Steps for data preprocessing #1

sj584 opened this issue 1 year ago

sj584 commented 1 year ago

This example shows data preprocessing for the Epitope3D external test set "epitope3d_dataset_45_Blind_Test.csv".

Tutorials

Prerequisite: a CSV file of PDB IDs in the Data_processing directory (with or without epitope labels)

  1. collect_pdb.py --> example_pdb/*.pdb

  2. collect_fasta.py --> example_fasta/*.fasta

  3. Make PSAIA data and PSSM data by...

PSAIA: in Structure Analyser, select Accessible Surface Area, Depth Index, Protrusion Index, and Hydrophobicity; set Analyse as Bound and Table Output.

PSI-BLAST: psiblast -query example.fasta -db swissprot -num_iterations 3 -out_ascii_pssm example.pssm
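To run that PSI-BLAST command over every FASTA file from step 2, a small driver script can help. This is a sketch, not part of the repo: the `psiblast` binary and a local `swissprot` BLAST database are assumed to be installed and on your PATH, and the directory names follow the tutorial above.

```python
import subprocess
from pathlib import Path

FASTA_DIR = Path("example_fasta")   # output of collect_fasta.py
PSSM_DIR = Path("example_pssm")     # where the .pssm files will go
PSSM_DIR.mkdir(exist_ok=True)

def psiblast_cmd(fasta: Path) -> list[str]:
    """Build the psiblast command line for one FASTA file."""
    pssm = PSSM_DIR / (fasta.stem + ".pssm")
    return [
        "psiblast",
        "-query", str(fasta),
        "-db", "swissprot",            # assumes a local swissprot BLAST db
        "-num_iterations", "3",
        "-out_ascii_pssm", str(pssm),
    ]

# Run PSI-BLAST for every collected FASTA file
for fasta in sorted(FASTA_DIR.glob("*.fasta")):
    subprocess.run(psiblast_cmd(fasta), check=True)
```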

After this process, you'll get

example_psaia/*.tbl, example_pssm/*.pssm
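Before continuing, it is worth checking that every PDB ID got both a PSAIA `.tbl` and a PSI-BLAST `.pssm` file. A quick sanity-check sketch, assuming the output files are named `<PDB_ID>.tbl` / `<PDB_ID>.pssm` (adjust the stems if PSAIA names its tables differently on your system):

```python
from pathlib import Path

def missing_outputs(pdb_ids, psaia_dir="example_psaia", pssm_dir="example_pssm"):
    """Return the PDB IDs that lack a .tbl or a .pssm file."""
    tbls = {p.stem for p in Path(psaia_dir).glob("*.tbl")}
    pssms = {p.stem for p in Path(pssm_dir).glob("*.pssm")}
    # An ID is complete only if it appears in BOTH output sets
    return sorted(set(pdb_ids) - (tbls & pssms))
```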

  4. generate_graphs.py --> example_graphs_5A.pkl

  5. remove_symmetry.py --> example_nonsym_graphs_5A.pkl

  6. PSAIA_PSSM_2_pkl.py --> example_psaia.pkl, example_pssm.pkl

  7. generate_label.py (not needed if you don't have epitope labels) --> example_label_dict.pkl

  8. generate_dataset.py --> example_pyg_surface_<RSA>.pkl

kgzhang commented 1 year ago

Can you provide the files (4, 5, 6, 7) you mentioned above? I would like to reproduce the amazing results from your paper, thanks!

sj584 commented 1 year ago

Thanks for reaching out! I initially didn't upload the files above because they were too heavy.

I uploaded the files to the Google Drive link below: https://drive.google.com/drive/folders/1erW8dht3YB6dAH6Z1YkbXHefgxfR_qjd?usp=share_link I'll fix the README so that everyone can access the files easily.

kgzhang commented 1 year ago

Thanks for your reply! I found only 45 example PDB files in your data, which is not enough for model training. I was wondering if you could provide the source Python files you used for data processing?

These files (generate_graphs.py, remove_symmetry.py, PSAIA_PSSM_2_pkl.py, generate_label.py) are missing from your git repo.

sj584 commented 1 year ago

Thank you for pointing out my mistakes.

I uploaded the .py files you noted to the GitHub repository, and I also uploaded the preprocessed Epitope3D dataset to the Google Drive link.

  1. train_pyg_surface_0.15 (180 PDB)
  2. test_pyg_surface_0.15 (20 PDB)

After you load them with pickle, you'll get two lists. Just concatenate them to get the full training set, e.g. total_train = train_list + test_list, where the two lists come from train_pyg_surface_0.15 and test_pyg_surface_0.15.
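In code, the merge described above might look like this (a sketch, assuming the two files are plain pickled lists; the file names with the `.pkl` extension are an assumption based on the names quoted above):

```python
import pickle

def load_graph_list(path):
    """Load one pickled graph list (e.g. train_pyg_surface_0.15)."""
    with open(path, "rb") as f:
        return pickle.load(f)

def combined_dataset(train_path, test_path):
    """Concatenate the 180-PDB train list and the 20-PDB test list."""
    return load_graph_list(train_path) + load_graph_list(test_path)

# total_train = combined_dataset("train_pyg_surface_0.15.pkl",
#                                "test_pyg_surface_0.15.pkl")
```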

The model (.pt) trained on all 200 PDBs is stored in the checkpoint file.