sj584 / EpiCluster

Apache License 2.0

Steps for data preprocessing #1

sj584 opened this issue 1 year ago

sj584 commented 1 year ago

This example shows data preprocessing for the Epitope3D external test set "epitope3d_dataset_45_Blind_Test.csv".

Tutorials

Prerequisite: a CSV file of PDB IDs in the Data_processing directory (with or without epitope labels)

  1. collect_pdb.py --> example_pdb/*.pdb

  2. collect_fasta.py --> example_fasta/*.fasta

  3. Make PSAIA data and PSSM data by...

PSAIA: in Structure Analyser, select Accessible Surface Area, Depth Index, Protrusion Index, and Hydrophobicity; set Analyse as Bound and Table Output.

PSI-BLAST: psiblast -query example.fasta -db swissprot -num_iterations 3 -out_ascii_pssm example.pssm
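To run that PSI-BLAST command over every FASTA file from step 2, a small driver script can help. This is a sketch, not part of the repo: the `psiblast` binary and a local `swissprot` BLAST database are assumed to be installed and on your PATH, and the directory names follow the tutorial above.

```python
import subprocess
from pathlib import Path

FASTA_DIR = Path("example_fasta")   # output of collect_fasta.py
PSSM_DIR = Path("example_pssm")     # where the .pssm files will go
PSSM_DIR.mkdir(exist_ok=True)

def psiblast_cmd(fasta: Path) -> list[str]:
    """Build the psiblast command line for one FASTA file."""
    pssm = PSSM_DIR / (fasta.stem + ".pssm")
    return [
        "psiblast",
        "-query", str(fasta),
        "-db", "swissprot",            # assumes a local swissprot BLAST db
        "-num_iterations", "3",
        "-out_ascii_pssm", str(pssm),
    ]

# Run PSI-BLAST for every collected FASTA file
for fasta in sorted(FASTA_DIR.glob("*.fasta")):
    subprocess.run(psiblast_cmd(fasta), check=True)
```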

After this process, you'll get

example_psaia/*.tbl, example_pssm/*.pssm
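Before continuing, it is worth checking that every PDB ID got both a PSAIA `.tbl` and a PSI-BLAST `.pssm` file. A quick sanity-check sketch, assuming the output files are named `<PDB_ID>.tbl` / `<PDB_ID>.pssm` (adjust the stems if PSAIA names its tables differently on your system):

```python
from pathlib import Path

def missing_outputs(pdb_ids, psaia_dir="example_psaia", pssm_dir="example_pssm"):
    """Return the PDB IDs that lack a .tbl or a .pssm file."""
    tbls = {p.stem for p in Path(psaia_dir).glob("*.tbl")}
    pssms = {p.stem for p in Path(pssm_dir).glob("*.pssm")}
    # An ID is complete only if it appears in BOTH output sets
    return sorted(set(pdb_ids) - (tbls & pssms))
```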

  4. generate_graphs.py --> example_graphs_5A.pkl

  5. remove_symmetry.py --> example_nonsym_graphs_5A.pkl

  6. PSAIA_PSSM_2_pkl.py --> example_psaia.pkl, example_pssm.pkl

  7. generate_label.py (not needed if you don't have epitope labels) --> example_label_dict.pkl

  8. generate_dataset.py --> example_pyg_surface_<RSA>.pkl

kgzhang commented 1 year ago

Can you provide the files (4, 5, 6, 7) you mentioned above? I would like to reproduce the amazing results from your paper, thanks!

sj584 commented 1 year ago

Thanks for reaching out! I initially didn't upload the files above because they were too heavy.

I uploaded the files to the Google Drive link below: https://drive.google.com/drive/folders/1erW8dht3YB6dAH6Z1YkbXHefgxfR_qjd?usp=share_link I'll fix the README so that everyone can access the files easily.

kgzhang commented 1 year ago

Thanks for your reply! I found only 45 example PDB files in your data, which is not enough for model training. I was wondering if you could provide the source Python files you used for data processing?

These files (generate_graphs.py, remove_symmetry.py, PSAIA_PSSM_2_pkl.py, generate_label.py) are missing from your git repo.

sj584 commented 1 year ago

Thank you for pointing out my mistakes.

I uploaded the .py files you noted to the GitHub repository, and I also uploaded the preprocessed Epitope3D dataset to the Google Drive link.

  1. train_pyg_surface_0.15 (180 PDB)
  2. test_pyg_surface_0.15 (20 PDB)

After you load them with pickle, you'll get two lists. Just concatenate them to get the full training set, e.g. total_train = train_list + test_list, where the two lists come from train_pyg_surface_0.15 and test_pyg_surface_0.15.
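In code, the merge described above might look like this (a sketch, assuming the two files are plain pickled lists; the file names with the `.pkl` extension are an assumption based on the names quoted above):

```python
import pickle

def load_graph_list(path):
    """Load one pickled graph list (e.g. train_pyg_surface_0.15)."""
    with open(path, "rb") as f:
        return pickle.load(f)

def combined_dataset(train_path, test_path):
    """Concatenate the 180-PDB train list and the 20-PDB test list."""
    return load_graph_list(train_path) + load_graph_list(test_path)

# total_train = combined_dataset("train_pyg_surface_0.15.pkl",
#                                "test_pyg_surface_0.15.pkl")
```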

The model (.pt) trained on all 200 PDBs is stored in the checkpoint file.