lemaslab / CAMP

predicting peptide-protein interactions

DEBUG: step1_pdb_process.py #7

Open dlemas opened 1 year ago

dlemas commented 1 year ago

Please work to debug the step1_pdb_process.py script.

Script: https://github.com/lemaslab/CAMP/blob/master/data_prepare/step1_pdb_process.py

Input Data (RCSB PDB): Download the FASTA sequence file from ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt.gz, along with the corresponding PDB structure files.
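For anyone picking this up, here is a minimal sketch of how pdb_seqres.txt can be streamed record by record instead of being loaded into memory at once. It assumes the usual layout of that file, where a header line such as `>101m_A mol:protein length:154  MYOGLOBIN` is followed by one sequence line. The names `SeqresRecord`, `parse_seqres`, and `open_seqres` are made up for illustration and are not taken from step1_pdb_process.py, and the sample record is invented.

```python
import gzip
import io
from typing import Iterator, NamedTuple, TextIO


class SeqresRecord(NamedTuple):
    pdb_id: str
    chain: str
    mol_type: str
    length: int
    sequence: str


def open_seqres(path: str) -> TextIO:
    """Open pdb_seqres.txt, transparently handling the .gz download."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def parse_seqres(handle: TextIO) -> Iterator[SeqresRecord]:
    """Yield one record per header/sequence pair, one line at a time."""
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
        elif header is not None:
            fields = header.split()
            pdb_id, chain = fields[0].split("_", 1)
            mol_type = fields[1].split(":", 1)[1]
            length = int(fields[2].split(":", 1)[1])
            yield SeqresRecord(pdb_id, chain, mol_type, length, line)
            header = None


# Invented two-record sample in the same layout as pdb_seqres.txt.
sample = (
    ">1xyz_A mol:protein length:5  EXAMPLE PEPTIDE\nMKVLA\n"
    ">1xyz_B mol:na length:4  EXAMPLE DNA\nACGT\n"
)
records = list(parse_seqres(io.StringIO(sample)))
proteins = [r for r in records if r.mol_type == "protein"]
```

Because `parse_seqres` is a generator, the same code should work on the small test file and on the full download without changes.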

Programs: PLIP

For each peptide-protein pair, the peptide sequence was directly obtained from the RCSB PDB with binding residues marked by PepBDB and the protein sequence was obtained by mapping to UniProt [12]. We first downloaded all complexes containing peptides as ligands from the RCSB PDB released by September 2019. Then we used the Protein Ligand Interaction Predictor (PLIP) program [10] (http://github.com/ssalentin/plip) to extract the interacting chains of peptide and protein sequences from the complex structures. Given a complex structure, PLIP recognizes seven types of non-covalent interactions, including hydrogen bonds, hydrophobic interactions, pi-stackings, pi-cations, salt bridges, water bridges and halogen bonds. A residue from the peptide and a residue from the protein with at least one non-covalent interaction between them were considered an interacting pair. We then retrieved the corresponding interacting labels from PepBDB [11], a structure database of peptide-protein complexes derived from the RCSB Protein Data Bank (PDB) [3–5], which contains the peptide residues involved in hydrogen bonds and hydrophobic contacts with the partner proteins. The peptide binding residues detected by PepBDB were then mapped to the peptide sequences (which were annotated from the RCSB PDB) using an alignment tool based on the Smith-Waterman algorithm [21] (https://github.com/mengyao/Complete-Striped-SmithWaterman-Library). To ensure high data quality, we only kept those peptide sequences with at least 80% matched residues. In total, we collected 7,233 peptide-protein pairs with 3,318 distinct protein sequences and 5,283 distinct peptide sequences, and 90.99% of the pairs had labels of peptide binding residues.
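To make the 80% filter concrete, here is a rough sketch of how it could be computed from a pair of aligned strings. The text above does not say how "matched residues" is counted, so this assumes it means identical aligned residues divided by the full peptide length. The function names are hypothetical, and the alignment itself (done with the Smith-Waterman library in the paper) is taken as given input.

```python
def matched_fraction(aligned_peptide: str, aligned_reference: str, peptide_length: int) -> float:
    """Fraction of peptide residues that align to an identical residue.

    Both strings are the gapped alignment rows ('-' marks a gap).
    Assumption: the denominator is the full peptide length.
    """
    if len(aligned_peptide) != len(aligned_reference):
        raise ValueError("aligned strings must have the same length")
    if peptide_length <= 0:
        raise ValueError("peptide_length must be positive")
    matches = sum(
        1
        for p, r in zip(aligned_peptide, aligned_reference)
        if p == r and p != "-"
    )
    return matches / peptide_length


def keep_peptide(aligned_peptide: str, aligned_reference: str,
                 peptide_length: int, threshold: float = 0.8) -> bool:
    """Keep a peptide when at least `threshold` of its residues match."""
    return matched_fraction(aligned_peptide, aligned_reference, peptide_length) >= threshold
```

If the original pipeline normalizes by alignment length rather than peptide length, only the denominator in `matched_fraction` would need to change.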

dlemas commented 1 year ago

I have been playing with CAMP this weekend and the first task I would like you to work on is running the simple script we reviewed on Friday, across the large PDB dataset. The other downstream functions we will modify/test are dependent on the output of the step1 script.

The pdb file contains >1 million lines of data and will likely need to be run on the supercomputer.

Please create a SLURM script to run "step1_pdb_parse1.py" that consumes pdb_seqres.txt (datafile). I have also included the small dataset (pdb_seqres_small.txt).

The expected output will be:

I have emailed you a link to the data. Please let me know if you have trouble gaining access to UFRC; I was able to log in this morning. I am not familiar with running Python code in general and have not run Python code on UFRC. Please reach out to folks at UFRC if you need help. There are a lot of training opportunities and online documentation.

dlemas commented 1 year ago

I was able to run the SLURM script that is located in:

/data_prepare/SLURM

Please confirm this works across the group.

evanhadam commented 1 year ago

From Dr. Lemas: I would like the coding team to 1) reproduce my work and 2) standardize inputs/outputs with meaningful names, ensure tabs/spaces are consistent across functions, and ultimately engineer a system that links all the pieces together into a simple script. I would also like to output the data for each function (to review manually) and call these functions from a separate script (run_step1.py). Please complete this work on a branch (dev) that can be merged into the main branch. I will also start to work with GitHub branches.

A couple of notes to help the group reproduce my work. I have split up the pieces of the code into smaller chunks that are easier to handle. The chunks are functions that are clearly described.
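One possible shape for run_step1.py, as a sketch only: each step is a plain function, and the driver writes every step's output to its own tab-separated file so it can be reviewed by hand. The step functions here (`drop_non_protein`, `keep_short_chains`), the row layout, and the 50-residue cutoff are placeholders for illustration, not the functions that exist in the repository.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, List, Sequence, Tuple

Row = Tuple[str, ...]
Step = Tuple[str, Callable[[List[Row]], List[Row]]]


def run_pipeline(rows: List[Row], steps: Sequence[Step], out_dir: str) -> List[Row]:
    """Run the steps in order, saving each step's output as NN_name.tsv."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    for index, (name, func) in enumerate(steps, start=1):
        rows = func(rows)
        with (out_path / f"{index:02d}_{name}.tsv").open("w", newline="") as handle:
            csv.writer(handle, delimiter="\t").writerows(rows)
    return rows


# Hypothetical steps; rows are (chain_id, mol_type, sequence).
def drop_non_protein(rows: List[Row]) -> List[Row]:
    return [row for row in rows if row[1] == "protein"]


def keep_short_chains(rows: List[Row], max_length: int = 50) -> List[Row]:
    # max_length is a placeholder cutoff, not a value confirmed by the repo.
    return [row for row in rows if len(row[2]) <= max_length]


if __name__ == "__main__":
    demo_rows = [("1xyz_A", "protein", "MKVLA"), ("1xyz_B", "na", "ACGT")]
    demo_steps = [
        ("drop_non_protein", drop_non_protein),
        ("keep_short_chains", keep_short_chains),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        final_rows = run_pipeline(demo_rows, demo_steps, tmp)
        print(final_rows)
        print(sorted(p.name for p in Path(tmp).iterdir()))
```

Keeping the step list as data makes it easy to add, remove, or reorder functions on the dev branch without touching the driver loop.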

AnthonyYao7 commented 1 year ago

Dr. Lemas, regarding the email I sent yesterday (9/27): I believe that using an offline version of PLIP is more appropriate, since we are analyzing several hundred thousand proteins. I think requesting them from the website would take much more time than running them on the server.
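As a starting point for the offline route, here is a small sketch that builds one local PLIP command per structure file. The flags (`-f` for an input file, `-x` for XML output, `-o` for the output directory, `--peptides` for the peptide chain) reflect my understanding of the PLIP command line and should be checked against `plip --help` on the server. The executable may also be named `plipcmd` in older installs. The helper names and example paths are made up.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_plip_command(pdb_path: str, peptide_chain: str, out_dir: str,
                       executable: str = "plip") -> List[str]:
    """Build the argument list for one local PLIP run.

    Assumption: flags as listed by `plip --help`; verify before use.
    """
    return [
        executable,
        "-f", str(pdb_path),           # local structure file
        "--peptides", peptide_chain,   # treat this chain as the peptide ligand
        "-x",                          # write an XML report
        "-o", str(out_dir),            # output directory
    ]


def run_plip(pdb_path: str, peptide_chain: str, out_dir: str) -> bool:
    """Run PLIP if it is installed; return True on a zero exit code."""
    command = build_plip_command(pdb_path, peptide_chain, out_dir)
    if shutil.which(command[0]) is None:
        return False  # PLIP is not on PATH in this environment
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    return subprocess.run(command, check=False).returncode == 0


# Example: the command that would be run for a hypothetical file and chain.
example_command = build_plip_command("structures/1xyz.pdb", "B", "plip_out/1xyz")
```

Each command is independent, so the runs could be spread across a SLURM job array rather than executed one after another.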