Open dlemas opened 1 year ago
I have been playing with CAMP this weekend and the first task I would like you to work on is running the simple script we reviewed on Friday, across the large PDB dataset. The other downstream functions we will modify/test are dependent on the output of the step1 script.
The pdb file contains >1 million lines of data and will likely need to be run on the supercomputer.
Please create a SLURM script to run "step1_pdb_parse1.py" that consumes pdb_seqres.txt (datafile). I have also included the small dataset (pdb_seqres_small.txt).
The expected output will be:
I have emailed you a link with the data. Please let me know if you are having trouble gaining access to UFRC. I was able to log in this morning. I am not familiar with running python code (generally) and have not worked on running python code on UFRC. Please reach out to folks at UFRC if you need help. There are a lot of training opportunities and online documentation.
I was able to run the SLURM script that is located in:
/data_prepare/SLURM
Please confirm this works across the group.
From Dr. Lemas: I would like the coding team to 1) reproduce my work and 2) standardize inputs/outputs with meaningful names, ensure tabs/spaces are consistent across functions, and ultimately engineer a system to link all the pieces together into a simple script. I would also like to output the data for each function (to review manually) and call these functions from a separate script (run_step1.py). Please complete this work on a branch (dev) that can be merged into the main branch. I will also start to work on github branches.
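One possible shape for run_step1.py is sketched below. It is only an illustration: `run_pipeline`, the step names, and the JSON output format are assumptions, not part of the existing CAMP code. The idea is that each function receives the previous function's output, and every intermediate result is written to its own file so it can be reviewed manually.

```python
import json
from pathlib import Path


def run_pipeline(steps, initial, out_dir):
    """Run (name, function) steps in order and save each step's output.

    `steps` is a list of (name, callable) pairs. Each callable takes the
    previous step's output and returns JSON-serializable data.
    Hypothetical helper, not taken from the CAMP repository.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    data = initial
    written = []
    for index, (name, func) in enumerate(steps):
        data = func(data)
        # One file per step, numbered so the review order is obvious.
        path = out_dir / f"{index:02d}_{name}.json"
        path.write_text(json.dumps(data, indent=2))
        written.append(path)
    return data, written
```

In practice the steps would be the functions split out of step1_pdb_process.py, passed in the order they should run.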
A couple of notes to help the group reproduce my work. I have split up the pieces of the code into smaller chunks that are easier to handle. The chunks are functions that are clearly described.
step1_pdb_process.py has 230 lines of code that cover 5 functions. I have successfully run 175 lines and 4 functions. The last chunk does not look tricky but my demo dataset is not sufficient to complete the tasks. More below.
step0- this function parses the raw PDB file (pdb_seqres.txt) into 2 output files (pdb_pep_chain and pdbid_all_fasta). I have created a SLURM script and a python script to run this on HiPerGator, as there are more than 1 million lines of data and this can't be run on my laptop. The SLURM script is 00_run_python_step1_parse1.sh & the python code is step1_pdb_parse1.py.
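For anyone reproducing step0, here is a minimal sketch of the parsing idea, not the actual step1_pdb_parse1.py. It assumes pdb_seqres.txt uses a header line such as `>101m_A mol:protein length:154  MYOGLOBIN` followed by one sequence line. The function names, the record layout, and the 50-residue peptide cutoff are all assumptions made for illustration.

```python
MAX_PEPTIDE_LENGTH = 50  # assumed cutoff for calling a chain a peptide


def parse_seqres(lines):
    """Yield one record per chain from pdb_seqres-style lines."""
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
        elif header is not None:
            fields = header.split()
            pdb_id, _, chain = fields[0].partition("_")
            mol_type = ""
            if len(fields) > 1 and ":" in fields[1]:
                mol_type = fields[1].split(":", 1)[1]
            yield {
                "pdb_id": pdb_id,
                "chain": chain,
                "mol_type": mol_type,
                "sequence": line,
            }
            header = None


def split_peptide_chains(records, max_len=MAX_PEPTIDE_LENGTH):
    """Return (peptide_records, all_records) from parsed chains."""
    all_records = list(records)
    peptides = [
        r for r in all_records
        if r["mol_type"] == "protein" and len(r["sequence"]) <= max_len
    ]
    return peptides, all_records
```

Because `parse_seqres` is a generator, it can be fed an open file handle directly (`with open("pdb_seqres.txt") as fh: ...`), so it reads the file line by line. Note that `split_peptide_chains` still collects every record in a list, which for the full dataset means holding all chains in memory at once.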
step1- this function consumes the pdb_pep_chain from step0 and - to the best of my knowledge - is expecting a subset of proteins that are ligands and the corresponding PLIP files for each protein/ligand. PLIP is a tool that predicts specific amino acids within a protein that "might" be involved with Protein-ligand interactions (thus the name PLIP!). Given the raw pdb dataset has hundreds if not thousands of proteins, I read over the paper and identified "benchmark" datasets that were analyzed. I have created a github issue for the PPDbench dataset we need to create. This includes 133 PDBs that can be downloaded and used for downstream analysis. To test the code, I included 2 proteins ("1cjr","1cvu") and created the PLIP files for these PDBs. A few notes on PLIP.
step1_pdb_parse2.rmd- I created an R script to subset the step0 files to include ONLY the demo proteins from the PPDbench dataset.
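If the group prefers to keep everything in python, the same subsetting could be sketched as follows. This is not a port of the R script: it assumes the step0 files are FASTA-style, with headers that begin `>{pdbid}_{chain}`, and `subset_fasta_by_pdb_id` is a hypothetical name.

```python
def subset_fasta_by_pdb_id(lines, keep_ids):
    """Keep only header/sequence lines whose PDB ID is in keep_ids.

    Assumes each header looks like '>1cjr_A ...' (PDB ID, underscore, chain).
    """
    keep = {pdb_id.lower() for pdb_id in keep_ids}
    selected = []
    keeping = False
    for line in lines:
        if line.startswith(">"):
            # The PDB ID is everything between '>' and the first underscore.
            pdb_id = line[1:].split("_", 1)[0].lower()
            keeping = pdb_id in keep
        if keeping:
            selected.append(line)
    return selected
```

For the demo proteins this would be called with `keep_ids=["1cjr", "1cvu"]`.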
Dr. Lemas, regarding the email I sent yesterday (9/27): I believe that using an offline version of PLIP is more appropriate, since we are analyzing several hundred thousand proteins. I think requesting them from the website will take much more time than running them on the server.
Please work to debug the step1_pdb_process.py script.
Script: https://github.com/lemaslab/CAMP/blob/master/data_prepare/step1_pdb_process.py
Input Data (RCSB PDB) : Download the fasta files from ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt.gz and pdb files
Programs: PLIP
For each peptide-protein pair, the peptide sequence was directly obtained from the RCSB PDB with binding residues marked by PepBDB and the protein sequence was obtained by mapping to UniProt [12]. We first downloaded all complexes containing peptides as ligands from the RCSB PDB released by September 2019. Then we used the Protein Ligand Interaction Predictor (PLIP) program [10] (http://github.com/ssalentin/plip) to extract the interacting chains of peptide and protein sequences from the complex structures. Given a complex structure, PLIP recognizes seven types of non-covalent interactions, including hydrogen bonds, hydrophobic interactions, pi-stackings, pi-cations, salt bridges, water bridges and halogen bonds. A residue from the peptide and another one from the protein, with at least one noncovalent interaction was considered as an interacting pair. We then retrieved the corresponding interacting labels from PepBDB [11], a structure database of peptide-protein complexes derived from the RCSB Protein Data Bank (PDB) [3–5], which contains the peptide residues involved in hydrogen bonds and hydrophobic contacts with the partner proteins. The peptide binding residues detected by PepBDB were then mapped to the peptide sequences (which were annotated from the RCSB PDB) using an alignment tool based on the Smith-Waterman algorithm [21] (https://github.com/mengyao/Complete-Striped-SmithWaterman-Library). To achieve the high quality of the data, we only kept those peptide sequences with at least 80% matched residues. In total, we collected 7,233 peptide-protein pairs with 3,318 distinct protein sequences and 5,283 distinct peptide sequences, and 90.99% of the pairs had labels of peptide binding residues.
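To make the "at least 80% matched residues" filter concrete, here is a small pure-python sketch. It is a stand-in for the SSW library named in the paper, not a wrapper around it. The scoring values (match=2, mismatch=-1, gap=-2) are assumptions, and so is the reading of "matched residues" as identical aligned residues divided by peptide length.

```python
def smith_waterman_matches(query, target, match=2, mismatch=-1, gap=-2):
    """Count identical aligned residues in the best local alignment."""
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align the two residues
                score[i - 1][j] + gap,      # gap in target
                score[i][j - 1] + gap,      # gap in query
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the local alignment ends (score 0).
    i, j = best_pos
    matches = 0
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if query[i - 1] == target[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            matches += query[i - 1] == target[j - 1]
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return matches


def keep_peptide(peptide, reference, threshold=0.8):
    """Apply the assumed filter: matched fraction of the peptide >= threshold."""
    if not peptide:
        return False
    return smith_waterman_matches(peptide, reference) / len(peptide) >= threshold
```

This linear-gap version is quadratic in sequence length and is meant only for checking the logic on small demo inputs such as the two test proteins; the full dataset would call for the SSW library the paper cites.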