DEBUG: step1_pdb_process.py

dlemas commented 1 year ago

Please work to debug the step1_pdb_process.py script.

Script: https://github.com/lemaslab/CAMP/blob/master/data_prepare/step1_pdb_process.py

Input Data (RCSB PDB): Download the FASTA sequences from ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt.gz, along with the corresponding PDB files.

Programs: PLIP

For each peptide-protein pair, the peptide sequence was directly obtained from the RCSB PDB with binding residues marked by PepBDB and the protein sequence was obtained by mapping to UniProt [12]. We first downloaded all complexes containing peptides as ligands from the RCSB PDB released by September 2019. Then we used the Protein Ligand Interaction Predictor (PLIP) program [10] (http://github.com/ssalentin/plip) to extract the interacting chains of peptide and protein sequences from the complex structures. Given a complex structure, PLIP recognizes seven types of non-covalent interactions: hydrogen bonds, hydrophobic interactions, pi-stackings, pi-cations, salt bridges, water bridges and halogen bonds. A residue from the peptide and a residue from the protein with at least one non-covalent interaction between them were considered an interacting pair. We then retrieved the corresponding interacting labels from PepBDB [11], a structure database of peptide-protein complexes derived from the RCSB Protein Data Bank (PDB) [3–5], which contains the peptide residues involved in hydrogen bonds and hydrophobic contacts with the partner proteins. The peptide binding residues detected by PepBDB were then mapped to the peptide sequences (which were annotated from the RCSB PDB) using an alignment tool based on the Smith-Waterman algorithm [21] (https://github.com/mengyao/Complete-Striped-SmithWaterman-Library). To ensure high data quality, we only kept those peptide sequences with at least 80% matched residues. In total, we collected 7,233 peptide-protein pairs with 3,318 distinct protein sequences and 5,283 distinct peptide sequences, and 90.99% of the pairs had labels of peptide binding residues.
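
To make the 80% filter in the paragraph above concrete, here is a minimal pure-Python sketch of a Smith-Waterman local alignment followed by a matched-residue check. It is not the striped library the paper links to. The scoring values, the function names, and the reading of "matched residues" as identical aligned positions divided by the peptide length are all assumptions for illustration.

```python
def smith_waterman_matches(query, target, match=2, mismatch=-1, gap=-2):
    """Count identical aligned residues in the best local alignment.

    Scoring values are illustrative assumptions, not the paper's settings.
    """
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming matrix; scores never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            cell = max(0,
                       score[i - 1][j - 1] + sub,  # align the two residues
                       score[i - 1][j] + gap,      # gap in target
                       score[i][j - 1] + gap)      # gap in query
            score[i][j] = cell
            if cell > best:
                best, best_pos = cell, (i, j)

    # Trace back from the best cell until the local alignment ends (score 0).
    i, j = best_pos
    matches = 0
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if query[i - 1] == target[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            if query[i - 1] == target[j - 1]:
                matches += 1
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return matches


def keep_peptide(rcsb_peptide, pepbdb_peptide, min_fraction=0.8):
    """Keep a peptide if enough of its residues match (assumed definition)."""
    if not rcsb_peptide:
        return False
    matched = smith_waterman_matches(rcsb_peptide, pepbdb_peptide)
    return matched / len(rcsb_peptide) >= min_fraction
```

For the full dataset the pure-Python loops would be slow; the sketch is only meant to show what the filter computes.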

dlemas commented 1 year ago

I have been playing with CAMP this weekend and the first task I would like you to work on is running the simple script we reviewed on Friday, across the large PDB dataset. The other downstream functions we will modify/test are dependent on the output of the step1 script.

The pdb file contains >1 million lines of data and will likely need to be run on the supercomputer.

Please create a SLURM script to run "step1_pdb_parse1.py" that consumes pdb_seqres.txt (datafile). I have also included the small dataset (pdb_seqres_small.txt).

The expected output will be:

pdbid_all_fasta
pdb_pep_chain
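
For anyone reproducing this, the parsing step might look roughly like the sketch below. It assumes the usual pdb_seqres.txt layout (a header such as `>101m_A mol:protein length:154  MYOGLOBIN` followed by one sequence line). The 50-residue peptide cutoff and the tab-separated output lines are assumptions, not taken from step1_pdb_parse1.py, so check them against the real script.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple, Tuple

# Header layout assumed: ">101m_A mol:protein length:154  MYOGLOBIN"
HEADER = re.compile(r"^>(\w{4})_(\S+)\s+mol:(\S+)\s+length:(\d+)")


class Chain(NamedTuple):
    pdb_id: str
    chain: str
    mol_type: str
    length: int      # length as stated in the header
    sequence: str


def parse_seqres(lines: Iterable[str]) -> Iterator[Chain]:
    """Stream (header, sequence) pairs so the >1M-line file is never held in memory."""
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = HEADER.match(line)  # None if the header is malformed
        elif header is not None:
            pdb_id, chain, mol_type, length = header.groups()
            yield Chain(pdb_id, chain, mol_type, int(length), line)
            header = None


def split_outputs(chains: Iterable[Chain],
                  max_peptide_len: int = 50) -> Tuple[List[str], List[str]]:
    """Build lines for pdbid_all_fasta and pdb_pep_chain (formats are assumptions)."""
    all_fasta, peptide_chains = [], []
    for c in chains:
        if c.mol_type != "protein":
            continue  # skip nucleic-acid chains
        all_fasta.append(f"{c.pdb_id}_{c.chain}\t{c.sequence}")
        if len(c.sequence) <= max_peptide_len:  # hypothetical peptide cutoff
            peptide_chains.append(f"{c.pdb_id}\t{c.chain}")
    return all_fasta, peptide_chains
```

Called as `split_outputs(parse_seqres(open("pdb_seqres_small.txt")))`, this can be tried on the small dataset before submitting the full file through SLURM.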

I have emailed you a link with the data. Please let me know if you are having trouble gaining access to UFRC. I was able to log in this morning. I am not familiar with running python code (generally) and have not worked on running python code on UFRC. Please reach out to folks at UFRC if you need help. There are a lot of training opportunities and online documentation.

dlemas commented 1 year ago

I was able to run the SLURM script that is located in:

/data_prepare/SLURM

Please confirm this works across the group.

evanhadam commented 1 year ago

From Dr. Lemas: I would like the coding team to 1) reproduce my work and 2) standardize inputs/outputs with meaningful names, ensure the tabs/spaces are consistent across functions, and ultimately engineer a system to link all the pieces together into a simple script. I would also like to output the data for each function (to review manually) and call these functions with a separate script (run_step1.py). Please complete this work on a branch (dev) that can be merged into the main branch. I will also start to work on GitHub branches.
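
One possible shape for run_step1.py is sketched below, assuming each chunk is wrapped as a function that reads and writes named files in a working directory. The `Step` class, `run_pipeline`, and the convention that every function takes the working directory are hypothetical, not existing code in the repo.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List, Sequence


@dataclass
class Step:
    """One chunk of step1 with explicitly named input and output files."""
    name: str
    inputs: Sequence[str]
    outputs: Sequence[str]
    func: Callable[[Path], None]  # hypothetical: reads/writes inside workdir


def run_pipeline(steps: Sequence[Step], workdir) -> List[str]:
    """Run steps in order, failing early if a file is missing."""
    workdir = Path(workdir)
    completed = []
    for step in steps:
        missing = [n for n in step.inputs if not (workdir / n).is_file()]
        if missing:
            raise FileNotFoundError(f"{step.name}: missing input(s) {missing}")
        step.func(workdir)
        # Every step leaves its output on disk so it can be reviewed manually.
        not_written = [n for n in step.outputs if not (workdir / n).is_file()]
        if not_written:
            raise RuntimeError(
                f"{step.name}: expected output(s) not written: {not_written}")
        completed.append(step.name)
    return completed
```

A step list would then read something like `Step("parse0", ["pdb_seqres.txt"], ["pdbid_all_fasta", "pdb_pep_chain"], parse0)`, where `parse0` stands in for the wrapped parsing function.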

A couple of notes to help the group reproduce my work. I have split up the pieces of the code into smaller chunks that are easier to handle. The chunks are functions that are clearly described.

step1_pdb_process.py has 230 lines of code that cover 5 functions. I have successfully run 175 lines and 4 functions. The last chunk does not look tricky but my demo dataset is not sufficient to complete the tasks. More below.
- step0- this function parses the raw PDB file (pdb_seqres.txt) into 2 output files (pdb_pep_chain and pdbid_all_fasta). I have created a SLURM script and a Python script to run this on HiPerGator, as there are over 1 million lines of data and this can't be run on my laptop. The SLURM script is 00_run_python_step1_parse1.sh and the Python code is step1_pdb_parse1.py.
  - Please run these scripts, and rename them to be consistent with parent script nomenclature. @Yao, Anthony J.@Hadam, Evan J.
    - 00_run_python_step1_parse0.sh
    - 00_step1_pdb_parse0.py
- step1- this function consumes the pdb_pep_chain from step0 and - to the best of my knowledge - is expecting a subset of proteins that are ligands and the corresponding PLIP files for each protein/ligand. PLIP is a tool that predicts specific amino acids within a protein that "might" be involved with Protein-ligand interactions (thus the name PLIP!). Given the raw pdb dataset has hundreds if not thousands of proteins, I read over the paper and identified "benchmark" datasets that were analyzed. I have created a github issue for the PPDbench dataset we need to create. This includes 133 PDBs that can be downloaded and used for downstream analysis. To test the code, I included 2 proteins ("1cjr","1cvu") and created the PLIP files for these PDBs. A few notes on PLIP.
  - PLIP CLI via docker- very cool but the output is XML and we need txt files.
  - PLIP server: https://plip-tool.biotec.tu-dresden.de/plip-web/plip/index
    - for each PDB you download the RST file/text files.
    - naming scheme for step1 function: [PDBID]_[chain]_result.txt
      - 1cjr_a_result.txt
      - 1cvu_f_result.txt
    - files must be located in ./peptide_result. I have already created this directory.
  - Please download PPDbench dataset (133), ligands. @Natalie Good
  - Create PLIP files for each protein. Note, not all proteins will have PLIP files (i.e. no interactions predicted). Place these into the directory below. @Natalie Good
    - ~/CAMP/data_prepare/step1/peptide_result
- step1_pdb_parse2.rmd- I created an R script to subset the step0 files to include ONLY the demo proteins from the PPDbench dataset.
  - Output from this script feeds the step1 and step2 functions. This needs to be modified to include the 133 proteins. This can also be written in Python if needed. @Hadam, Evan J.@Yao, Anthony J.
    - pdb_pep_chain_demo
    - pdbid_all_fasta_demo
  - step1_pdb_parse3.py- this script consumes pdb_pep_chain_demo created by step1_pdb_parse2.rmd as well as the PLIP files in ./peptide_result.
  - output from this script:
    - plip_predict_result
  - step1_pdb_parse4.py- this script consumes the plip_predict_result and the pdbid_all_fasta_demo (step1_pdb_parse2.rmd).
    - outputs a file:
      - df_predict_det1
  - step1_pdb_parse5_tmp.py- this file is where I have left off. I can get the code to run, not as a function, but by importing the data and running the code line by line. We needed to make slight modifications; however, I have noticed the output includes proteins that DON'T fit some criteria. Given we only have 2 proteins, I think the code "chokes" because we don't have a large enough dataset. I am still working through the goals of the code; however, I think we need the full PPDbench (133 proteins) to test this chunk of code.
    - Input for this code is Uniprot files (pdb_chain_uniprot.tsv)- [downloaded here] and available in the repo.
      - pdb_chain_uniprot
      - df_predict_det1 (step1_pdb_parse4.py)
    - output
      - df_predict_det2_no_uni
      - df_predict_det3
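
Since not every protein will have a PLIP file, a small check like the following could list which `[PDBID]_[chain]_result.txt` files are still missing from ./peptide_result before step1 is run. The function names are hypothetical, and lower-casing the PDB ID and chain is an assumption based on the two example file names above.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def plip_result_name(pdb_id: str, chain: str) -> str:
    """Build the file name step1 expects: [PDBID]_[chain]_result.txt.

    Lower-casing is an assumption taken from the examples (1cjr_a_result.txt).
    """
    return f"{pdb_id.lower()}_{chain.lower()}_result.txt"


def find_missing_plip_files(pairs: Iterable[Tuple[str, str]],
                            result_dir="./peptide_result") -> List[Tuple[str, str]]:
    """Return the (pdb_id, chain) pairs that have no PLIP result file yet."""
    result_dir = Path(result_dir)
    return [(pdb_id, chain) for pdb_id, chain in pairs
            if not (result_dir / plip_result_name(pdb_id, chain)).is_file()]
```

Running this over the 133 PPDbench entries would give a list to review: either PLIP predicted no interactions for that entry, or the file still needs to be generated.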

AnthonyYao7 commented 1 year ago

Dr. Lemas, regarding the email I sent yesterday (9/27): I believe that using an offline version of PLIP is more appropriate, since we are analyzing several hundred thousand proteins. I think requesting results from the website will take much more time than running PLIP on the server.
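
As a starting point for the offline route, here is a minimal sketch that builds one PLIP command per PDB file. It assumes the `plip` executable is on the PATH and that the `-f` (input file), `-t` (text report), `-o` (output directory) and `--peptides` (peptide chains) options behave as described in PLIP's documentation; please confirm with `plip --help` on the installed version. The helper names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List, Sequence


def build_plip_command(pdb_file, out_dir,
                       peptide_chains: Sequence[str] = ()) -> List[str]:
    """Assemble a PLIP call that asks for a text report (flags are assumptions)."""
    cmd = ["plip", "-f", str(pdb_file), "-t", "-o", str(out_dir)]
    if peptide_chains:
        # Treat the listed chains as peptide ligands.
        cmd += ["--peptides", *peptide_chains]
    return cmd


def run_plip_batch(pdb_files: Iterable, out_root,
                   dry_run: bool = True) -> List[List[str]]:
    """Build (and optionally run) one PLIP command per PDB file."""
    commands = []
    for pdb_file in pdb_files:
        pdb_file = Path(pdb_file)
        out_dir = Path(out_root) / pdb_file.stem  # one folder per structure
        cmd = build_plip_command(pdb_file, out_dir)
        commands.append(cmd)
        if not dry_run:
            out_dir.mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)
    return commands
```

With `dry_run=True` the function only returns the commands, which makes it easy to write them into a SLURM job array. The report PLIP writes will probably still need renaming to the [PDBID]_[chain]_result.txt scheme that step1 expects.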

lemaslab / CAMP

DEBUG: step1_pdb_process.py #7