Open ReneRanzinger opened 3 weeks ago
@rykahsay Add a PDB section to the protein details API. Discuss with Raja whether to host the coordinates, and based on that decide whether to provide an API or a URL.
For glycan pdb api - https://api.glygen.org/glycan/pdb/G17689DH/
@rykahsay
Here are the selection criteria for PDB Protein Structures from @jeet-vora and Raja, which you can use for filtering the downloaded PDB files.
Rules:
1) Length - Select the PDB accession/structure that contains the longest aa sequence.
2) Method - Structures resolved by the X-ray method should be shortlisted first. If no X-ray structure is available, NMR structures are to be selected.
3) Resolution - From the shortlisted X-ray structures, choose the one with the highest resolution (i.e., the lowest value in Å). NMR structures do not have a resolution, so select the NMR structure with the longest sequence.
4) Number of chains - If two structures are identical on criteria 1, 2 and 3, choose the accession with the lower number of chains.
Let me know if you need any more information.
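For anyone implementing the filter, here is a minimal Python sketch of the rules above. It assumes the four criteria are applied in the listed order as successive tie-breakers, which is one reading of the rules and not necessarily how the pipeline does it. The `Structure` class, its field names, and the IDs in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Structure:
    # Hypothetical record; field names are illustrative, not the dataset's.
    pdb_id: str
    seq_length: int               # number of amino acids covered
    method: str                   # "xray" or "nmr"
    resolution: Optional[float]   # in Å; None for NMR
    n_chains: int


def select_structure(candidates: List[Structure]) -> Structure:
    """Pick one structure using the four rules as ordered tie-breakers."""
    def sort_key(s: Structure):
        method_rank = 0 if s.method == "xray" else 1          # X-ray before NMR
        res = s.resolution if s.resolution is not None else float("inf")
        # Longest sequence, then method, then lowest Å value, then fewest chains.
        return (-s.seq_length, method_rank, res, s.n_chains)
    return min(candidates, key=sort_key)
```

With hypothetical candidates `XXX1` (100 aa, X-ray, 2.0 Å), `XXX2` (100 aa, X-ray, 1.5 Å) and `XXX3` (100 aa, NMR), the function returns `XXX2`.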
I have created new datasets *_protein_pdb_map.csv (also included them in dataset-masterlist.json) ... the last column shows if the mapping is selected or not. Please check and verify that the selection is done properly.
"uniprotkb_canonical_ac","sequence_region","pdb_chain","start_pos","end_pos","overlap_ratio","overlap_category","experimental_method","resolution","selection_flag"
"P51610-1","region_1","4GO6","1806","2035","1.0","0.75","X-Ray_Crystallography","2.7","True"
"P51610-1","region_2","4GO6","360","402","1.0","0.75","X-Ray_Crystallography","2.7","True"
"P51610-1","region_3","4N3A","1072","1097","1.0","0.75","X-Ray_Crystallography","1.88","True"
"P51610-1","region_3","4N3B","1072","1097","1.0","0.75","X-Ray_Crystallography","2.17","False"
"P51610-1","region_3","4N3C","1072","1097","1.0","0.75","X-Ray_Crystallography","2.55","False"
"P51610-1","region_3","4N39","1082","1097","0.6","0.5","X-Ray_Crystallography","1.76","False"
"P51610-1","region_3","5LWV","1078","1095","0.68","0.5","X-Ray_Crystallography","1.9","False"
"P51610-1","region_3","6MA3","1082","1097","0.6","0.5","X-Ray_Crystallography","2.0","False"
"P51610-1","region_3","6MA4","1082","1097","0.6","0.5","X-Ray_Crystallography","2.0","False"
"P51610-1","region_3","6MA5","1082","1097","0.6","0.5","X-Ray_Crystallography","2.0","False"
"P51610-1","region_3","6MA2","1082","1097","0.6","0.5","X-Ray_Crystallography","2.1","False"
"P51610-1","region_3","6MA1","1082","1097","0.6","0.5","X-Ray_Crystallography","2.75","False"
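To help with the verification, here is a rough Python sketch that re-derives the expected selection per protein and region and compares it with `selection_flag`. It assumes `overlap_ratio` can stand in for the length criterion and that a lower `resolution` value is better; it ignores method and chain count, so treat it as a sanity check only. The function name `check_selection` is hypothetical, and the sample rows are copied from the excerpt above.

```python
import csv
import io
from collections import defaultdict

SAMPLE = '''\
"uniprotkb_canonical_ac","sequence_region","pdb_chain","start_pos","end_pos","overlap_ratio","overlap_category","experimental_method","resolution","selection_flag"
"P51610-1","region_1","4GO6","1806","2035","1.0","0.75","X-Ray_Crystallography","2.7","True"
"P51610-1","region_2","4GO6","360","402","1.0","0.75","X-Ray_Crystallography","2.7","True"
"P51610-1","region_3","4N3A","1072","1097","1.0","0.75","X-Ray_Crystallography","1.88","True"
"P51610-1","region_3","4N3B","1072","1097","1.0","0.75","X-Ray_Crystallography","2.17","False"
"P51610-1","region_3","4N39","1082","1097","0.6","0.5","X-Ray_Crystallography","1.76","False"
'''


def check_selection(csv_text):
    """Return (group, expected, flagged) for every group whose flags look wrong."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[(row["uniprotkb_canonical_ac"], row["sequence_region"])].append(row)

    mismatches = []
    for key, rows in groups.items():
        # Assumed ranking: largest overlap first, then lowest resolution value.
        best = min(rows, key=lambda r: (-float(r["overlap_ratio"]),
                                        float(r["resolution"])))
        flagged = [r["pdb_chain"] for r in rows if r["selection_flag"] == "True"]
        if flagged != [best["pdb_chain"]]:
            mismatches.append((key, best["pdb_chain"], flagged))
    return mismatches


print(check_selection(SAMPLE))
```

An empty list means every group has exactly one flagged row and it matches the assumed ranking.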
@katewarner ... I have added the chain information, please check the dataset file now.
Additional things: Based on the _protein_pdb_map.csv and _protein_xref_alphafolddb.csv dataset files, the pdb coordinate files are downloaded under downloads/pdb/current/. Since this download has to happen after the creation of these datasets, I will be the one to perform the download (just like I download medline and pubchem compound files). Please add this step of downloading PDB into the release protocol.
@sujeetvkulkarni ... I have added the "structures" section now:
@rykahsay The dataset looks good. I picked some of the accessions at random, and in every one the correct structure had been selected. I also made a ticket to update our download documentation.
@katewarner @rykahsay Can you please share an AlphaFold example?
@sujeetvkulkarni the display pattern in the dropdown for PDB files is: [pdb_id] (Amino acid: [start_pos] - [end_pos])
@sujeetvkulkarni The default display is PDB first. With multiple PDB entries, select the one with the smallest start position. If the start positions are the same, choose the one with the largest end position. If both are the same, choose one arbitrarily.
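A small Python sketch of the dropdown label and the default-entry rule described above, in case it helps. It assumes each PDB entry is a dict with `pdb_id`, `start_pos` and `end_pos` keys; that shape and both function names are hypothetical, not the actual frontend or API structure.

```python
def dropdown_label(entry):
    """Format an entry as: [pdb_id] (Amino acid: [start_pos] - [end_pos])."""
    return f"{entry['pdb_id']} (Amino acid: {entry['start_pos']} - {entry['end_pos']})"


def default_entry(entries):
    """Smallest start position wins; ties go to the largest end position.

    If both positions tie, min() keeps the first such entry, which serves
    as the arbitrary choice.
    """
    return min(entries, key=lambda e: (e["start_pos"], -e["end_pos"]))


# Positions taken from the P51610-1 rows in the dataset excerpt.
entries = [
    {"pdb_id": "4GO6", "start_pos": 1806, "end_pos": 2035},
    {"pdb_id": "4GO6", "start_pos": 360, "end_pos": 402},
    {"pdb_id": "4N3A", "start_pos": 1072, "end_pos": 1097},
]
print(dropdown_label(default_entry(entries)))
```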
@sujeetvkulkarni For the AlphaFold example, do you need just some protein accessions that have an AlphaFold structure?
@rykahsay do I need to create BCOs for the *_protein_pdb_map.csv datasets?
@katewarner Yes, protein accessions for which the GlyGen API returns an AlphaFold structure.
@sujeetvkulkarni Thank you. Here are some canonical accessions that have AlphaFold structures:
@katewarner These protein accessions have no structures:[{}] array and no AlphaFold URL.
@sujeetvkulkarni Do you mean within their .json objects? Sorry, I'm still learning where everything is.
@katewarner Yes, I meant that the GlyGen API responses below do not contain a structures:[{}] array or an AlphaFold URL, unlike what @rykahsay showed above. https://api.tst.glygen.org/protein/detail/Q5W0N0-1/ https://api.tst.glygen.org/protein/detail/A2RSJ4-1/ https://api.tst.glygen.org/protein/detail/Q54PT8-1/
@sujeetvkulkarni Ah I see, I think I may have jumped ahead. The entries have the AlphaFold cross-references, but I think as we discussed in the meeting yesterday, @rykahsay is going to add the AlphaFold structures array to the entries.
@katewarner ok, that's fine. When the AlphaFold url is added to the structures:[{}] array then please share such protein accessions with me.
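For checking accessions once the array is added, a minimal Python sketch that inspects an already-parsed protein detail response. It assumes only that the response is a JSON object with a top-level `structures` key holding a list of objects, as discussed in this thread; the function name is hypothetical and no field names inside each structure object are assumed.

```python
def has_structures(detail):
    """True if the parsed detail JSON carries at least one non-empty structure."""
    structures = detail.get("structures") or []
    # Treat a missing key, an empty list, or a list of empty objects as "none".
    return any(isinstance(s, dict) and len(s) > 0 for s in structures)


print(has_structures({"structures": [{}]}))   # placeholder array only
print(has_structures({}))                      # no structures key at all
```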
Provide a list of protein 3D structures as part of the protein details. PDB files are provided by #1355 and #1354. They need to be filtered based on the criteria developed by @jeet-vora and @rajamazumder. For each PDB file in the protein details JSON we need:
Dependencies:
1354
1355
1476
Blocker for:
1518
1489