glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

Protein 3D Structures for protein page #1517

Open ReneRanzinger opened 3 weeks ago

ReneRanzinger commented 3 weeks ago

Provide a list of protein 3D structures as part of the protein details. PDB files are provided by #1355 and #1354. They need to be filtered based on the criteria developed by @jeet-vora and @rajamazumder. For each PDB file in the protein details JSON we need:

Dependencies:

Blocker for:

sujeetvkulkarni commented 3 weeks ago

@rykahsay Add a PDB section to the protein details API. Discuss with Raja whether to host the coordinates, and based on that decide whether to provide an API or a URL.

For glycan pdb api - https://api.glygen.org/glycan/pdb/G17689DH/

katewarner commented 3 weeks ago

@rykahsay

Here are the selection criteria for PDB Protein Structures from @jeet-vora and Raja, which you can use for filtering the downloaded PDB files.

Rules:

1) Length - Select the PDB accession/structure that contains the longest aa sequence.

2) Method - Structures resolved by the X-ray method should be shortlisted first. If no X-ray structure is available, NMR structures are to be selected.

3) Resolution - From the shortlisted X-ray structures, choose the one with the highest resolution (i.e., the lowest Å value). NMR structures do not have a resolution, so select the NMR structure with the longest sequence.

4) Number of chains - If two structures are identical on criteria 1, 2 and 3, choose the accession with the lower number of chains.

Let me know if you need any more information.
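
The four rules above can be read as a single sort key. The following is a minimal sketch of that reading, not the pipeline's actual code. The `PdbCandidate` record, its field names, and the PDB-like IDs are hypothetical, and it assumes the rules apply strictly in the order they are numbered (length first, then method, resolution, chain count).

```python
from dataclasses import dataclass
from typing import Iterable, Optional


# Hypothetical record: field names are illustrative, not GlyGen's schema.
@dataclass(frozen=True)
class PdbCandidate:
    pdb_id: str
    seq_length: int              # number of amino acids covered
    method: str                  # "X-ray" or "NMR"
    resolution: Optional[float]  # in angstroms; None for NMR
    n_chains: int


# Rule 2: X-ray structures are preferred over NMR structures.
METHOD_RANK = {"X-ray": 0, "NMR": 1}


def selection_key(c: PdbCandidate):
    """Smaller tuples win; assumes rules are applied in numbered order."""
    return (
        -c.seq_length,                                   # rule 1: longest sequence
        METHOD_RANK.get(c.method, 2),                    # rule 2: X-ray, then NMR
        c.resolution if c.resolution is not None else float("inf"),  # rule 3: lowest value in angstroms
        c.n_chains,                                      # rule 4: fewest chains
    )


def select_structure(candidates: Iterable[PdbCandidate]) -> PdbCandidate:
    return min(candidates, key=selection_key)


# Made-up candidates, purely for illustration.
candidates = [
    PdbCandidate("AAAA", 230, "X-ray", 2.7, 2),
    PdbCandidate("BBBB", 230, "X-ray", 1.9, 4),
    PdbCandidate("CCCC", 230, "NMR", None, 1),
    PdbCandidate("DDDD", 120, "X-ray", 1.2, 1),
]
print(select_structure(candidates).pdb_id)  # BBBB
```

If the method should take precedence over length, swapping the first two elements of the key would express that instead.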

rykahsay commented 2 weeks ago

I have created new datasets *_protein_pdb_map.csv (also included them in dataset-masterlist.json) ... the last column shows if the mapping is selected or not. Please check and verify that the selection is done properly.

"uniprotkb_canonical_ac","sequence_region","pdb_chain","start_pos","end_pos","overlap_ratio","overlap_category","experimental_method","resolution","selection_flag"
"P51610-1","region_1","4GO6","1806","2035","1.0","0.75","X-Ray_Crystallography","2.7","True"
"P51610-1","region_2","4GO6","360","402","1.0","0.75","X-Ray_Crystallography","2.7","True"
"P51610-1","region_3","4N3A","1072","1097","1.0","0.75","X-Ray_Crystallography","1.88","True"
"P51610-1","region_3","4N3B","1072","1097","1.0","0.75","X-Ray_Crystallography","2.17","False"
"P51610-1","region_3","4N3C","1072","1097","1.0","0.75","X-Ray_Crystallography","2.55","False"
"P51610-1","region_3","4N39","1082","1097","0.6","0.5","X-Ray_Crystallography","1.76","False"
"P51610-1","region_3","5LWV","1078","1095","0.68","0.5","X-Ray_Crystallography","1.9","False"
"P51610-1","region_3","6MA3","1082","1097","0.6","0.5","X-Ray_Crystallography","2.0","False"
"P51610-1","region_3","6MA4","1082","1097","0.6","0.5","X-Ray_Crystallography","2.0","False"
"P51610-1","region_3","6MA5","1082","1097","0.6","0.5","X-Ray_Crystallography","2.0","False"
"P51610-1","region_3","6MA2","1082","1097","0.6","0.5","X-Ray_Crystallography","2.1","False"
"P51610-1","region_3","6MA1","1082","1097","0.6","0.5","X-Ray_Crystallography","2.75","False"

rykahsay commented 2 weeks ago

@katewarner ... I have added the chain information, please check the dataset file now.

Additional things: Based on the _protein_pdb_map.csv and _protein_xref_alphafolddb.csv dataset files, the pdb coordinate files are downloaded under downloads/pdb/current/. Since this download has to happen after the creation of these datasets, I will be the one to perform the download (just like I download medline and pubchem compound files). Please add this step of downloading PDB into the release protocol.

rykahsay commented 1 week ago

@sujeetvkulkarni ... I have added the "structures" section now:

[image]

katewarner commented 1 week ago

@rykahsay The dataset looks good. I picked some of the accessions at random and in every one the correct structure had been selected. I also made a ticket to update our download documentation.
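
A spot check like this could also be scripted. The sketch below uses the column names from the `*_protein_pdb_map.csv` header shown earlier and a few of the sample rows. The ranking it applies (higher `overlap_category` first, then lower `resolution`) is an assumption inferred from those sample rows, not confirmed pipeline logic, and `check_selection` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the sample posted in this thread.
SAMPLE = '''"uniprotkb_canonical_ac","sequence_region","pdb_chain","start_pos","end_pos","overlap_ratio","overlap_category","experimental_method","resolution","selection_flag"
"P51610-1","region_1","4GO6","1806","2035","1.0","0.75","X-Ray_Crystallography","2.7","True"
"P51610-1","region_3","4N3A","1072","1097","1.0","0.75","X-Ray_Crystallography","1.88","True"
"P51610-1","region_3","4N3B","1072","1097","1.0","0.75","X-Ray_Crystallography","2.17","False"
"P51610-1","region_3","4N39","1082","1097","0.6","0.5","X-Ray_Crystallography","1.76","False"
'''


def rank(row):
    # Assumed ranking: best overlap category, then lowest resolution value.
    resolution = float(row["resolution"]) if row["resolution"] else float("inf")
    return (-float(row["overlap_category"]), resolution)


def check_selection(csv_text):
    """Return (group, flagged_ids, expected_id) for each group that disagrees."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[(row["uniprotkb_canonical_ac"], row["sequence_region"])].append(row)

    problems = []
    for key, rows in groups.items():
        flagged = [r["pdb_chain"] for r in rows if r["selection_flag"] == "True"]
        expected = min(rows, key=rank)["pdb_chain"]
        if flagged != [expected]:
            problems.append((key, flagged, expected))
    return problems


print(check_selection(SAMPLE))  # [] means every group matches the assumed ranking
```

An empty result only shows agreement with the assumed ranking; it does not replace checking the rows against the agreed selection criteria.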

sujeetvkulkarni commented 6 days ago

@katewarner @rykahsay Can you please share an AlphaFold example?

ReneRanzinger commented 5 days ago

@sujeetvkulkarni the display pattern in the dropdown for PDB files is: [pdb_id] (Amino acid: [start_pos] - [end_pos])

ReneRanzinger commented 5 days ago

@sujeetvkulkarni The default display is PDB first. With multiple PDB entries, select the one with the smallest start position. If the start positions are the same, choose the one with the largest end position. If both are the same, choose one arbitrarily.
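
The label pattern and the default-selection rule from the two comments above could be sketched as follows. `PdbEntry`, `dropdown_label`, and `default_entry` are hypothetical names for illustration, and the frontend itself may well be written in another language. The entries reuse positions from the sample dataset rows posted earlier.

```python
from typing import NamedTuple, Sequence


class PdbEntry(NamedTuple):
    pdb_id: str
    start_pos: int
    end_pos: int


def dropdown_label(entry: PdbEntry) -> str:
    # Pattern from the thread: [pdb_id] (Amino acid: [start_pos] - [end_pos])
    return f"{entry.pdb_id} (Amino acid: {entry.start_pos} - {entry.end_pos})"


def default_entry(entries: Sequence[PdbEntry]) -> PdbEntry:
    # Smallest start position first; on a tie, largest end position.
    # min() keeps the first of fully tied entries, which serves as the
    # "arbitrary" choice.
    return min(entries, key=lambda e: (e.start_pos, -e.end_pos))


entries = [
    PdbEntry("4GO6", 1806, 2035),
    PdbEntry("4GO6", 360, 402),
    PdbEntry("4N3A", 1072, 1097),
]
print(dropdown_label(default_entry(entries)))  # 4GO6 (Amino acid: 360 - 402)
```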

katewarner commented 4 days ago

@sujeetvkulkarni For the AlphaFold example, do you need just some protein accessions that have an AlphaFold structure?

katewarner commented 4 days ago

@rykahsay do I need to create BCOs for the *_protein_pdb_map.csv datasets?

sujeetvkulkarni commented 4 days ago

> @sujeetvkulkarni For the AlphaFold example, do you need just some protein accessions that have an AlphaFold structure?

@katewarner yes, protein accession where GlyGen api returns AlphaFold structure.

katewarner commented 4 days ago

@sujeetvkulkarni Thank you. Here are some canonical accessions that have AlphaFold structures:

sujeetvkulkarni commented 4 days ago

@katewarner These protein accessions have no structures:[{}] array and no AlphaFold URL.

katewarner commented 4 days ago

@sujeetvkulkarni Do you mean within their .json objects? Sorry, I'm still learning where everything is.

sujeetvkulkarni commented 4 days ago

@katewarner Yes, I meant that the GlyGen API responses for the URLs below do not contain a structures:[{}] array or an AlphaFold URL, unlike what @rykahsay showed above. https://api.tst.glygen.org/protein/detail/Q5W0N0-1/ https://api.tst.glygen.org/protein/detail/A2RSJ4-1/ https://api.tst.glygen.org/protein/detail/Q54PT8-1/
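
For checking a batch of accessions, a small helper along these lines could tell whether an already-parsed protein detail response carries a usable structures array. Only the top-level key name `structures` is taken from this thread. The function name and the shape of the individual items are assumptions, so the check deliberately looks no deeper than "is there at least one non-empty item".

```python
def has_structures(detail: dict) -> bool:
    """True if the parsed detail JSON has a non-empty 'structures' array.

    Assumes 'structures' is a top-level list of objects; the fields inside
    each object are not inspected because they are not specified here.
    """
    structures = detail.get("structures")
    return isinstance(structures, list) and any(structures)


print(has_structures({"uniprotkb_canonical_ac": "Q5W0N0-1"}))  # False
print(has_structures({"structures": [{}]}))                     # False
```

The `detail` dictionary would come from decoding the JSON returned by one of the protein detail URLs above.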

katewarner commented 4 days ago

@sujeetvkulkarni Ah I see, I think I may have jumped ahead. The entries have the AlphaFold cross-references, but I think as we discussed in the meeting yesterday, @rykahsay is going to add the AlphaFold structures array to the entries.

sujeetvkulkarni commented 4 days ago

@katewarner OK, that's fine. When the AlphaFold URL is added to the structures:[{}] array, please share such protein accessions with me.