Closed kmartinez834 closed 3 months ago
@rykahsay instructions are complete now
@kmartinez834 ... why do we need to ignore rows if abundance="NA"? Aren't rows missing abundance info good enough?
Please check the dataset
OK to keep rows with abundance = "NA". Checking the dataset now.
Are "src_xref_key","src_xref_id" required? And would it be more efficient to just have one row with xref_key and src_xref_key rather than two rows for each entry?
"xref_key","xref_id","src_xref_key","src_xref_id"
"protein_xref_pubmed","37074911","protein_xref_pdc","PDC000471"
~"protein_xref_pdc","PDC000471","protein_xref_pdc","PDC000471"~
For the API, does it matter if "start_aa","end_aa" = "N" or "Asn" ?
That is done. Here is a more important issue: the abundance values are given at the "biospecimen_id" level, and multiple values exist for a given biospecimen_id. On the other hand, our glycan data model is designed to give abundance at the tissue level. This means that, if this data is to fit into our existing glycan data model, we need to summarize the abundance levels to the tissue level (for example, we can take the average). We need to discuss this in the general meeting.
"uniprotkb_canonical_ac","glycosylation_site_uniprotkb","saccharide","source_tissue_id","source_tissue_name","biospecimen_id","abundance"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000640003","13.28779152"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000640003","14.72918133"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000640003","16.01965604"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000660001","12.09132859"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000660001","14.15798596"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000660001","16.13148149"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000780007","13.7580814"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000780007","14.71911482"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000780007","15.94426322"
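For the general meeting, here is a minimal sketch of what "summarize to the tissue level" could look like if we go with the plain average. The function name `summarize_by_tissue` and the choice of grouping key (protein, site, glycan, tissue) are assumptions for illustration, not the actual pipeline code. Rows with abundance "NA" stay in the dataset but are left out of the mean, since they carry no number.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# First rows of the sample above, two biospecimens for the same site/glycan/tissue.
SAMPLE = '''"uniprotkb_canonical_ac","glycosylation_site_uniprotkb","saccharide","source_tissue_id","source_tissue_name","biospecimen_id","abundance"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000640003","13.28779152"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000640003","14.72918133"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000660001","12.09132859"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000660001","14.15798596"
'''


def summarize_by_tissue(csv_text):
    """Average abundance per (protein, site, glycan, tissue), ignoring biospecimen_id."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Non-numeric abundance ("NA" or empty) cannot contribute to a mean.
        if row["abundance"] in ("", "NA"):
            continue
        key = (
            row["uniprotkb_canonical_ac"],
            row["glycosylation_site_uniprotkb"],
            row["saccharide"],
            row["source_tissue_id"],
        )
        groups[key].append(float(row["abundance"]))
    return {key: mean(values) for key, values in groups.items()}


if __name__ == "__main__":
    for key, avg in summarize_by_tissue(SAMPLE).items():
        print(key, round(avg, 4))
```

Whether a simple mean is the right summary (versus a median, or averaging per biospecimen first) is exactly the open question; the sketch only shows the grouping.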
Documenting the discussion we had last week here: other publication pages include multiple rows for expression (see the example below). We will see if this causes problems when the publication object is created.
https://glygen.org/publication/DOI/10.1016/j.talanta.2020.121495#Expression
Missing page for https://glygen.org/publication/PubMed/37074911
check now
Publication page is missing Glycosylation and Expression sections (from human_proteoform_glycosylation_sites_pdc_ccrc.csv)
check now:
Issue with pagination, not sure if this is front end or back end... When I click on any other page, it gives no results:
@sujeetvkulkarni can you check if the pagination information you send to the server is correct for this publication page?
@sujeetvkulkarni ... can you please give me the query you are sending? The following queries are working:
$ http POST :4042/publication/detail/ < tests/examples/publication/publication_detail.page.1.json
$ http POST :4042/publication/detail/ < tests/examples/publication/publication_detail.page.2.json
Contents of query files:
$ cat tests/examples/publication/publication_detail.page.1.json
{
  "id":"30379171",
  "type":"PubMed",
  "paginated_tables":[
    {"table_id":"glycosylation_reported_with_glycan","offset":1,"limit":20,"sort":"start_pos","order":"asc"}
  ]
}
$ cat tests/examples/publication/publication_detail.page.2.json
{
  "id":"30379171",
  "type":"PubMed",
  "paginated_tables":[
    {"table_id":"glycosylation_reported_with_glycan","offset":2,"limit":20,"sort":"start_pos","order":"asc"}
  ]
}
Sample query: @rykahsay We use the pagination API for server-side paginated tables. API: https://api.glygen.org/pagination/page/ and the payload is below.
{
  "record_type": "publication",
  "table_id": "glycosylation_reported_with_glycan",
  "record_id": "37074911",
  "offset": 21,
  "limit": 20,
  "order": "asc",
  "sort": "start_pos"
}
Response:
{
  "query": {
    "record_type": "publication",
    "table_id": "glycosylation_reported_with_glycan",
    "record_id": "37074911",
    "offset": 21,
    "limit": 20,
    "order": "asc",
    "sort": "start_pos"
  },
  "results": []
}
The results should not be an empty array.
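To make the offset arithmetic explicit, here is a small sketch of how a page number could map to this payload. It assumes `offset` is the 1-based index of the first row on the page, which is what offset 21 with limit 20 for the second page suggests; the helper name `page_payload` is made up for illustration and is not part of the front end or the API.

```python
import json


def page_payload(record_id, table_id, page, limit=20, sort="start_pos", order="asc"):
    """Build a /pagination/page/ payload for a 1-based page number.

    Assumption: offset is the 1-based index of the first row on the page,
    so page 1 starts at offset 1 and page 2 at offset limit + 1.
    """
    if page < 1:
        raise ValueError("page must be >= 1")
    return {
        "record_type": "publication",
        "table_id": table_id,
        "record_id": record_id,
        "offset": (page - 1) * limit + 1,
        "limit": limit,
        "order": order,
        "sort": sort,
    }


if __name__ == "__main__":
    payload = page_payload("37074911", "glycosylation_reported_with_glycan", page=2)
    print(json.dumps(payload, indent=2))
```

If the server instead treats `offset` as a page index or a 0-based row index, the formula changes, so it is worth confirming which convention the back end expects.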
Please check now:
$ cat tests/temp/q.json
{
  "record_type": "publication",
  "table_id": "glycosylation_reported_with_glycan",
  "record_id": "37074911",
  "offset": 21,
  "limit": 20,
  "order": "asc",
  "sort": "start_pos"
}
$ http POST :4442/pagination/page/ < tests/temp/q.json
HTTP/1.1 200 OK
Connection: close
Content-Length: 10827
Content-Type: application/json
Date: Thu, 29 Aug 2024 18:10:52 GMT
Server: gunicorn
{
  "query": {
    "limit": 20,
    "offset": 21,
    "order": "asc",
    "record_id": "37074911",
    "record_type": "publication",
    "sort": "start_pos",
    "table_id": "glycosylation_reported_with_glycan"
  },
  "results": [
    {
      "comment": "",
      "end_aa": "N",
      "end_pos": 4,
      "evidence": [
        {
          "database": "PubMed",
          "id": "37074911",
          "url": "https://pubmed.ncbi.nlm.nih.gov/37074911"
        },
        {
          "database": "PDC",
          "id": "PDC000471",
          "url": "https://proteomic.datacommons.cancer.gov/pdc/study/PDC000471"
        }
      ],
      "glytoucan_ac": "G12341GU",
      "relation": "attached",
      "residue": "Asn",
      "site_category": "reported_with_glycan",
      "site_lbl": "Asn4",
      "site_seq": "MDPnCSCAAGDSCTCAGSCK",
      "start_aa": "N",
      "start_pos": 4,
      "subtype": "",
      "type": "N-linked",
@rykahsay the total result count is 40463, so I think pagination is a bit slow, but it's working.
Adding example label for publication id with large glycosylation table (40463 count) : 37074911.
Source file: downloads/pdc/current/ccRCC_TMT_intact_glycopeptide_abundance_MD-MAD.tsv
Mapping files: unreviewed/human_protein_masterlist.csv generated/misc/pdc_glytoucan_mapping.csv unreviewed/*_protein_glycosylation_motifs.csv misc/n_sequon_info.csv
Output files: human_proteoform_glycosylation_sites_pdc_ccrc.csv
The output file should have the following headers:
If N then "amino_acid" = Asn
If "xref_key" is "protein_xref_pdc" then "xref_id" = "PDC000471"
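The two rules above could be sketched as simple lookups. N is the one-letter code for asparagine (Asn), which matches the "start_aa": "N" / "residue": "Asn" pair in the API response. The dictionary and function names here are hypothetical, and only the values that appear in this thread are filled in.

```python
# One-letter to three-letter amino acid codes; only N is needed for N-linked sites.
AA_ONE_TO_THREE = {"N": "Asn"}

# xref_key -> xref_id for this dataset (values taken from this thread).
XREF_IDS = {
    "protein_xref_pdc": "PDC000471",
    "protein_xref_pubmed": "37074911",
}


def amino_acid_for(code):
    """Return the three-letter name for a one-letter residue code."""
    return AA_ONE_TO_THREE[code]


def xref_id_for(xref_key):
    """Return the fixed xref_id that goes with a given xref_key."""
    return XREF_IDS[xref_key]


if __name__ == "__main__":
    print(amino_acid_for("N"), xref_id_for("protein_xref_pdc"))
```

Unknown codes or keys raise `KeyError` on purpose, so an unexpected residue or xref shows up during generation instead of producing a silently wrong row.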
Example:
Input file
Output file
@ubhuiyan