Open kmartinez834 opened 3 weeks ago
@rykahsay instructions are complete now
@kmartinez834 ... why do we need to ignore rows if abundance="NA"? Aren't rows missing abundance info good enough?
Please check the dataset
OK to keep rows with abundance="NA". Checking the dataset now.
Are "src_xref_key","src_xref_id" required? And would it be more efficient to just have one row with xref_key and src_xref_key rather than two rows for each entry?
"xref_key","xref_id","src_xref_key","src_xref_id"
"protein_xref_pubmed","37074911","protein_xref_pdc","PDC000471"
~"protein_xref_pdc","PDC000471","protein_xref_pdc","PDC000471"~
For the API, does it matter whether "start_aa" and "end_aa" are given as "N" or "Asn"?
That is done. Here is a more important issue: the abundance values are given at the "biospecimen_id" level, and multiple values exist for a given biospecimen_id. On the other hand, our glycan data model is designed to give abundance at the tissue level. This means that, if this data is to fit into our existing glycan data model, we need to summarize the abundance levels to the tissue level (for example, we can take the average). We need to discuss this in the general meeting.
"uniprotkb_canonical_ac","glycosylation_site_uniprotkb","saccharide","source_tissue_id","source_tissue_name","biospecimen_id","abundance"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000640003","13.28779152"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000640003","14.72918133"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000640003","16.01965604"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000660001","12.09132859"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000660001","14.15798596"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000660001","16.13148149"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000780007","13.7580814"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000780007","14.71911482"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000780007","15.94426322"
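To make the averaging option concrete, here is a minimal sketch of collapsing the rows above to one value per tissue. It is only an illustration for the discussion, not an agreed method: the function name `summarize_by_tissue` is hypothetical, the grouping key (protein, site, saccharide, tissue ID) is an assumption, and skipping "NA" values when averaging is also an assumption.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# The sample rows from this comment, copied verbatim.
SAMPLE = '''"uniprotkb_canonical_ac","glycosylation_site_uniprotkb","saccharide","source_tissue_id","source_tissue_name","biospecimen_id","abundance"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000640003","13.28779152"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000640003","14.72918133"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000640003","16.01965604"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000660001","12.09132859"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000660001","14.15798596"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000660001","16.13148149"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000780007","13.7580814"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000780007","14.71911482"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000780007","15.94426322"
'''


def summarize_by_tissue(rows):
    """Average abundance per (protein, site, saccharide, tissue).

    Hypothetical helper: the grouping key and the handling of "NA"
    (skipped here) are assumptions, not decisions from this thread.
    """
    groups = defaultdict(list)
    for row in rows:
        value = row["abundance"]
        if value == "NA":
            continue  # assumption: missing values do not enter the mean
        key = (
            row["uniprotkb_canonical_ac"],
            row["glycosylation_site_uniprotkb"],
            row["saccharide"],
            row["source_tissue_id"],
        )
        groups[key].append(float(value))
    return {key: mean(values) for key, values in groups.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
summary = summarize_by_tissue(rows)
for key, value in summary.items():
    print(key, round(value, 4))
```

With this grouping, the nine sample rows collapse to a single kidney-level value. Averaging per biospecimen_id first and then across biospecimens would be a different choice and may give a different result when biospecimens have unequal numbers of measurements.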
Documenting the discussion we had last week here: other publication pages include multiple rows for expression (see example below). We will see if this causes problems when the publication object is created.
https://glygen.org/publication/DOI/10.1016/j.talanta.2020.121495#Expression
Source file: downloads/pdc/current/ccRCC_TMT_intact_glycopeptide_abundance_MD-MAD.tsv
Mapping files: unreviewed/human_protein_masterlist.csv generated/misc/pdc_glytoucan_mapping.csv unreviewed/*_protein_glycosylation_motifs.csv misc/n_sequon_info.csv
Output files: human_proteoform_glycosylation_sites_pdc_ccrc.csv
The output file should have the following headers:
If N, then "amino_acid" = Asn
If "xref_key" is "protein_xref_pdc" then "xref_id" = "PDC000471"
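The two rules above could be applied per output row roughly as follows. This is a sketch under assumptions: the helper name `apply_rules` is hypothetical, it assumes the residue arrives in the "amino_acid" field as the one-letter code "N", and it uses Asn, the standard three-letter code for asparagine (N).

```python
PDC_STUDY_ID = "PDC000471"  # xref_id given in this issue for protein_xref_pdc


def apply_rules(row):
    """Return a copy of `row` with the two field rules applied.

    Hypothetical helper; only the field names "amino_acid",
    "xref_key" and "xref_id" come from this issue.
    """
    out = dict(row)
    # Rule 1: one-letter N (asparagine) becomes the three-letter code Asn.
    if out.get("amino_acid") == "N":
        out["amino_acid"] = "Asn"
    # Rule 2: PDC cross-references always point at the study ID.
    if out.get("xref_key") == "protein_xref_pdc":
        out["xref_id"] = PDC_STUDY_ID
    return out


example = {"amino_acid": "N", "xref_key": "protein_xref_pdc", "xref_id": ""}
print(apply_rules(example))
```

Rows with any other "xref_key" (for example "protein_xref_pubmed") keep their existing "xref_id" unchanged.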
Example:
Input file
Output file
@ubhuiyan