glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

New dataset: human_proteoform_glycosylation_sites_pdc_ccrc.csv #1274

Open kmartinez834 opened 3 weeks ago

kmartinez834 commented 3 weeks ago

Source file: downloads/pdc/current/ccRCC_TMT_intact_glycopeptide_abundance_MD-MAD.tsv

Mapping files: unreviewed/human_protein_masterlist.csv, generated/misc/pdc_glytoucan_mapping.csv, unreviewed/*_protein_glycosylation_motifs.csv, misc/n_sequon_info.csv

Output files: human_proteoform_glycosylation_sites_pdc_ccrc.csv

The output file should have the following headers:

"uniprotkb_canonical_ac","glycosylation_site_uniprotkb","amino_acid","saccharide","glycosylation_type","xref_key","xref_id","start_pos","end_pos","start_aa","end_aa","site_seq","composition","abundance","biospecimen_id","source_tissue_id","source_tissue_name","n_sequon","n_sequon_type"
See the chart below for instructions on mapping source fields to output:

| Source field | Output field | Instructions |
| --- | --- | --- |
| Gene | uniprotkb_canonical_ac | Map to the canonical accession using human_protein_masterlist.csv fields "gene_name" and "uniprotkb_canonical_ac" |
| Glycosite | glycosylation_site_uniprotkb | Strip the first character (N) from "Glycosite"; the remaining digits are the glycosylation site |
| Glycosite | amino_acid | The first character of "Glycosite" is the one-letter amino acid symbol; if N, then "amino_acid" = Asn |
| Nglycan | saccharide | Map to the GlyTouCan accession using pdc_glytoucan_mapping.csv fields "Nglycan" and "glytoucan" |
| | glycosylation_type | All rows: "N-linked" |
| | xref_key | Each record should have a row with "protein_xref_pubmed" and a row with "protein_xref_pdc" |
| | xref_id | If "xref_key" is "protein_xref_pubmed" then "xref_id" = "37074911"; if "xref_key" is "protein_xref_pdc" then "xref_id" = "PDC000471" |
| | start_pos | Same as output field "glycosylation_site_uniprotkb" above |
| | end_pos | Same as output field "glycosylation_site_uniprotkb" above |
| | start_aa | Same as output field "amino_acid" above |
| | end_aa | Same as output field "amino_acid" above |
| Stripped_Sequence | site_seq | No change, copy directly from "Stripped_Sequence" |
| Nglycan | composition | No change, copy directly from "Nglycan" |
| CPT* | abundance | Columns that begin with "CPT" contain abundance data (unless the value is NA), see example below |
| CPT* | biospecimen_id | Name of the CPT column the abundance value came from, see example below |
| | source_tissue_id | All rows: "UBERON:0002113" |
| | source_tissue_name | All rows: "kidney" |
| | n_sequon | Map using *_protein_glycosylation_motifs.csv |
| | n_sequon_type | Map using misc/n_sequon_info.csv |
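
The mapping above could be sketched roughly as follows. This is only an illustration, not the actual GlyGen pipeline code: `expand_row`, `parse_glycosite`, the `keep_na` flag, and the two dictionary lookups (`gene_to_ac`, `glycan_to_glytoucan`) are hypothetical names standing in for human_protein_masterlist.csv and pdc_glytoucan_mapping.csv. The n_sequon and n_sequon_type fields are left out because the layout of their mapping files is not described in this ticket.

```python
# Sketch only: the lookups are hypothetical stand-ins for the mapping files.
AA3 = {"N": "Asn"}  # one-letter to three-letter code; only N is expected here

# Every record gets one output row per cross-reference (values from the ticket).
XREFS = [("protein_xref_pubmed", "37074911"), ("protein_xref_pdc", "PDC000471")]


def parse_glycosite(glycosite):
    """Split a value such as 'N542' into ('Asn', '542')."""
    return AA3[glycosite[0]], glycosite[1:]


def expand_row(row, gene_to_ac, glycan_to_glytoucan, keep_na=True):
    """Yield one output dict per CPT column and per xref for one source row."""
    amino_acid, pos = parse_glycosite(row["Glycosite"])
    base = {
        "uniprotkb_canonical_ac": gene_to_ac[row["Gene"]],
        "glycosylation_site_uniprotkb": pos,
        "amino_acid": amino_acid,
        "saccharide": glycan_to_glytoucan[row["Nglycan"]],
        "glycosylation_type": "N-linked",
        "start_pos": pos,
        "end_pos": pos,
        "start_aa": amino_acid,
        "end_aa": amino_acid,
        "site_seq": row["Stripped_Sequence"],
        "composition": row["Nglycan"],
        "source_tissue_id": "UBERON:0002113",
        "source_tissue_name": "kidney",
    }
    for column, value in row.items():
        if not column.startswith("CPT"):
            continue  # only columns that begin with "CPT" carry abundance data
        if value == "NA" and not keep_na:
            continue  # optional filter for missing abundance values
        for xref_key, xref_id in XREFS:
            yield {
                **base,
                "xref_key": xref_key,
                "xref_id": xref_id,
                "abundance": value,
                "biospecimen_id": column,  # the column name is the biospecimen
            }
```

Each source row could be read with `csv.DictReader(handle, delimiter="\t")` and passed to `expand_row`. `keep_na` defaults to `True` so that rows with abundance "NA" are retained; set it to `False` to drop them.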

Example:

Input file

$ head -2 /data/projects/glygen/downloads/pdc/current/ccRCC_TMT_intact_glycopeptide_abundance_MD-MAD.tsv 
Modifications   Gene    Stripped_Sequence       Glycosite       Nglycan CPT0079430001   CPT0023360001   CPT0023350003   CPT0079410003   CPT0087040003   CPT0077310003   CPT0077320001   CPT0087050003   CPT0002270011    NCI7-1  CPT0078840001   CPT0075570001   CPT0075560003   CPT0078830003   CPT0077490003   CPT0077500001   CPT0023690003   CPT0023710001   CPT0025060001   CPT0092290003   CPT0014130001   CPT0001230001    CPT0071150004   CPT0014160003   CPT0092310003   CPT0001220008   CPT0025050003   QC1     CPT0088500001   CPT0078510003   CPT0079510001   CPT0089020003   CPT0078530001   CPT0079480003   CPT0088480003    CPT0089040001   CPT0026410003   CPT0006440003   CPT0066470004   CPT0066430001   CPT0088970003   CPT0020020001   CPT0006530001   CPT0026420001   CPT0019990003   CPT0000790001   CPT0001550001    CPT0065450001   CPT0001540009   CPT0066480003   CPT0066520001   CPT0065430003   QC2     CPT0000780007   CPT0019130003   CPT0014350001   CPT0079000001   CPT0078990003   CPT0000870016   CPT0014370004    CPT0019160001   CPT0000890001   QC3     CPT0065820001   CPT0086360003   CPT0092800003   CPT0065810003   CPT0092790003   CPT0001500009   NCI7-2  CPT0086370003   CPT0001510001   NCI7-3  CPT0006950001    CPT0006900003   CPT0010120001   CPT0025610001   CPT0081600003   CPT0081620001   CPT0025580004   CPT0010110003   CPT0001180009   CPT0082010001   CPT0015910003   CPT0086870003   CPT0063330001   CPT0001190001    CPT0063320003   CPT0081990003   CPT0086890003   CPT0078660003   CPT0001340003   CPT0020130001   CPT0075170001   CPT0001350001   QC4     CPT0078670001   CPT0075130003   CPT0020120003   CPT0078930003    NCI7-4  CPT0089480003   CPT0000640003   CPT0078940001   CPT0088630003   CPT0089460004   CPT0000660001   CPT0088640003   CPT0084560003   CPT0084590001   CPT0007470001   CPT0065870003   CPT0086830003    CPT0069000003   CPT0007320003   CPT0086820003   CPT0069010001   CPT0002350011   CPT0063640001   NCI7-5  CPT0088780001   CPT0088760003   CPT0088710001   CPT0088690003   
CPT0002370001   CPT0063630003    CPT0064370003   CPT0010170001   CPT0064400001   CPT0086970003   CPT0092190003   CPT0092160003   CPT0010160003   QC5     CPT0086950003   CPT0088900003   CPT0079270003   CPT0088920001   CPT0079300001    CPT0088550004   QC6     CPT0014450004   CPT0088570001   CPT0014470001   CPT0006730001   CPT0069190001   CPT0092730003   CPT0092740003   QC7     CPT0006630003   CPT0025920001   CPT0025880003    CPT0069160003   CPT0007870001   CPT0001270001   CPT0077110003   QC8     CPT0077140001   CPT0001260009   CPT0076350001   CPT0076330003   CPT0007860003   CPT0079380003   CPT0015810003   CPT0086030003    CPT0085670003   CPT0025230003   CPT0065750003   CPT0015730003   CPT0078800003   CPT0079230003   CPT0025170003   CPT0025110003   CPT0025290003   CPT0081880003   CPT0075720003   CPT0065930003   CPT0025350003    CPT0065690003   CPT0011410003   CPT0024670003   CPT0024680001   CPT0012550003   CPT0013790003   CPT0012570003   CPT0012770003   CPT0012370003   CPT0011240003   CPT0079180003   CPT0012640003    CPT0012090003   CPT0018250001   CPT0012290003   CPT0017850003   CPT0012180003   CPT0012280003   CPT0012670003   CPT0012080003   CPT0021240003   CPT0009020003   CPT0017450001   CPT0009060003   CPT0012900004    CPT0017410003   CPT0009080003   CPT0012920003   CPT0009000003
n[TMT10plex]HEEGHMLNC[Carbamidomethyl]TC[Carbamidomethyl]FGQGR-N4H5F1S1G0       FN1     HEEGHMLnCTCFGQGR        N542    N4H5F1S1G0      14.12863483     14.08341759     15.49226162     15.10015675     13.23733345      12.93404645     14.82586239     13.72512866     13.96543842     NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA       NA      NA      14.24515438     13.85167066     13.91497243     13.9499045      14.26456562     14.71406602     14.55778476     14.72854397     13.7980878      NA      NA      NA      NA      NA       NA      NA      NA      NA      14.08127534     14.45163008     13.92618371     14.61823008     14.79534535     14.02875797     15.14877273     14.58275397     14.71911482     NA      NA      NA       NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      14.16086808     13.96473585     14.16176343     15.41824375     14.14337547      15.92210478     13.89564847     14.74816317     15.13393566     14.94342308     14.09560549     15.54460697     15.70019943     13.84894753     14.17369733     14.06806634     14.74985715     14.09854491      14.38976816     14.25802028     14.71239974     14.75380964     13.73840268     14.6249825      13.90496312     17.04792342     14.00273439     12.70860678     14.4214299      13.49916724      14.72918133     14.51872428     15.63669142     13.50951228     14.15798596     12.91701115     16.26418564     14.50845112     13.86440128     14.47067085     14.69813592     14.9156719      14.68091949      15.50781341     14.13837265     13.9285945      13.61798666     14.02121709     13.9499586      14.71585104     12.9398122      16.40636908     13.77310879     15.34931054     NA      NA       NA      NA      NA      NA      NA      NA      NA      15.32455626     14.23655604     12.99992774     14.40615408     15.42331026     15.04941769     13.81915768     13.79767537    
 14.56221863      14.52925805     14.1911972      14.39331257     13.4036519      14.04486226     NA      13.56440843     15.79361787     16.01455364     14.42946858     15.48912541     14.65218174     14.88435213.97514806     16.5821253      14.10701256     15.72975633     12.9746948      14.11900271     14.61096821     13.54073283     16.98309435     15.68012723     13.14170098     13.84698903     14.09733428      15.14352964     12.90985661     14.75812395     12.99124646     13.6559008      15.67886987     13.86917444     15.41116145     13.78625343     13.99981151     13.64574638     14.1046755      13.96961466      13.98838589     14.07387643     14.13654703     15.19373777     15.65156292     14.4242206      14.77307166     13.80607425     14.99745277     14.54411652     15.96208462     15.87677116      15.27428204     15.92128194     15.42722906     14.13236929     14.39347665     14.23334998     14.56369486     13.67542967     15.66272586     14.58793946     14.59447569     14.06205825

Output file

"uniprotkb_canonical_ac","glycosylation_site_uniprotkb","amino_acid","saccharide","glycosylation_type","xref_key","xref_id","start_pos","end_pos","start_aa","end_aa","site_seq","composition","abundance","biospecimen_id","source_tissue_id","source_tissue_name","n_sequon","n_sequon_type"
"P02751-15","542","Asn","G27058EU","N-linked","protein_xref_pubmed","37074911","542","542","Asn","Asn","HEEGHMLnCTCFGQGR","N4H5F1S1G0","14.12863483","CPT0079430001","UBERON:0002113","kidney","NCT","NXT"
"P02751-15","542","Asn","G27058EU","N-linked","protein_xref_pdc","PDC000471","542","542","Asn","Asn","HEEGHMLnCTCFGQGR","N4H5F1S1G0","14.12863483","CPT0079430001","UBERON:0002113","kidney","NCT","NXT"
"P02751-15","542","Asn","G27058EU","N-linked","protein_xref_pubmed","37074911","542","542","Asn","Asn","HEEGHMLnCTCFGQGR","N4H5F1S1G0","14.08341759","CPT0023360001","UBERON:0002113","kidney","NCT","NXT"
"P02751-15","542","Asn","G27058EU","N-linked","protein_xref_pdc","PDC000471","542","542","Asn","Asn","HEEGHMLnCTCFGQGR","N4H5F1S1G0","14.08341759","CPT0023360001","UBERON:0002113","kidney","NCT","NXT"
"P02751-15","542","Asn","G27058EU","N-linked","protein_xref_pubmed","37074911","542","542","Asn","Asn","HEEGHMLnCTCFGQGR","N4H5F1S1G0","15.49226162","CPT0023350003","UBERON:0002113","kidney","NCT","NXT"
"P02751-15","542","Asn","G27058EU","N-linked","protein_xref_pdc","PDC000471","542","542","Asn","Asn","HEEGHMLnCTCFGQGR","N4H5F1S1G0","15.49226162","CPT0023350003","UBERON:0002113","kidney","NCT","NXT"
"P02751-15","542","Asn","G27058EU","N-linked","protein_xref_pubmed","37074911","542","542","Asn","Asn","HEEGHMLnCTCFGQGR","N4H5F1S1G0","15.10015675","CPT0079410003","UBERON:0002113","kidney","NCT","NXT"
"P02751-15","542","Asn","G27058EU","N-linked","protein_xref_pdc","PDC000471","542","542","Asn","Asn","HEEGHMLnCTCFGQGR","N4H5F1S1G0","15.10015675","CPT0079410003","UBERON:0002113","kidney","NCT","NXT"
...

@ubhuiyan

kmartinez834 commented 2 weeks ago

@rykahsay instructions are complete now

rykahsay commented 2 weeks ago

@kmartinez834 ... why do we need to ignore rows if abundance="NA"? Aren't rows missing abundance info good enough?

rykahsay commented 2 weeks ago

Please check the dataset

kmartinez834 commented 2 weeks ago

OK to keep rows with abundance = "NA". Checking the dataset now.

kmartinez834 commented 2 weeks ago
rykahsay commented 2 weeks ago

That is done. Here is a more important issue: the abundance values are given at the "biospecimen_id" level, and multiple values exist for a given biospecimen_id. Our glycan data model, on the other hand, is designed to give abundance at the tissue level. This means that if this data is to fit into our existing glycan data model, we need to summarize the abundance values to the tissue level (for example, by taking the average). We need to discuss this in the general meeting.
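
One possible way to do the tissue-level summary is sketched here, purely as an illustration of the averaging idea and not as an agreed approach. The function name `summarize_by_tissue` and the choice of grouping key are assumptions. It takes a plain arithmetic mean of the values as given; the ticket does not state whether the values are log-transformed, so whether a plain mean is appropriate remains open. The three input values are the CPT0000640003 rows from the excerpt that follows.

```python
from collections import defaultdict
from statistics import mean

# Assumed grouping key: protein, site, glycan and tissue (biospecimen dropped).
KEY_FIELDS = (
    "uniprotkb_canonical_ac",
    "glycosylation_site_uniprotkb",
    "saccharide",
    "source_tissue_id",
)


def summarize_by_tissue(rows):
    """Average abundance over all biospecimens that share the same key."""
    grouped = defaultdict(list)
    for row in rows:
        if row["abundance"] == "NA":
            continue  # missing values cannot contribute to the mean
        key = tuple(row[field] for field in KEY_FIELDS)
        grouped[key].append(float(row["abundance"]))
    return {key: mean(values) for key, values in grouped.items()}


# Three per-biospecimen values for the same site, taken from this comment.
rows = [
    {
        "uniprotkb_canonical_ac": "P02751-15",
        "glycosylation_site_uniprotkb": "542",
        "saccharide": "G27058EU",
        "source_tissue_id": "UBERON:0002113",
        "biospecimen_id": "CPT0000640003",
        "abundance": abundance,
    }
    for abundance in ("13.28779152", "14.72918133", "16.01965604")
]
summary = summarize_by_tissue(rows)
```

Here `summary` holds a single entry keyed by `("P02751-15", "542", "G27058EU", "UBERON:0002113")`. A median or another statistic could be swapped in for `mean` once the meeting settles on a summary method.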

"uniprotkb_canonical_ac","glycosylation_site_uniprotkb","saccharide","source_tissue_id","source_tissue_name","biospecimen_id","abundance"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000640003","13.28779152"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000640003","14.72918133"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000640003","16.01965604"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000660001","12.09132859"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000660001","14.15798596"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000660001","16.13148149"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000780007","13.7580814"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000780007","14.71911482"
"P02751-15","542","G27058EU","UBERON:0002113","kidney","CPT0000780007","15.94426322"
kmartinez834 commented 5 days ago

Documenting the discussion we had last week here: other publication pages include multiple rows for expression (see the example below). We will see whether this causes problems when the publication object is created.

https://glygen.org/publication/DOI/10.1016/j.talanta.2020.121495#Expression