glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

Proposed headers and instructions for the user dataset #1566

Closed katewarner closed 2 months ago

katewarner commented 3 months ago

@jeet-vora @kmartinez834 Please review my draft instructions for creating the user dataset.

For now we are only going to extract the manually verified ("unambiguous") sites and c-mannosylation data from the user data file. The c-mannosylation data is all ambiguous and difficult to explain easily, so I've added those sites at the bottom of this ticket for Robel to integrate into the dataset.

Draft ticket

Source = 1-s2.0-S1535947624000070-mmc7.tsv
Output = human_proteoform_glycosylation_sites_platelet_olinked.csv

Mapping Files:
unreviewed/human_protein_masterlist.csv
unreviewed/human_protein_allsequences.fasta
misc/platelet_olinked_mapping.csv

The output file should have the following headers:

"uniprotkb_canonical_ac", "glycosylation_site_uniprotkb", "amino_acid", "saccharide", "glycosylation_type", "xref_key", "xref_id", "src_xref_key", "src_xref_id", "glycosylation_subtype", "composition", "cell_id", "cell_name", "site_type", "Notes", "start_pos", "end_pos", "start_aa", "end_aa", "site_seq", "peptide", "peptide_start_pos", "peptide_end_pos"

Instructions for Robel:

For this release we are only going to extract the manually verified ("unambiguous") sites and c-mannosylation data from the user data file, so ignore all the rows that do not have "Unambiguous" in the "Glycosylation site Localisation Assignment" column.
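A minimal sketch of that row filter, assuming the source file is tab-separated with a header row and that the value is spelled "Unambiguous" as in the example rows. The helper name `unambiguous_rows` is hypothetical, and the second sample row is made up purely to show a rejected row.

```python
import csv
import io

LOCALISATION_COLUMN = "Glycosylation site Localisation Assignment"


def unambiguous_rows(handle):
    """Yield only the rows whose localisation assignment is 'Unambiguous'."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Compare case-insensitively in case the spelling varies in the file.
        if row[LOCALISATION_COLUMN].strip().lower() == "unambiguous":
            yield row


# Two-column stand-in for the real file; the "Ambiguous" row is invented.
sample = io.StringIO(
    "Peptide\tGlycosylation site Localisation Assignment\n"
    "R.GKNWC[+57.021]AYVHT[+146.058]R.L\tUnambiguous\n"
    "K.AAAS[+203.079]TK.L\tAmbiguous\n"
)
kept = list(unambiguous_rows(sample))
```

With the real file, the same function could be fed an open file handle instead of the in-memory sample.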

The instructions below are for extracting the unambiguous sites which are all Ser and Thr glycosylation sites.

The c-mannosylation data is all ambiguous and is spread across a few tables and the manuscript. It will be easier to provide instructions for these sites when we also start to extract all the ambiguous data from the dataset, so for this release I've added those sites at the bottom of this ticket for you to integrate straight into the dataset.

Once the glycosylation site position and amino acid have been extracted, please check that they are correct by mapping them to the proteins in our human_protein_allsequences.fasta file.
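One way this check could look, as a sketch only. It assumes FASTA headers of the form `>sp|ACCESSION|NAME ...`, which may not match the layout of human_protein_allsequences.fasta, so the key extraction may need adjusting. The function names and the toy protein are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA lines into {accession: sequence}.

    Assumption: the accession is the text between the first two pipes of
    the header; otherwise the first word of the header is used.
    """
    chunks = {}
    accession = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split("|")
            accession = parts[1] if len(parts) > 1 else parts[0].split()[0]
            chunks[accession] = []
        elif accession is not None:
            chunks[accession].append(line)
    return {ac: "".join(seq) for ac, seq in chunks.items()}


def site_matches(sequences, accession, position, residue):
    """True if the 1-based position in the protein holds the expected residue."""
    sequence = sequences.get(accession)
    if sequence is None or not 1 <= position <= len(sequence):
        return False
    return sequence[position - 1] == residue


# Made-up protein used only to exercise the check.
toy = read_fasta([">sp|X00001-1|TOY_HUMAN made-up protein", "MKTS", "GT"])
```

Rows for which `site_matches` returns False would be worth flagging for manual review rather than silently dropping.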

Source Field Output Field Notes
Protein Name uniprotkb_canonical_ac Extract UniProt AC string between the two pipes and map to canonical ac using human_protein_masterlist.csv field "uniprotkb_canonical_ac" e.g. "P07996-1" from >>sp|P07996|TSP1_HUMAN Thrombospondin-1 OS=Homo sapiens OX=9606 GN=THBS1 PE=1 SV=2
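Peptide glycosylation_site_uniprotkb For "S" or "T" unambiguous sites, please calculate the site position from the "Peptide" and "Starting position" source fields. In the "Peptide" source field the peptide sequence lies between the first and last full stops, and the glycosylation site(s) are each "S" or "T" immediately followed by a square bracket; bracketed modifications on other residues, such as C[+57.021], are not glycosylation sites. The site position is "Starting position" plus the number of residues that precede the site in the peptide, e.g. in R.GKNWC[+57.021]AYVHT[+146.058]R.L the modified "T" is the 10th residue and "Starting position" is 207, giving 207 + 9 = 216.
Peptide amino_acid If the residue at the glycosylation site is "S" add "Ser", or if it is "T" add "Thr".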
Glycans NHFAGNa saccharide For these compositions in Glycans NHFAGNa, I've created a mapping file to their GlyTouCan ID: misc/platelet_olinked_mapping.csv
  glycosylation_type If "amino_acid" output field is "Ser" or "Thr" add "O-linked"
  xref_key All rows: "protein_xref_pubmed"
  xref_id All rows: "38237698"
  src_xref_key All rows: "protein_xref_glygen_ds"
  src_xref_id All rows: "GLY_001051"
Glycans NHFAGNa glycosylation_subtype If composition begins with "Fuc" add "O-fucosylation" otherwise leave blank
Glycans NHFAGNa composition If the "glycosylation_type" output field is "O-linked", copy directly from the source field
  cell_id All rows: "CL:0000233"
  cell_name All rows: "platelet"
Glycosylation site Localisation Assignment site_type All rows: "known"
  Notes If "uniprotkb_canonical_ac" output field is Q13201-1, add the following note: "Thrombin-activated platelet releasate proteins were found to be enriched for a wide range of O-glycan modifications. Mutation of O-fucosylation sites within the EMI domain of MMRN1 affects secretion: T216A reduces secretion to at least 50%, whereas mutation of T1055A almost abolishes secretion. Fucosylation of these sites is carried out either by POFUT1 or a novel POFUT, but it is not carried out by POFUT2. Fucosylation of MMRN1 at T216, may represent a new POFUT1 O-fucosylation motif (C1-X-X-X-X-T-X) that is missing the typical C-terminal cysteine residue." If "uniprotkb_canonical_ac" output field is not Q13201-1, add the following note: "Thrombin-activated platelet releasate proteins were found to be enriched for a wide range of O-glycan modifications."
  start_pos Start_pos is the same as "glycosylation_site_uniprotkb"
  end_pos End_pos is the same as "glycosylation_site_uniprotkb"
  start_aa Same as "amino_acid" output field.
  end_aa Same as "amino_acid" output field.
  site_seq See "amino_acid" output field. If output is "Ser" add "S", or if "Thr" add "T"
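
The site calculation in the table above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the function name `glyco_sites` is hypothetical, and it assumes "Starting position" is the 1-based protein position of the first residue between the full stops, which matches the three example rows.

```python
import re

# One-letter to three-letter codes for the residues handled in this release.
THREE_LETTER = {"S": "Ser", "T": "Thr"}


def glyco_sites(peptide, starting_position):
    """Return (position, amino_acid, site_seq) for each modified S or T.

    `peptide` is the raw "Peptide" field, e.g.
    'R.GKNWC[+57.021]AYVHT[+146.058]R.L'.
    """
    # Drop the flanking residues: keep what lies between the first and last
    # full stops (mass values inside brackets also contain full stops).
    core = peptide.split(".", 1)[1].rsplit(".", 1)[0]
    sites = []
    offset = 0  # zero-based index of the current residue in the bare peptide
    for match in re.finditer(r"([A-Z])(\[[^\]]*\])?", core):
        residue, modification = match.group(1), match.group(2)
        # Only bracketed S/T count; C[+57.021] and similar are skipped.
        if modification and residue in THREE_LETTER:
            sites.append(
                (starting_position + offset, THREE_LETTER[residue], residue)
            )
        offset += 1
    return sites


mmrn1 = glyco_sites("R.GKNWC[+57.021]AYVHT[+146.058]R.L", 207)
fibb = glyco_sites("R.KGGET[+162.053]SEMYLIQPDSSVKPYR.V", 247)
```

The returned tuple supplies glycosylation_site_uniprotkb (also start_pos and end_pos), amino_acid (also start_aa and end_aa) and site_seq for one output row.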

Example

Input file: downloads/user_submission/platelet_olinked/current/1-s2.0-S1535947624000070-mmc7.tsv

"Peptide"   "Glycans NHFAGNa"   "Glycosylation site Localisation Assignment"    "Modification Type(s)"  "|Log Prob|"    "Delta Mod" "Protein Name"  "Observed m/z"  "z" "Observed (M+H)"    "Calc. mass (M+H)"  "Off-by-x error"    "Mass error (ppm)"  "Starting position" "Cleavage"  "Score" "Delta" "# of unique peptides"  "Protein DB number" "Comment"   "Scan #"    "Scan Time"
"R.KGGET[+162.053]SEMYLIQPDSSVKPYR.V"   "Hex(1)"    "Unambiguous"   "T[+162]"   "9" "426"   ">>sp|P02675|FIBB_HUMAN Fibrinogen beta chain OS=Homo sapiens OX=9606 GN=FGB PE=1 SV=2" "637"   "4" "2547"  "2547"  "0" "0" "247"   "Specific"  "426"   "426"   "101"   "2107"  "scan=146344"   "scan=146344"   "46"
"K.C[+57.021]GAC[+57.021]PPGYS[+203.079]GNGIQC[+57.021]TDVDEC[+57.021]KEVPDAC[+57.021]FNHNGEHR.C"   "HexNAc(1)" "Unambiguous"   "C[+57]*5, S[+203]" "7" "259"   ">>sp|P07996|TSP1_HUMAN Thrombospondin-1 OS=Homo sapiens OX=9606 GN=THBS1 PE=1 SV=2"    "1078"  "4" "4310"  "4310"  "0" "0" "572"   "Specific"  "259"   "259"   "175"   "2657"  "scan=144741"   "scan=144741"   "41"
"R.GKNWC[+57.021]AYVHT[+146.058]R.L"    "Fuc(1)"    "Unambiguous"   "C[+57], T[+146]"   "4" "221"   ">>sp|Q13201|MMRN1_HUMAN Multimerin-1 OS=Homo sapiens OX=9606 GN=MMRN1 PE=1 SV=3"   "513"   "3" "1537"  "1537"  "0" "0" "207"   "Specific"  "299"   "221"   "71"    "663"   "scan=142155"   "scan=142155"   "35"

Output file: human_proteoform_glycosylation_sites_platelet_olinked.csv

uniprotkb_canonical_ac,glycosylation_site_uniprotkb,amino_acid,saccharide,glycosylation_type,xref_key,xref_id,src_xref_key,src_xref_id,glycosylation_subtype,composition,cell_id,cell_name,site_type,Notes,start_pos,end_pos,start_aa,end_aa,site_seq
P02675-1,251,Thr,G81399MY,O-linked,protein_xref_pubmed,38237698,protein_xref_glygen_ds,GLY_001051,,Hex(1),CL:0000233,platelet,Known,Thrombin-activated platelet releasate proteins were found to be enriched for a wide range of O-glycan modifications.,251,251,Thr,Thr,T
P07996-1,580,Ser,G29068FM,O-linked,protein_xref_pubmed,38237698,protein_xref_glygen_ds,GLY_001051,,HexNAc(1),CL:0000233,platelet,Known,Thrombin-activated platelet releasate proteins were found to be enriched for a wide range of O-glycan modifications.,580,580,Ser,Ser,S
Q13201-1,216,Thr,G49112ZN,O-linked,protein_xref_pubmed,38237698,protein_xref_glygen_ds,GLY_001051,O-fucosylation,Fuc(1),CL:0000233,platelet,Known,"Thrombin-activated platelet releasate proteins were found to be enriched for a wide range of O-glycan modifications. Mutation of O-fucosylation sites within the EMI domain of MMRN1 affects secretion: T216A reduces secretion to at least 50%, whereas mutation of T1055A almost abolishes secretion. Fucosylation of these sites is carried out either by POFUT1 or a novel POFUT, but it is not carried out by POFUT2. Fucosylation of MMRN1 at T216, may represent a new POFUT1 O-fucosylation motif (C1-X-X-X-X-T-X) that is missing the typical C-terminal cysteine residue.",216,216,Thr,Thr,T
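
A hedged sketch of how one output row might be assembled and written, using the 20 columns of the example header above. The names `build_row` and `to_csv` are hypothetical, the canonical-accession lookup is a one-entry stand-in for human_protein_masterlist.csv, and the longer MMRN1 note would be passed in through the `note` argument. Using the standard `csv` writer means a note containing commas is quoted automatically.

```python
import csv
import io

HEADERS = [
    "uniprotkb_canonical_ac", "glycosylation_site_uniprotkb", "amino_acid",
    "saccharide", "glycosylation_type", "xref_key", "xref_id",
    "src_xref_key", "src_xref_id", "glycosylation_subtype", "composition",
    "cell_id", "cell_name", "site_type", "Notes", "start_pos", "end_pos",
    "start_aa", "end_aa", "site_seq",
]

BASE_NOTE = (
    "Thrombin-activated platelet releasate proteins were found to be "
    "enriched for a wide range of O-glycan modifications."
)


def build_row(protein_name, canonical_lookup, position, amino_acid, site_seq,
              composition, saccharide, note=BASE_NOTE):
    """Combine per-site values with the constants that apply to all rows."""
    accession = protein_name.split("|")[1]  # text between the two pipes
    return {
        "uniprotkb_canonical_ac": canonical_lookup[accession],
        "glycosylation_site_uniprotkb": position,
        "amino_acid": amino_acid,
        "saccharide": saccharide,
        "glycosylation_type": "O-linked",
        "xref_key": "protein_xref_pubmed",
        "xref_id": "38237698",
        "src_xref_key": "protein_xref_glygen_ds",
        "src_xref_id": "GLY_001051",
        "glycosylation_subtype":
            "O-fucosylation" if composition.startswith("Fuc") else "",
        "composition": composition,
        "cell_id": "CL:0000233",
        "cell_name": "platelet",
        "site_type": "Known",  # capitalised as in the example output rows
        "Notes": note,
        "start_pos": position,
        "end_pos": position,
        "start_aa": amino_acid,
        "end_aa": amino_acid,
        "site_seq": site_seq,
    }


def to_csv(rows):
    """Serialise rows to CSV text with the header line first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=HEADERS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


row = build_row(
    ">>sp|P02675|FIBB_HUMAN Fibrinogen beta chain OS=Homo sapiens "
    "OX=9606 GN=FGB PE=1 SV=2",
    {"P02675": "P02675-1"},  # toy stand-in for the masterlist mapping
    251, "Thr", "T", "Hex(1)", "G81399MY",
)
output = to_csv([row])
```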

C-mannosylation sites

uniprotkb_canonical_ac,glycosylation_site_uniprotkb,amino_acid,saccharide,glycosylation_type,xref_key,xref_id,src_xref_key,src_xref_id,glycosylation_subtype,composition,cell_id,cell_name,site_type,Notes,start_pos,end_pos,start_aa,end_aa,site_seq
P07996-1,,Trp,G61491DK,C-linked,protein_xref_pubmed,38237698,protein_xref_glygen_ds,GLY_001051,C-mannosylation,Man(1),CL:0000233,platelet,Unknown,Thrombin-activated platelet releasate proteins were found to be enriched for a wide range of O-glycan modifications.,372,402,Trp,Trp,W
P07996-1,,Trp,G61491DK,C-linked,protein_xref_pubmed,38237698,protein_xref_glygen_ds,GLY_001051,C-mannosylation,Man(1),CL:0000233,platelet,Unknown,Thrombin-activated platelet releasate proteins were found to be enriched for a wide range of O-glycan modifications.,432,458,Trp,Trp,W
P07996-1,,Trp,G61491DK,C-linked,protein_xref_pubmed,38237698,protein_xref_glygen_ds,GLY_001051,C-mannosylation,Man(1),CL:0000233,platelet,Unknown,Thrombin-activated platelet releasate proteins were found to be enriched for a wide range of O-glycan modifications.,486,514,Trp,Trp,W
kmartinez834 commented 2 months ago

Looks good, just a couple things:

Source Field Output Field Notes
Glycans NHFAGNa saccharide For these compositions in Glycans NHFAGNa, use the following GlyTouCan ACs: Hex(1) = G81399MY, Fuc(1) = G96881BQ, HexNAc(1) = G29068FM, Hex(1)Fuc(1) = G42494UJ, Hex(1)Pent(2) = G42518JM, HexNAc(1)Hex(1)NeuAc(1) = G17015OC, HexNAc(1)Hex(1)NeuAc(2) = G23729WG
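
The accession list above could be held as a simple lookup, sketched here. The dictionary and function names are hypothetical, and the keys use the "HexNAc" spelling found in the source data rows. Raising on an unknown composition is a design choice for the sketch, so that unmapped glycans surface instead of producing blank saccharide values.

```python
# Composition string from "Glycans NHFAGNa" -> GlyTouCan accession,
# as listed in the comment above.
GLYTOUCAN_BY_COMPOSITION = {
    "Hex(1)": "G81399MY",
    "Fuc(1)": "G96881BQ",
    "HexNAc(1)": "G29068FM",
    "Hex(1)Fuc(1)": "G42494UJ",
    "Hex(1)Pent(2)": "G42518JM",
    "HexNAc(1)Hex(1)NeuAc(1)": "G17015OC",
    "HexNAc(1)Hex(1)NeuAc(2)": "G23729WG",
}


def saccharide_for(composition):
    """Return the GlyTouCan accession for a composition, or raise KeyError."""
    key = composition.strip()
    if key not in GLYTOUCAN_BY_COMPOSITION:
        raise KeyError(f"No GlyTouCan accession listed for {key!r}")
    return GLYTOUCAN_BY_COMPOSITION[key]


fucose_accession = saccharide_for("Fuc(1)")
```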
ubhuiyan commented 2 months ago

@kmartinez834 and @katewarner My apologies for not doing this sooner. I've converted the user submitted data to tsv format. The new source should be as follows:

Source: downloads/user_submission/platelet_o_linked/current/1-s2.0-S1535947624000070-mmc7.tsv