Proposed headers and instructions for the user dataset

katewarner commented 3 months ago

@jeet-vora @kmartinez834 Please review my draft instructions for creating the user dataset.

For now we are only going to extract the manually verified ("unambiguous") sites and c-mannosylation data from the user data file. The c-mannosylation data is all ambiguous and difficult to explain easily, so I've added those sites at the bottom of this ticket for Robel to integrate into the dataset.

Draft ticket

Source = 1-s2.0-S1535947624000070-mmc7.tsv
Output = human_proteoform_glycosylation_sites_platelet_olinked.csv

Mapping Files:
unreviewed/human_protein_masterlist.csv
unreviewed/human_protein_allsequences.fasta
misc/platelet_olinked_mapping.csv

The output file should have the following headers:

"uniprotkb_canonical_ac", "glycosylation_site_uniprotkb", "amino_acid", "saccharide", "glycosylation_type", "xref_key", "xref_id", "src_xref_key", "src_xref_id", "glycosylation_subtype", "composition", "cell_id", "cell_name", "site_type", "Notes", "start_pos", "end_pos", "start_aa", "end_aa", "site_seq", "peptide", "peptide_start_pos", "peptide_end_pos"

Instructions for Robel:

For this release we are only going to extract the manually verified ("unambiguous") sites and c-mannosylation data from the user data file, so ignore all rows that are not marked "Unambiguous" in the "Glycosylation site Localisation Assignment" column.

The instructions below are for extracting the unambiguous sites which are all Ser and Thr glycosylation sites.
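The row filter could be sketched roughly as follows. This assumes the source is tab-delimited with a header row, and that the comparison can be case-insensitive; the second sample row is made up purely for illustration, including its "Ambiguous" value.

```python
import csv
import io

LOCALISATION_COLUMN = "Glycosylation site Localisation Assignment"


def unambiguous_rows(handle):
    """Yield only the rows whose localisation assignment is 'Unambiguous'."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if row[LOCALISATION_COLUMN].strip().lower() == "unambiguous":
            yield row


# First data row is taken from the example input; the second is hypothetical.
sample = (
    "Peptide\tGlycans NHFAGNa\tGlycosylation site Localisation Assignment\n"
    "R.GKNWC[+57.021]AYVHT[+146.058]R.L\tFuc(1)\tUnambiguous\n"
    "K.AAAS[+203.079]TK.R\tHexNAc(1)\tAmbiguous\n"
)
kept = list(unambiguous_rows(io.StringIO(sample)))
print(len(kept), kept[0]["Glycans NHFAGNa"])
```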

The c-mannosylation data is all ambiguous and is spread across a few tables and the manuscript. It will be easier to provide instructions for these sites when we also start to extract all the ambiguous data from the dataset, so for this release I've added those sites at the bottom of this ticket for you to integrate straight into the dataset.

Once the glycosylation site position and amino acid are extracted, please check they are correct by mapping them to the proteins in our human_protein_allsequences.fasta file.
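One possible way to do this check is sketched below. It assumes the FASTA headers follow the UniProt style with the accession between the first two pipes (I have not confirmed this for human_protein_allsequences.fasta), and the function names and the toy record "X00001-1" are hypothetical, not real data.

```python
def read_fasta(lines):
    """Parse FASTA lines into {accession: sequence}.

    Assumes headers look like '>sp|ACCESSION|NAME ...'.
    """
    parts_by_accession = {}
    accession = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            accession = line.split("|")[1]
            parts_by_accession[accession] = []
        elif accession is not None:
            parts_by_accession[accession].append(line)
    return {ac: "".join(parts) for ac, parts in parts_by_accession.items()}


def site_matches(sequences, accession, position, one_letter):
    """Return True if the 1-based position in the protein holds the expected residue."""
    sequence = sequences.get(accession)
    if sequence is None or not 1 <= position <= len(sequence):
        return False
    return sequence[position - 1] == one_letter


# Toy record for illustration only; not a real protein.
toy_fasta = [">sp|X00001-1|TOY_HUMAN Toy protein", "MKTAYS", "GTW"]
sequences = read_fasta(toy_fasta)
print(site_matches(sequences, "X00001-1", 3, "T"))
```

Rows for which the check returns False would need to be looked at manually before they go into the output file.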

Source Field	Output Field	Notes
Protein Name	uniprotkb_canonical_ac	Extract the UniProt AC string between the two pipes and map it to the canonical AC using the "uniprotkb_canonical_ac" field of human_protein_masterlist.csv, e.g. "P07996-1" from >>sp|P07996|TSP1_HUMAN Thrombospondin-1 OS=Homo sapiens OX=9606 GN=THBS1 PE=1 SV=2
Peptide	glycosylation_site_uniprotkb	For unambiguous "S" or "T" sites, please calculate the site position using the "Peptide" and "Starting position" source fields. In the "Peptide" source field, within the sequence between the two flanking full stops, the glycosylation site position/s are each "S" or "T" immediately followed by a square bracket, e.g. this peptide has one "T" glycosylation site: GKNWC[+57.021]AYVHT[+146.058]R.
Peptide	amino_acid	If the residue at the glycosylation site is "S" add "Ser"; if it is "T" add "Thr".
Glycans NHFAGNa	saccharide	For these compositions in Glycans NHFAGNa, I've created a mapping file to their GlyTouCan ID: `misc/platelet_olinked_mapping.csv`
	glycosylation_type	If "amino_acid" output field is "Ser" or "Thr" add "O-linked"
	xref_key	All rows: "protein_xref_pubmed"
	xref_id	All rows: "38237698"
	src_xref_key	All rows: "protein_xref_glygen_ds"
	src_xref_id	All rows: "GLY_001051"
Glycans NHFAGNa	glycosylation_subtype	If composition begins with "Fuc" add "O-fucosylation" otherwise leave blank
Glycans NHFAGNa	composition	See "glycosylation_type" output field, if O-linked copy directly from Source
	cell_id	All rows: "CL:0000233"
	cell_name	All rows: "platelet"
Glycosylation site Localisation Assignment	site_type	All rows: "known"
	Notes	If "uniprotkb_canonical_ac" output field is Q13201-1, add the following note: "Thrombin-activated platelet releasate proteins were found to be enriched for a wide range of O-glycan modifications. Mutation of O-fucosylation sites within the EMI domain of MMRN1 affects secretion: T216A reduces secretion to at least 50%, whereas mutation of T1055A almost abolishes secretion. Fucosylation of these sites is carried out either by POFUT1 or a novel POFUT, but it is not carried out by POFUT2. Fucosylation of MMRN1 at T216, may represent a new POFUT1 O-fucosylation motif (C1-X-X-X-X-T-X) that is missing the typical C-terminal cysteine residue." If "uniprotkb_canonical_ac" output field is not Q13201-1, add the following note: "Thrombin-activated platelet releasate proteins were found to be enriched for a wide range of O-glycan modifications."
	start_pos	Start_pos is the same as "glycosylation_site_uniprotkb"
	end_pos	End_pos is the same as "glycosylation_site_uniprotkb"
	start_aa	Same as "amino_acid" output field.
	end_aa	Same as "amino_acid" output field.
	site_seq	See "amino_acid" output field. If output is "Ser" add "S", or if "Thr" add "T"
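The site calculation in the table above can be sketched as follows. This is only an illustration, with hypothetical function names: it assumes "Starting position" is the 1-based protein position of the first residue after the leading flanking residue, which is consistent with the three example rows in this ticket (251, 580 and 216). Bracketed masses are skipped when counting, and a bracket after "C" (e.g. C[+57.021]) is not treated as a glycosylation site.

```python
def extract_accession(protein_name):
    """Return the accession between the first two pipes of the 'Protein Name' field."""
    return protein_name.split("|")[1]


def glyco_sites(peptide, start_pos):
    """Return [(protein_position, one_letter), ...] for each S/T followed by '['."""
    core = peptide
    # Drop the flanking residues, e.g. 'R.' at the start and '.L' at the end.
    if len(core) > 4 and core[1] == "." and core[-2] == ".":
        core = core[2:-2]
    sites = []
    position = start_pos - 1
    i = 0
    while i < len(core):
        ch = core[i]
        if ch == "[":
            # Skip the whole modification mass, e.g. '[+146.058]'.
            i = core.index("]", i) + 1
            continue
        position += 1
        if ch in ("S", "T") and core[i + 1:i + 2] == "[":
            sites.append((position, ch))
        i += 1
    return sites


header = ">>sp|Q13201|MMRN1_HUMAN Multimerin-1 OS=Homo sapiens OX=9606 GN=MMRN1 PE=1 SV=3"
print(extract_accession(header))
print(glyco_sites("R.GKNWC[+57.021]AYVHT[+146.058]R.L", 207))
```

The extracted accession still has to be mapped to its canonical AC with human_protein_masterlist.csv, which is not shown here.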

Example

Input file: downloads/user_submission/platelet_olinked/current/1-s2.0-S1535947624000070-mmc7.tsv

"Peptide"   "Glycans NHFAGNa"   "Glycosylation site Localisation Assignment"    "Modification Type(s)"  "|Log Prob|"    "Delta Mod" "Protein Name"  "Observed m/z"  "z" "Observed (M+H)"    "Calc. mass (M+H)"  "Off-by-x error"    "Mass error (ppm)"  "Starting position" "Cleavage"  "Score" "Delta" "# of unique peptides"  "Protein DB number" "Comment"   "Scan #"    "Scan Time"
"R.KGGET[+162.053]SEMYLIQPDSSVKPYR.V"   "Hex(1)"    "Unambiguous"   "T[+162]"   "9" "426"   ">>sp|P02675|FIBB_HUMAN Fibrinogen beta chain OS=Homo sapiens OX=9606 GN=FGB PE=1 SV=2" "637"   "4" "2547"  "2547"  "0" "0" "247"   "Specific"  "426"   "426"   "101"   "2107"  "scan=146344"   "scan=146344"   "46"
"K.C[+57.021]GAC[+57.021]PPGYS[+203.079]GNGIQC[+57.021]TDVDEC[+57.021]KEVPDAC[+57.021]FNHNGEHR.C"   "HexNAc(1)" "Unambiguous"   "C[+57]*5, S[+203]" "7" "259"   ">>sp|P07996|TSP1_HUMAN Thrombospondin-1 OS=Homo sapiens OX=9606 GN=THBS1 PE=1 SV=2"    "1078"  "4" "4310"  "4310"  "0" "0" "572"   "Specific"  "259"   "259"   "175"   "2657"  "scan=144741"   "scan=144741"   "41"
"R.GKNWC[+57.021]AYVHT[+146.058]R.L"    "Fuc(1)"    "Unambiguous"   "C[+57], T[+146]"   "4" "221"   ">>sp|Q13201|MMRN1_HUMAN Multimerin-1 OS=Homo sapiens OX=9606 GN=MMRN1 PE=1 SV=3"   "513"   "3" "1537"  "1537"  "0" "0" "207"   "Specific"  "299"   "221"   "71"    "663"   "scan=142155"   "scan=142155"   "35"

Output file: human_proteoform_glycosylation_sites_platelet_olinked.csv

uniprotkb_canonical_ac,glycosylation_site_uniprotkb,amino_acid,saccharide,glycosylation_type,xref_key,xref_id,src_xref_key,src_xref_id,glycosylation_subtype,composition,cell_id,cell_name,site_type,Notes,start_pos,end_pos,start_aa,end_aa,site_seq
P02675-1,251,Thr,G81399MY,O-linked,protein_xref_pubmed,38237698,protein_xref_glygen_ds,GLY_001051,,Hex(1),CL:0000233,platelet,Known,Thrombin-activated platelet releasate proteins were found to be enriched for a wide range of O-glycan modifications.,251,251,Thr,Thr,T
P07996-1,580,Ser,G29068FM,O-linked,protein_xref_pubmed,38237698,protein_xref_glygen_ds,GLY_001051,,HexNAc(1),CL:0000233,platelet,Known,Thrombin-activated platelet releasate proteins were found to be enriched for a wide range of O-glycan modifications.,580,580,Ser,Ser,S
Q13201-1,216,Thr,G49112ZN,O-linked,protein_xref_pubmed,38237698,protein_xref_glygen_ds,GLY_001051,O-fucosylation,Fuc(1),CL:0000233,platelet,Known,"Thrombin-activated platelet releasate proteins were found to be enriched for a wide range of O-glycan modifications. Mutation of O-fucosylation sites within the EMI domain of MMRN1 affects secretion: T216A reduces secretion to at least 50%, whereas mutation of T1055A almost abolishes secretion. Fucosylation of these sites is carried out either by POFUT1 or a novel POFUT, but it is not carried out by POFUT2. Fucosylation of MMRN1 at T216, may represent a new POFUT1 O-fucosylation motif (C1-X-X-X-X-T-X) that is missing the typical C-terminal cysteine residue.",216,216,Thr,Thr,T
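For reference, the constant and derived columns in the rows above could be filled in along these lines. This is a rough sketch with a hypothetical function name; it leaves out "uniprotkb_canonical_ac", "saccharide", "site_type" and "Notes", which come from the mapping files and the table rules. The last few lines only show that Python's csv.writer quotes a field containing commas, which the long MMRN1 note needs; the note text used there is a placeholder.

```python
import csv
import io

# Values listed as "All rows" in the mapping table.
CONSTANT_COLUMNS = {
    "xref_key": "protein_xref_pubmed",
    "xref_id": "38237698",
    "src_xref_key": "protein_xref_glygen_ds",
    "src_xref_id": "GLY_001051",
    "cell_id": "CL:0000233",
    "cell_name": "platelet",
}
THREE_LETTER = {"S": "Ser", "T": "Thr"}


def derived_columns(site_letter, position, composition):
    """Build the columns that follow directly from the site residue and composition."""
    amino_acid = THREE_LETTER[site_letter]
    row = dict(CONSTANT_COLUMNS)
    row.update({
        "glycosylation_site_uniprotkb": position,
        "amino_acid": amino_acid,
        "glycosylation_type": "O-linked",
        "glycosylation_subtype": "O-fucosylation" if composition.startswith("Fuc") else "",
        "composition": composition,
        "start_pos": position,
        "end_pos": position,
        "start_aa": amino_acid,
        "end_aa": amino_acid,
        "site_seq": site_letter,
    })
    return row


mmrn1 = derived_columns("T", 216, "Fuc(1)")
fibb = derived_columns("T", 251, "Hex(1)")

# csv.writer wraps fields containing a comma in double quotes (placeholder note text).
buffer = io.StringIO()
csv.writer(buffer).writerow(["Q13201-1", "a note, with a comma"])
quoted_line = buffer.getvalue().strip()
print(quoted_line)
```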

C-mannosylation sites

uniprotkb_canonical_ac,glycosylation_site_uniprotkb,amino_acid,saccharide,glycosylation_type,xref_key,xref_id,src_xref_key,src_xref_id,glycosylation_subtype,composition,cell_id,cell_name,site_type,Notes,start_pos,end_pos,start_aa,end_aa,site_seq
P07996-1,,Trp,G61491DK,C-linked,protein_xref_pubmed,38237698,protein_xref_glygen_ds,GLY_001051,C-mannosylation,Man(1),CL:0000233,platelet,Unknown,Thrombin-activated platelet releasate proteins were found to be enriched for a wide range of O-glycan modifications.,372,402,Trp,Trp,W
P07996-1,,Trp,G61491DK,C-linked,protein_xref_pubmed,38237698,protein_xref_glygen_ds,GLY_001051,C-mannosylation,Man(1),CL:0000233,platelet,Unknown,Thrombin-activated platelet releasate proteins were found to be enriched for a wide range of O-glycan modifications.,432,458,Trp,Trp,W
P07996-1,,Trp,G61491DK,C-linked,protein_xref_pubmed,38237698,protein_xref_glygen_ds,GLY_001051,C-mannosylation,Man(1),CL:0000233,platelet,Unknown,Thrombin-activated platelet releasate proteins were found to be enriched for a wide range of O-glycan modifications.,486,514,Trp,Trp,W

kmartinez834 commented 2 months ago

Looks good, just a couple things:

For the following row, I recommend making a mapping file for Robel. See /data/projects/glygen/generated/misc/pdc_glytoucan_mapping.csv as an example. You can name it platelet_o_linked_mapping.csv

Source Field	Output Field	Notes
Glycans NHFAGNa	saccharide	For these compositions in Glycans NHFAGNa, use the following GlyTouCan ACs: Hex(1) = G81399MY, Fuc(1) = G96881BQ, HexNAc(1) = G29068FM, Hex(1)Fuc(1) = G42494UJ, Hex(1)Pent(2) = G42518JM, HexNAc(1)Hex(1)NeuAc(1) = G17015OC, HexNAc(1)Hex(1)NeuAc(2) = G23729WG

The input file listed is not formatted correctly. @ubhuiyan can you run the downloads/download_scripts/xlsx_to_tsv.py file (make sure to navigate to the downloads/user_submission/platelet_o_linked/current/ folder first)? Please keep the csv files in this folder (not compiled, per Robel).

ubhuiyan commented 2 months ago

@kmartinez834 and @katewarner My apologies for not doing this sooner. I've converted the user submitted data to tsv format. The new source should be as follows:

Source: downloads/user_submission/platelet_o_linked/current/1-s2.0-S1535947624000070-mmc7.tsv
