Closed by ubhuiyan 2 weeks ago
@kmartinez834 I compared the glyconnect and the unicarbkb headers and pulled the common ones because I figured we wanted those. It appears there are some unique column headers depending on what type of information the source file contains. Could you check whether what I have right now is alright? And then maybe suggest how I might determine which unique headers I'd need?
Looks good, there are just a couple of things missing. The following headers are present in the source file downloads/embl/current/glygen_upload.csv and need to be added (the proposed headers should match the glyconnect/unicarbkb files):
gene_name
taxonomy_id
taxonomy_species
composition
glycan_mass
glycan_type
Next, can you fill in this table showing which source file header maps to which output file header:
Source = glygen_upload.csv
Output = *_proteoform_glycosylation_sites_embl.csv
Mapping Files: unreviewed/human_protein_masterlist.csv, unreviewed/mouse_protein_masterlist.csv, misc/n_sequon_info.csv, unreviewed/*_protein_glycosylation_motifs.csv
The output file should have the following headers:
uniprotkb_canonical_ac, src_xref_id, xref_id, source_gene_name, composition, saccharide, glycosylation_site_uniprotkb, amino_acid, glycosylation_type, start_pos, end_pos, start_aa, end_aa, site_seq, taxonomy_id, taxonomy_species, glycan mass, glycan_type, abundance, source_tissue_id, source_tissue_name, source_cell_line_cellosaurus_id, source_cell_line_cellosaurus_name, n_sequon, n_sequon_type, evidence
Source Field | Output Field | Notes |
---|---|---|
uniprotkb_ac | uniprotkb_canonical_ac | Map to canonical ac using *_protein_masterlist.csv fields "gene_name" and "uniprotkb_canonical_ac" |
- | src_xref_id | |
- | xref_id | |
gene_name | source_gene_name | copy directly from Source |
composition | composition | Extract comp string before " %": HexNAc(2)Hex(8) % 1702.5814 |
composition | saccharide | Robel to map GlyTouCan accession from composition string |
glycosylation_site_uniprotkb | glycosylation_site_uniprotkb | copy directly from Source |
amino_acid | amino_acid | Asn for all rows |
glycosylation_type | glycosylation_type | N-linked for all rows |
start_pos | start_pos | Copy directly from Source |
end_pos | end_pos | Copy directly from Source |
peptide | start_aa | the beginning aa abbreviation in "peptide" column from Source |
peptide | end_aa | the end aa abbreviation in "peptide" column from Source |
peptide | site_seq | Copy directly from Source |
taxonomy_id | taxonomy_id | For human, all rows = 9606. For mouse, all rows = 10090. |
taxonomy_species | taxonomy_species | For human, all rows = homo sapiens. For mouse, all rows = mus musculus. |
glycan_mass | glycan mass | Copy directly from Source |
glycan_type | glycan_type | Copy directly from Source |
abundance | abundance | This column will be empty |
source_tissue_id | source_tissue_id | Copy directly from Source |
source_tissue | source_tissue_name | Copy directly from Source |
source_cell_line_cellosaurus_id | source_cell_line_cellosaurus_id | copy directly from Source |
source_cell_line_cellosaurus_name | source_cell_line_cellosaurus_name | copy directly from Source |
- | n_sequon | Map using *_protein_glycosylation_motifs.csv |
- | n_sequon_type | Map using misc/n_sequon_info.csv |
evidence | evidence | copy directly from Source |
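
The "Extract comp string before " %"" note in the table can be sketched as below. This is a minimal illustration, assuming the source `composition` value always uses the literal separator `" %"` as in the example `HexNAc(2)Hex(8) % 1702.5814`; the function name `split_composition` is hypothetical and not part of any existing script.

```python
def split_composition(raw: str) -> tuple[str, str]:
    """Split a source composition value at the first " %".

    Returns (composition_string, trailing_text). If the separator is
    missing, the whole value is returned as the composition string and
    the trailing text is empty.
    """
    comp, _sep, trailing = raw.partition(" %")
    return comp.strip(), trailing.strip()


if __name__ == "__main__":
    print(split_composition("HexNAc(2)Hex(8) % 1702.5814"))
```

Only the first element is needed for the output `composition` column; the trailing part is returned so it can be checked against the source file rather than silently discarded.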
--
@kmartinez834 I tried to fill the table out as best I could. I'm a little confused about where I would get the xref information.
Also, I'm assuming some columns will be empty. Should we include them anyway?
Proteoform dataset xrefs populate the evidence badges you see on the glycan and protein detail pages:
$ grep A1A5C7 reviewed/human_proteoform_glycosylation_sites_unicarbkb.csv
"A1A5C7-1","","Thr","G57321FI","O-linked","protein_xref_unicarbkb","A1A5C7","protein_xref_unicarbkb","A1A5C7","142","143","Thr","Thr","TT","known_site","fuzzy_legacy_cph_start_end_positions","G57321FI","HexNAc1Hex0dHex0NeuAc0NeuGc0Pent0S0P0KDN0HexA0","Data provided by GlycoDomainViewer","tcl_vva_try with KO entrez:29071","","","","","","CVCL_0553","T-47D","",""
"A1A5C7-1","","Thr","G57321FI","O-linked","protein_xref_pubmed","23584533","protein_xref_unicarbkb","A1A5C7","142","143","Thr","Thr","TT","known_site","fuzzy_legacy_cph_start_end_positions","G57321FI","HexNAc1Hex0dHex0NeuAc0NeuGc0Pent0S0P0KDN0HexA0","Data provided by GlycoDomainViewer","tcl_vva_try with KO entrez:29071","","","","","","CVCL_0553","T-47D","",""
"A1A5C7-1","","Thr","G57321FI","O-linked","protein_xref_doi","10.1038/emboj.2013.79","protein_xref_unicarbkb","A1A5C7","142","143","Thr","Thr","TT","known_site","fuzzy_legacy_cph_start_end_positions","G57321FI","HexNAc1Hex0dHex0NeuAc0NeuGc0Pent0S0P0KDN0HexA0","Data provided by GlycoDomainViewer","tcl_vva_try with KO entrez:29071","","","","","","CVCL_0553","T-47D","",""
👉 In the dataset you have essentially three copies of the same row, with the only difference being the xref_key and xref_id. This translates to one row in the "Associated Protein" section of the Glycan Details page, with 3 different evidence sources:
Ex. https://glygen.org/glycan/G57321FI
👉 In the API, you'll get 3 different evidence objects:
https://api.glygen.org/glycan/detail/G57321FI
{
"uniprot_canonical_ac": "A1A5C7-1",
"evidence": [
{
"database": "UniCarbKB",
"id": "A1A5C7"
},
{
"database": "PubMed",
"id": "23584533",
"url": "https://glygen.org/publication/PubMed/23584533"
},
{
"database": "DOI",
"id": "10.1038/emboj.2013.79",
"url": "https://glygen.org/publication/DOI/10.1038/emboj.2013.79"
}
]
}
👉 In the "Glycosylation" section of the Protein Details page, there are also 3 different evidence sources:
Ex. https://glygen.org/protein/A1A5C7-1
👉 In the API, you'll get 3 different evidence objects:
https://api.glygen.org/protein/detail/A1A5C7-1
{
"glytoucan_ac": "G57321FI",
"type": "O-linked",
"site_category": "reported_with_glycan",
"site_seq": "TT",
"relation": "attached",
"comment": "Data provided by GlycoDomainViewer",
"start_pos": 142,
"start_aa": "Thr",
"end_pos": 143,
"end_aa": "Thr",
"evidence": [
{
"id": "A1A5C7",
"database": "UniCarbKB"
},
{
"id": "23584533",
"database": "PubMed",
"url": "https://glygen.org/publication/PubMed/23584533"
},
{
"id": "10.1038/emboj.2013.79",
"database": "DOI",
"url": "https://glygen.org/publication/DOI/10.1038/emboj.2013.79"
}
]
}
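
To make the relationship between dataset rows and evidence objects concrete, here is a rough sketch of the collapsing step: rows that are identical except for xref_key and xref_id become one record carrying several evidence entries. This is only an illustration of the idea, not the actual GlyGen backend code. The function name `collapse_evidence` is hypothetical, and the key-to-database mapping covers only the three xref_key values shown in the grep output above.

```python
# Display names for the three xref_key values seen in the example rows.
# Any other key is passed through unchanged (an assumption for this sketch).
XREF_DATABASE = {
    "protein_xref_unicarbkb": "UniCarbKB",
    "protein_xref_pubmed": "PubMed",
    "protein_xref_doi": "DOI",
}


def collapse_evidence(rows):
    """Group rows that differ only in xref_key/xref_id into one record each."""
    grouped = {}
    for row in rows:
        # Everything except the two xref fields identifies the site record.
        site = tuple(
            (key, value)
            for key, value in sorted(row.items())
            if key not in ("xref_key", "xref_id")
        )
        record = grouped.setdefault(site, {**dict(site), "evidence": []})
        record["evidence"].append(
            {
                "database": XREF_DATABASE.get(row["xref_key"], row["xref_key"]),
                "id": row["xref_id"],
            }
        )
    return list(grouped.values())
```

Feeding in the three A1A5C7-1 rows above (as dicts keyed by column header) would give a single record with three evidence entries, in the same order as the input rows.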
🌟 One final note: The src_xref_key and src_xref_id fields refer to the source database's accession for that row. Since the EMBL dataset is provided by a lab (and not a database), we won't have these fields in the new datasets.
Here are my recommended changes to the drafted ticket:
Add to mapping files: glytoucan/current/export/names.tsv
Output headers should be:
"uniprotkb_canonical_ac","glycosylation_site_uniprotkb","amino_acid","saccharide","glycosylation_type","source_glycan_type","xref_key","xref_id","source_gene_name","start_pos","end_pos","start_aa","end_aa","site_seq","composition","composition_mass","source_tissue_id","source_tissue_name","source_cell_line_cellosaurus_id","source_cell_line_cellosaurus_name","n_sequon","n_sequon_type"
Removed: abundance, glycan mass, glycan_type, src_xref_id, taxonomy_id, taxonomy_species, evidence
Added: source_glycan_type, composition_mass, xref_key
Order the table the same as the output headers list. I also made a few changes to the notes/fields below:
Source Field | Output Field | Notes |
---|---|---|
uniprotkb_ac (or gene_name if uniprotkb_ac can't be mapped) | uniprotkb_canonical_ac | Map to canonical ac using *_protein_masterlist.csv field "uniprotkb_canonical_ac" (or use "gene_name" if uniprotkb_ac can't be mapped) |
glycosylation_site_uniprotkb | glycosylation_site_uniprotkb | Copy directly from Source |
amino_acid | amino_acid | Copy directly from source |
composition | saccharide | Map from Byonic composition string to GlyTouCan accession using names.tsv |
glycosylation_type | glycosylation_type | Copy directly from Source |
glycan_type | source_glycan_type | Copy directly from source |
- | xref_key | All rows: "protein_xref_doi" |
- | xref_id | All rows: "10.1101/2023.09.13.557529v1" |
gene_name | source_gene_name | Copy directly from Source |
glycosylation_site_uniprotkb | start_pos | Copy directly from Source |
glycosylation_site_uniprotkb | end_pos | Copy directly from Source |
amino_acid | start_aa | Copy directly from Source |
amino_acid | end_aa | Copy directly from Source |
peptide | site_seq | Copy directly from Source |
composition | composition | Extract comp string before " %" Ex. HexNAc(2)Hex(8) % 1702.5814 |
glycan_mass | composition_mass | Copy directly from Source |
source_tissue_id | source_tissue_id | Copy directly from Source |
source_tissue | source_tissue_name | Copy directly from Source |
source_cell_line_cellosaurus_id | source_cell_line_cellosaurus_id | copy directly from Source |
source_cell_line_cellosaurus_name | source_cell_line_cellosaurus_name | copy directly from Source |
- | n_sequon | Map using *_protein_glycosylation_motifs.csv |
- | n_sequon_type | Map using misc/n_sequon_info.csv |
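
As a rough sketch of how the table above could translate into per-row processing: the function below builds one output row from one glygen_upload.csv row. Several things here are assumptions, not the actual pipeline. The three lookup dicts (`canonical_by_ac`, `canonical_by_gene`, `glytoucan_by_comp`) are hypothetical and would have to be built from the masterlist files and names.tsv, whose exact layouts are not described in this thread. The n_sequon fields are left blank because the motif file layout is not shown here either.

```python
# Constant values for every row, taken from the mapping table.
XREF_KEY = "protein_xref_doi"
XREF_ID = "10.1101/2023.09.13.557529v1"

OUTPUT_HEADERS = [
    "uniprotkb_canonical_ac", "glycosylation_site_uniprotkb", "amino_acid",
    "saccharide", "glycosylation_type", "source_glycan_type", "xref_key",
    "xref_id", "source_gene_name", "start_pos", "end_pos", "start_aa",
    "end_aa", "site_seq", "composition", "composition_mass",
    "source_tissue_id", "source_tissue_name",
    "source_cell_line_cellosaurus_id", "source_cell_line_cellosaurus_name",
    "n_sequon", "n_sequon_type",
]

# Output field -> source field, for every "Copy directly from Source" row.
DIRECT_COPIES = {
    "glycosylation_site_uniprotkb": "glycosylation_site_uniprotkb",
    "amino_acid": "amino_acid",
    "glycosylation_type": "glycosylation_type",
    "source_glycan_type": "glycan_type",
    "source_gene_name": "gene_name",
    "start_pos": "glycosylation_site_uniprotkb",
    "end_pos": "glycosylation_site_uniprotkb",
    "start_aa": "amino_acid",
    "end_aa": "amino_acid",
    "site_seq": "peptide",
    "composition_mass": "glycan_mass",
    "source_tissue_id": "source_tissue_id",
    "source_tissue_name": "source_tissue",
    "source_cell_line_cellosaurus_id": "source_cell_line_cellosaurus_id",
    "source_cell_line_cellosaurus_name": "source_cell_line_cellosaurus_name",
}


def map_row(src, canonical_by_ac, canonical_by_gene, glytoucan_by_comp):
    """Build one output row (dict in OUTPUT_HEADERS order) from one source row."""
    out = {dst: src[field] for dst, field in DIRECT_COPIES.items()}

    # Try the accession first, then fall back to the gene name.
    out["uniprotkb_canonical_ac"] = (
        canonical_by_ac.get(src["uniprotkb_ac"])
        or canonical_by_gene.get(src["gene_name"], "")
    )

    # Keep only the composition string before " %".
    comp = src["composition"].partition(" %")[0].strip()
    out["composition"] = comp
    out["saccharide"] = glytoucan_by_comp.get(comp, "")

    out["xref_key"] = XREF_KEY
    out["xref_id"] = XREF_ID

    # Placeholders: to be filled from the motif and n_sequon_info files.
    out["n_sequon"] = ""
    out["n_sequon_type"] = ""

    return {header: out[header] for header in OUTPUT_HEADERS}
```

Returning the dict in `OUTPUT_HEADERS` order means it can be passed straight to `csv.DictWriter(fh, fieldnames=OUTPUT_HEADERS)`, which keeps the column order tied to a single list.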
Let me know if you have any questions. When you're ready, create a new ticket for Robel with the processing instructions.
Proposed Headers:
source_cell_line_cellosaurus_id, source_tissue_id, end_pos, source_tissue_name, xref_id, glycosylation_type, src_xref_id, start_aa, end_aa, saccharide, amino_acid, source_cell_line_cellosaurus_name, n_sequon_type, uniprotkb_canonical_ac, n_sequon, xref_key, src_xref_key, site_seq, start_pos, glycosylation_site_uniprotkb