ubhuiyan commented 3 weeks ago

Proposed Headers:

~~source_cell_line_cellosaurus_id~~ ~~source_tissue_id~~ ~~end_pos~~ ~~source_tissue_name~~ xref_id ~~glycosylation_type~~ src_xref_id ~~start_aa~~ ~~end_aa~~ ~~saccharide~~ ~~amino_acid~~ ~~source_cell_line_cellosaurus_name~~ ~~n_sequon_type~~ uniprotkb_canonical_ac ~~n_sequon~~ xref_key src_xref_key ~~site_seq~~ ~~start_pos~~ ~~glycosylation_site_uniprotkb~~

ubhuiyan commented 3 weeks ago

@kmartinez834 I compared the glyconnect and unicarbkb headers and pulled the common ones because I figured we wanted those. It appears there are some unique column headers depending on what type of information the source file contains. Could you check whether what I have right now is alright? Then maybe tell me how I might determine which unique headers I'd need.
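
One way to do the header comparison described above is to read only the header row of each file and split the columns into shared and file-specific sets. This is a minimal sketch, not the script actually used: the function names are hypothetical, and the two header rows below are toy stand-ins rather than the real glyconnect/unicarbkb headers.

```python
import csv
import io


def read_headers(handle):
    """Return the first (header) row of a CSV file-like object as a list."""
    return next(csv.reader(handle))


def compare_headers(headers_a, headers_b):
    """Split two header lists into shared and file-specific columns, keeping order."""
    set_a, set_b = set(headers_a), set(headers_b)
    common = [h for h in headers_a if h in set_b]
    only_a = [h for h in headers_a if h not in set_b]
    only_b = [h for h in headers_b if h not in set_a]
    return common, only_a, only_b


# Toy header rows (assumed for illustration; replace with open(<path>) on the real files).
glyconnect = io.StringIO("uniprotkb_canonical_ac,xref_key,xref_id,source_tissue_id\n")
unicarbkb = io.StringIO("uniprotkb_canonical_ac,xref_key,xref_id,composition\n")

common, only_glyconnect, only_unicarbkb = compare_headers(
    read_headers(glyconnect), read_headers(unicarbkb)
)
print("common:", common)
print("only in first file:", only_glyconnect)
print("only in second file:", only_unicarbkb)
```

The file-specific lists are the candidates for "unique" headers; whether each one belongs in the new dataset still depends on what the source file actually contains.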

kmartinez834 commented 3 weeks ago

Looks good, there are just a couple of things missing. The following headers are present in the source file downloads/embl/current/glygen_upload.csv and need to be added (the proposed headers should match the glyconnect/unicarbkb files):

gene_name ~~taxonomy_id~~ ~~taxonomy_species~~ ~~composition~~ ~~glycan_mass~~ ~~glycan_type~~

Next, can you fill in this table, which shows which source file header maps to which output file header:

Drafted Ticket:

Source = glygen_upload.csv
Output = *_proteoform_glycosylation_sites_embl.csv

Mapping Files: unreviewed/human_protein_masterlist.csv, unreviewed/mouse_protein_masterlist.csv, misc/n_sequon_info.csv, unreviewed/*_protein_glycosylation_motifs.csv

The output file should have the following headers:

uniprotkb_canonical_ac, src_xref_id, xref_id, source_gene_name, composition, saccharide, glycosylation_site_uniprotkb, amino_acid, glycosylation_type, start_pos, end_pos, start_aa, end_aa, site_seq, taxonomy_id, taxonomy_species, glycan mass, glycan_type, abundance, source_tissue_id, source_tissue_name, source_cell_line_cellosaurus_id, source_cell_line_cellosaurus_name, n_sequon, n_sequon_type, evidence

| Source Field | Output Field | Notes |
| --- | --- | --- |
| uniprotkb_ac | uniprotkb_canonical_ac | Map to canonical ac using *_protein_masterlist.csv fields "gene_name" and "uniprotkb_canonical_ac" |
| - | src_xref_id | |
| - | xref_id | |
| gene_name | source_gene_name | Copy directly from Source |
| composition | composition | Extract comp string before " %": HexNAc(2)Hex(8) % 1702.5814 |
| composition | saccharide | Robel to map GlyTouCan accession from composition string |
| glycosylation_site_uniprotkb | glycosylation_site_uniprotkb | Copy directly from Source |
| amino_acid | amino_acid | Asn for all rows |
| glycosylation_type | glycosylation_type | N-linked for all rows |
| start_pos | start_pos | Copy directly from Source |
| end_pos | end_pos | Copy directly from Source |
| peptide | start_aa | The beginning aa abbreviation in "peptide" column from Source |
| peptide | end_aa | The end aa abbreviation in "peptide" column from Source |
| peptide | site_seq | Copy directly from Source |
| taxonomy_id | taxonomy_id | For human, all rows = 9606. For mouse, all rows = 10090. |
| taxonomy_species | taxonomy_species | For human, all rows = homo sapiens. For mouse, all rows = mus musculus. |
| glycan_mass | glycan mass | Copy directly from Source |
| glycan_type | glycan_type | Copy directly from Source |
| abundance | abundance | This column will be empty |
| source_tissue_id | source_tissue_id | Copy directly from Source |
| source_tissue | source_tissue_name | Copy directly from Source |
| source_cell_line_cellosaurus_id | source_cell_line_cellosaurus_id | Copy directly from Source |
| source_cell_line_cellosaurus_name | source_cell_line_cellosaurus_name | Copy directly from Source |
| - | n_sequon | Map using *_protein_glycosylation_motifs.csv |
| - | n_sequon_type | Map using misc/n_sequon_info.csv |
| evidence | evidence | Copy directly from Source |
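
The composition note in the table (keep only the part before " %") could look something like the sketch below. The function name is hypothetical, and the handling of a value with no " %" in it is an assumption, since the table does not say what to do in that case.

```python
def split_composition(raw):
    """Split a source composition value at the first " %".

    Returns (composition, trailing), where trailing is whatever follows
    the " %" marker, or an empty string if the marker is absent (assumed
    behaviour; the ticket does not cover that case).
    """
    composition, marker, trailing = raw.partition(" %")
    return composition.strip(), trailing.strip() if marker else ""


composition, trailing = split_composition("HexNAc(2)Hex(8) % 1702.5814")
print(composition)
print(trailing)
```

Using `str.partition` rather than `split` keeps the function safe on values that contain no " %" at all.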

--

ubhuiyan commented 3 weeks ago

@kmartinez834 I tried to fill the table out as best I could. I'm a little confused about where I would get the xref information.

Also, I'm assuming some columns will be empty. Should we include them anyways?

kmartinez834 commented 2 weeks ago

Proteoform dataset xrefs populate the evidence badges you see on the glycan and protein detail pages:

$ grep A1A5C7 reviewed/human_proteoform_glycosylation_sites_unicarbkb.csv
"A1A5C7-1","","Thr","G57321FI","O-linked","protein_xref_unicarbkb","A1A5C7","protein_xref_unicarbkb","A1A5C7","142","143","Thr","Thr","TT","known_site","fuzzy_legacy_cph_start_end_positions","G57321FI","HexNAc1Hex0dHex0NeuAc0NeuGc0Pent0S0P0KDN0HexA0","Data provided by GlycoDomainViewer","tcl_vva_try with KO entrez:29071","","","","","","CVCL_0553","T-47D","",""
"A1A5C7-1","","Thr","G57321FI","O-linked","protein_xref_pubmed","23584533","protein_xref_unicarbkb","A1A5C7","142","143","Thr","Thr","TT","known_site","fuzzy_legacy_cph_start_end_positions","G57321FI","HexNAc1Hex0dHex0NeuAc0NeuGc0Pent0S0P0KDN0HexA0","Data provided by GlycoDomainViewer","tcl_vva_try with KO entrez:29071","","","","","","CVCL_0553","T-47D","",""
"A1A5C7-1","","Thr","G57321FI","O-linked","protein_xref_doi","10.1038/emboj.2013.79","protein_xref_unicarbkb","A1A5C7","142","143","Thr","Thr","TT","known_site","fuzzy_legacy_cph_start_end_positions","G57321FI","HexNAc1Hex0dHex0NeuAc0NeuGc0Pent0S0P0KDN0HexA0","Data provided by GlycoDomainViewer","tcl_vva_try with KO entrez:29071","","","","","","CVCL_0553","T-47D","",""

👉 In the dataset you have essentially three copies of the same row, with the only difference being the xref_key and xref_id. This translates to one row in the "Associated Protein" section of the Glycan Details page, with 3 different evidence sources:

Ex. https://glygen.org/glycan/G57321FI

👉 In the API, you'll get 3 different evidence objects:

https://api.glygen.org/glycan/detail/G57321FI

{
      "uniprot_canonical_ac": "A1A5C7-1",
      "evidence": [
        {
          "database": "UniCarbKB",
          "id": "A1A5C7"
        },
        {
          "database": "PubMed",
          "id": "23584533",
          "url": "https://glygen.org/publication/PubMed/23584533"
        },
        {
          "database": "DOI",
          "id": "10.1038/emboj.2013.79",
          "url": "https://glygen.org/publication/DOI/10.1038/emboj.2013.79"
        }

👉 In the "Glycosylation" section of the Protein Details page, there are also 3 different evidence sources:

Ex. https://glygen.org/protein/A1A5C7-1

👉 In the API, you'll get 3 different evidence objects:

https://api.glygen.org/protein/detail/A1A5C7-1

{
      "glytoucan_ac": "G57321FI",
      "type": "O-linked",
      "site_category": "reported_with_glycan",
      "site_seq": "TT",
      "relation": "attached",
      "comment": "Data provided by GlycoDomainViewer",
      "start_pos": 142,
      "start_aa": "Thr",
      "end_pos": 143,
      "end_aa": "Thr",
      "evidence": [
        {
          "id": "A1A5C7",
          "database": "UniCarbKB"
        },
        {
          "id": "23584533",
          "database": "PubMed",
          "url": "https://glygen.org/publication/PubMed/23584533"
        },
        {
          "id": "10.1038/emboj.2013.79",
          "database": "DOI",
          "url": "https://glygen.org/publication/DOI/10.1038/emboj.2013.79"
        }

🌟 One final note: The src_xref_key and src_xref_id fields refer to the source database's accession for that row. Since the EMBL dataset is provided by a lab (and not a database), we won't have these fields in the new datasets.
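
To make the row-to-evidence relationship concrete, here is a rough sketch of how rows that differ only in xref_key and xref_id could be collapsed into one site record with several evidence entries. This is only an illustration of the idea, not GlyGen's actual loader: the function name is hypothetical and the rows use a reduced, assumed set of columns with values taken from the grep output above.

```python
XREF_FIELDS = ("xref_key", "xref_id")


def collapse_evidence(rows):
    """Group rows that are identical apart from xref_key/xref_id."""
    grouped = {}
    for row in rows:
        # Everything except the xref columns identifies the site record.
        key = tuple((k, v) for k, v in row.items() if k not in XREF_FIELDS)
        if key not in grouped:
            grouped[key] = {**dict(key), "evidence": []}
        grouped[key]["evidence"].append(
            {"xref_key": row["xref_key"], "xref_id": row["xref_id"]}
        )
    return list(grouped.values())


# Reduced rows (assumed column subset) based on the A1A5C7 example.
base = {
    "uniprotkb_canonical_ac": "A1A5C7-1",
    "saccharide": "G57321FI",
    "start_pos": "142",
    "end_pos": "143",
}
rows = [
    {**base, "xref_key": "protein_xref_unicarbkb", "xref_id": "A1A5C7"},
    {**base, "xref_key": "protein_xref_pubmed", "xref_id": "23584533"},
    {**base, "xref_key": "protein_xref_doi", "xref_id": "10.1038/emboj.2013.79"},
]

sites = collapse_evidence(rows)
print(len(sites), "site with", len(sites[0]["evidence"]), "evidence entries")
```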

kmartinez834 commented 2 weeks ago

Here are my recommended changes to the drafted ticket:

Add to mapping files: glytoucan/current/export/names.tsv

Output headers should be:

"uniprotkb_canonical_ac","glycosylation_site_uniprotkb","amino_acid","saccharide","glycosylation_type","source_glycan_type","xref_key","xref_id","source_gene_name","start_pos","end_pos","start_aa","end_aa","site_seq","composition","composition_mass","source_tissue_id","source_tissue_name","source_cell_line_cellosaurus_id","source_cell_line_cellosaurus_name","n_sequon","n_sequon_type"

Removed: abundance, glycan mass, glycan_type, src_xref_id, taxonomy_id, taxonomy_species
Added: source_glycan_type, composition_mass, xref_key

Order the table the same as the output headers list. I also made a few changes to the notes/fields below:

| Source Field | Output Field | Notes |
| --- | --- | --- |
| uniprotkb_ac (or gene_name if uniprotkb_ac can't be mapped) | uniprotkb_canonical_ac | Map to canonical ac using *_protein_masterlist.csv field "uniprotkb_canonical_ac" (or use "gene_name" if uniprotkb_ac can't be mapped) |
| glycosylation_site_uniprotkb | glycosylation_site_uniprotkb | Copy directly from Source |
| amino_acid | amino_acid | Copy directly from Source |
| composition | saccharide | Map from Byonic composition string to GlyTouCan accession using names.tsv |
| glycosylation_type | glycosylation_type | Copy directly from Source |
| glycan_type | source_glycan_type | Copy directly from Source |
| | xref_key | All rows: "protein_xref_doi" |
| | xref_id | All rows: "10.1101/2023.09.13.557529v1" |
| gene_name | source_gene_name | Copy directly from Source |
| glycosylation_site_uniprotkb | start_pos | Copy directly from Source |
| glycosylation_site_uniprotkb | end_pos | Copy directly from Source |
| amino_acid | start_aa | Copy directly from Source |
| amino_acid | end_aa | Copy directly from Source |
| peptide | site_seq | Copy directly from Source |
| composition | composition | Extract comp string before " %". Ex. HexNAc(2)Hex(8) % 1702.5814 |
| glycan_mass | composition_mass | Copy directly from Source |
| source_tissue_id | source_tissue_id | Copy directly from Source |
| source_tissue | source_tissue_name | Copy directly from Source |
| source_cell_line_cellosaurus_id | source_cell_line_cellosaurus_id | Copy directly from Source |
| source_cell_line_cellosaurus_name | source_cell_line_cellosaurus_name | Copy directly from Source |
| | n_sequon | Map using *_protein_glycosylation_motifs.csv |
| | n_sequon_type | Map using misc/n_sequon_info.csv |
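
For reference, the direct-copy and constant parts of the mapping above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name and the two lookup dictionaries (built beforehand from *_protein_masterlist.csv) are hypothetical, the example row values are made up, and saccharide, n_sequon, and n_sequon_type are left out because the layout of their mapping files is not shown in this thread.

```python
XREF_KEY = "protein_xref_doi"
XREF_ID = "10.1101/2023.09.13.557529v1"


def to_output_row(source, canonical_by_ac, canonical_by_gene):
    """Build one output row from one source row (mapping-file columns omitted)."""
    # Prefer the accession lookup; fall back to gene_name if it can't be mapped.
    canonical = canonical_by_ac.get(source["uniprotkb_ac"])
    if canonical is None:
        canonical = canonical_by_gene.get(source["gene_name"], "")
    site = source["glycosylation_site_uniprotkb"]
    residue = source["amino_acid"]
    return {
        "uniprotkb_canonical_ac": canonical,
        "glycosylation_site_uniprotkb": site,
        "amino_acid": residue,
        "glycosylation_type": source["glycosylation_type"],
        "source_glycan_type": source["glycan_type"],
        "xref_key": XREF_KEY,
        "xref_id": XREF_ID,
        "source_gene_name": source["gene_name"],
        "start_pos": site,
        "end_pos": site,
        "start_aa": residue,
        "end_aa": residue,
        "site_seq": source["peptide"],
        # Keep only the part of the composition before " %".
        "composition": source["composition"].partition(" %")[0].strip(),
        "composition_mass": source["glycan_mass"],
        "source_tissue_id": source["source_tissue_id"],
        "source_tissue_name": source["source_tissue"],
        "source_cell_line_cellosaurus_id": source["source_cell_line_cellosaurus_id"],
        "source_cell_line_cellosaurus_name": source["source_cell_line_cellosaurus_name"],
    }


# Made-up source row, for illustration only.
example = {
    "uniprotkb_ac": "P00000", "gene_name": "GENE1",
    "glycosylation_site_uniprotkb": "85", "amino_acid": "Asn",
    "glycosylation_type": "N-linked", "glycan_type": "High mannose",
    "peptide": "NGTK", "composition": "HexNAc(2)Hex(8) % 1702.5814",
    "glycan_mass": "1702.5814", "source_tissue_id": "", "source_tissue": "",
    "source_cell_line_cellosaurus_id": "", "source_cell_line_cellosaurus_name": "",
}
row = to_output_row(example, {}, {"GENE1": "P00000-1"})
print(row["uniprotkb_canonical_ac"], row["xref_key"], row["composition"])
```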

Let me know if you have any questions. When you're ready, create a new ticket for Robel with the processing instructions.

kmartinez834 commented 2 weeks ago

glygener / glygen-issues

Proposed EMBL Headers #1273

Drafted Ticket:

1287