glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

Proposed EMBL Headers #1273

Closed ubhuiyan closed 2 weeks ago

ubhuiyan commented 3 weeks ago

Proposed Headers:

source_cell_line_cellosaurus_id source_tissue_id end_pos source_tissue_name xref_id glycosylation_type src_xref_id start_aa end_aa saccharide amino_acid source_cell_line_cellosaurus_name n_sequon_type uniprotkb_canonical_ac n_sequon xref_key src_xref_key site_seq start_pos glycosylation_site_uniprotkb

ubhuiyan commented 3 weeks ago

@kmartinez834 I compared the glyconnect and unicarbkb headers and pulled the common ones because I figured we wanted those. There appear to be some unique column headers depending on what type of information the source file contains. Could you check whether what I have right now is alright? Then maybe tell me how I might determine which unique headers I'd need.
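
For reference, the header comparison described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the two in-memory "files" contain a made-up subset of columns rather than the real glyconnect and unicarbkb headers.

```python
import csv
import io


def read_header(handle):
    """Return the first row of a CSV file-like object as a list of column names."""
    return next(csv.reader(handle))


def compare_headers(header_a, header_b):
    """Split two header lists into shared columns and columns unique to each."""
    set_a, set_b = set(header_a), set(header_b)
    return {
        "common": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
    }


# In-memory stand-ins for the two dataset files (illustrative column subsets only).
glyconnect = io.StringIO("uniprotkb_canonical_ac,saccharide,xref_key,xref_id\n")
unicarbkb = io.StringIO("uniprotkb_canonical_ac,saccharide,xref_key,composition\n")

result = compare_headers(read_header(glyconnect), read_header(unicarbkb))
print("common:", result["common"])
print("only in first file:", result["only_a"])
print("only in second file:", result["only_b"])
```

Running the same comparison against the source file header would show which of its columns have no counterpart in the existing output files.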

kmartinez834 commented 3 weeks ago

Looks good, there are just a couple of things missing. The following headers are present in the source file downloads/embl/current/glygen_upload.csv and need to be added (the proposed headers should match the glyconnect/unicarbkb files):

gene_name taxonomy_id taxonomy_species composition glycan_mass glycan_type

Next, can you add to this table, which shows which source file header maps to which output file header:

Drafted Ticket:

Source = glygen_upload.csv
Output = *_proteoform_glycosylation_sites_embl.csv

Mapping Files: unreviewed/human_protein_masterlist.csv and unreviewed/mouse_protein_masterlist.csv, misc/n_sequon_info.csv, unreviewed/*_protein_glycosylation_motifs.csv

The output file should have the following headers:

uniprotkb_canonical_ac, src_xref_id, xref_id, source_gene_name, composition, saccharide, glycosylation_site_uniprotkb, amino_acid, glycosylation_type, start_pos, end_pos, start_aa, end_aa, site_seq, taxonomy_id, taxonomy_species, glycan_mass, glycan_type, abundance, source_tissue_id, source_tissue_name, source_cell_line_cellosaurus_id, source_cell_line_cellosaurus_name, n_sequon, n_sequon_type, evidence

| Source Field | Output Field | Notes |
| --- | --- | --- |
| uniprotkb_ac | uniprotkb_canonical_ac | Map to canonical ac using *_protein_masterlist.csv fields "gene_name" and "uniprotkb_canonical_ac" |
| - | src_xref_id | |
| - | xref_id | |
| gene_name | source_gene_name | Copy directly from Source |
| composition | composition | Extract comp string before " %": HexNAc(2)Hex(8) % 1702.5814 |
| composition | saccharide | Robel to map GlyTouCan accession from composition string |
| glycosylation_site_uniprotkb | glycosylation_site_uniprotkb | Copy directly from Source |
| amino_acid | amino_acid | Asn for all rows |
| glycosylation_type | glycosylation_type | N-linked for all rows |
| start_pos | start_pos | Copy directly from Source |
| end_pos | end_pos | Copy directly from Source |
| peptide | start_aa | The beginning aa abbreviation in "peptide" column from Source |
| peptide | end_aa | The end aa abbreviation in "peptide" column from Source |
| peptide | site_seq | Copy directly from Source |
| taxonomy_id | taxonomy_id | For human, all rows = 9606. For mouse, all rows = 10090. |
| taxonomy_species | taxonomy_species | For human, all rows = homo sapiens. For mouse, all rows = mus musculus. |
| glycan_mass | glycan_mass | Copy directly from Source |
| glycan_type | glycan_type | Copy directly from Source |
| abundance | abundance | This column will be empty |
| source_tissue_id | source_tissue_id | Copy directly from Source |
| source_tissue | source_tissue_name | Copy directly from Source |
| source_cell_line_cellosaurus_id | source_cell_line_cellosaurus_id | Copy directly from Source |
| source_cell_line_cellosaurus_name | source_cell_line_cellosaurus_name | Copy directly from Source |
| - | n_sequon | Map using *_protein_glycosylation_motifs.csv |
| - | n_sequon_type | Map using misc/n_sequon_info.csv |
| evidence | evidence | Copy directly from Source |
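
As a small illustration of the "Extract comp string before " %"" note in the table, the split could look like the sketch below. The function name is hypothetical, and it assumes every source value uses the same " %" separator as the example string.

```python
def extract_composition(raw):
    """Return the part of a source composition value that precedes " %".

    If the separator is absent, the whole (stripped) value is returned unchanged.
    """
    composition, _separator, _rest = raw.partition(" %")
    return composition.strip()


# Example value taken from the table above.
print(extract_composition("HexNAc(2)Hex(8) % 1702.5814"))
```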

--

ubhuiyan commented 3 weeks ago

@kmartinez834 I tried to fill the table out as best I could. I'm a little confused about where I would get the xref information.

Also, I'm assuming some columns will be empty. Should we include them anyway?

kmartinez834 commented 2 weeks ago

Proteoform dataset xrefs populate the evidence badges you see on the glycan and protein detail pages:

$ grep A1A5C7 reviewed/human_proteoform_glycosylation_sites_unicarbkb.csv
"A1A5C7-1","","Thr","G57321FI","O-linked","protein_xref_unicarbkb","A1A5C7","protein_xref_unicarbkb","A1A5C7","142","143","Thr","Thr","TT","known_site","fuzzy_legacy_cph_start_end_positions","G57321FI","HexNAc1Hex0dHex0NeuAc0NeuGc0Pent0S0P0KDN0HexA0","Data provided by GlycoDomainViewer","tcl_vva_try with KO entrez:29071","","","","","","CVCL_0553","T-47D","",""
"A1A5C7-1","","Thr","G57321FI","O-linked","protein_xref_pubmed","23584533","protein_xref_unicarbkb","A1A5C7","142","143","Thr","Thr","TT","known_site","fuzzy_legacy_cph_start_end_positions","G57321FI","HexNAc1Hex0dHex0NeuAc0NeuGc0Pent0S0P0KDN0HexA0","Data provided by GlycoDomainViewer","tcl_vva_try with KO entrez:29071","","","","","","CVCL_0553","T-47D","",""
"A1A5C7-1","","Thr","G57321FI","O-linked","protein_xref_doi","10.1038/emboj.2013.79","protein_xref_unicarbkb","A1A5C7","142","143","Thr","Thr","TT","known_site","fuzzy_legacy_cph_start_end_positions","G57321FI","HexNAc1Hex0dHex0NeuAc0NeuGc0Pent0S0P0KDN0HexA0","Data provided by GlycoDomainViewer","tcl_vva_try with KO entrez:29071","","","","","","CVCL_0553","T-47D","",""

👉 In the dataset you have essentially three copies of the same row, with the only difference being the xref_key and xref_id. This translates to one row in the "Associated Protein" section of the Glycan Details page, with 3 different evidence sources:

Ex. https://glygen.org/glycan/G57321FI

👉 In the API, you'll get 3 different evidence objects:

https://api.glygen.org/glycan/detail/G57321FI

{
      "uniprot_canonical_ac": "A1A5C7-1",
      "evidence": [
        {
          "database": "UniCarbKB",
          "id": "A1A5C7"
        },
        {
          "database": "PubMed",
          "id": "23584533",
          "url": "https://glygen.org/publication/PubMed/23584533"
        },
        {
          "database": "DOI",
          "id": "10.1038/emboj.2013.79",
          "url": "https://glygen.org/publication/DOI/10.1038/emboj.2013.79"
        }
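
To make the relationship between dataset rows and evidence objects concrete, here is a minimal Python sketch of how rows that differ only in xref_key and xref_id could be collapsed into one record with an evidence list. It is an illustration of the idea, not GlyGen's actual processing code: the function name is hypothetical, and the sample rows keep only a few of the columns shown in the grep output above.

```python
def collapse_xrefs(rows):
    """Group rows that are identical except for xref_key/xref_id.

    Each group becomes one record carrying an "evidence" list with one
    entry per original row.
    """
    grouped = {}
    for row in rows:
        # Everything except the xref columns identifies the record.
        key = tuple((k, v) for k, v in row.items() if k not in ("xref_key", "xref_id"))
        grouped.setdefault(key, []).append(
            {"xref_key": row["xref_key"], "id": row["xref_id"]}
        )
    records = []
    for key, evidence in grouped.items():
        record = dict(key)
        record["evidence"] = evidence
        records.append(record)
    return records


# Reduced versions of the three A1A5C7-1 rows shown above.
rows = [
    {"uniprotkb_canonical_ac": "A1A5C7-1", "saccharide": "G57321FI",
     "xref_key": "protein_xref_unicarbkb", "xref_id": "A1A5C7"},
    {"uniprotkb_canonical_ac": "A1A5C7-1", "saccharide": "G57321FI",
     "xref_key": "protein_xref_pubmed", "xref_id": "23584533"},
    {"uniprotkb_canonical_ac": "A1A5C7-1", "saccharide": "G57321FI",
     "xref_key": "protein_xref_doi", "xref_id": "10.1038/emboj.2013.79"},
]

for record in collapse_xrefs(rows):
    print(record["uniprotkb_canonical_ac"], len(record["evidence"]), "evidence entries")
```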

👉 In the "Glycosylation" section of the Protein Details page, there are also 3 different evidence sources:

Ex. https://glygen.org/protein/A1A5C7-1

👉 In the API, you'll get 3 different evidence objects:

https://api.glygen.org/protein/detail/A1A5C7-1

{
      "glytoucan_ac": "G57321FI",
      "type": "O-linked",
      "site_category": "reported_with_glycan",
      "site_seq": "TT",
      "relation": "attached",
      "comment": "Data provided by GlycoDomainViewer",
      "start_pos": 142,
      "start_aa": "Thr",
      "end_pos": 143,
      "end_aa": "Thr",
      "evidence": [
        {
          "id": "A1A5C7",
          "database": "UniCarbKB"
        },
        {
          "id": "23584533",
          "database": "PubMed",
          "url": "https://glygen.org/publication/PubMed/23584533"
        },
        {
          "id": "10.1038/emboj.2013.79",
          "database": "DOI",
          "url": "https://glygen.org/publication/DOI/10.1038/emboj.2013.79"
        }

🌟 One final note: The src_xref_key and src_xref_id fields refer to the source database's accession for that row. Since the EMBL dataset is provided by a lab (and not a database), we won't have these fields in the new datasets.

kmartinez834 commented 2 weeks ago

Here are my recommended changes to the drafted ticket:

| Source Field | Output Field | Notes |
| --- | --- | --- |
| uniprotkb_ac (or gene_name if uniprotkb_ac can't be mapped) | uniprotkb_canonical_ac | Map to canonical ac using *_protein_masterlist.csv field "uniprotkb_canonical_ac" (or use "gene_name" if uniprotkb_ac can't be mapped) |
| glycosylation_site_uniprotkb | glycosylation_site_uniprotkb | Copy directly from Source |
| amino_acid | amino_acid | Copy directly from Source |
| composition | saccharide | Map from Byonic composition string to GlyTouCan accession using names.tsv |
| glycosylation_type | glycosylation_type | Copy directly from Source |
| glycan_type | source_glycan_type | Copy directly from Source |
| - | xref_key | All rows: "protein_xref_doi" |
| - | xref_id | All rows: "10.1101/2023.09.13.557529v1" |
| gene_name | source_gene_name | Copy directly from Source |
| glycosylation_site_uniprotkb | start_pos | Copy directly from Source |
| glycosylation_site_uniprotkb | end_pos | Copy directly from Source |
| amino_acid | start_aa | Copy directly from Source |
| amino_acid | end_aa | Copy directly from Source |
| peptide | site_seq | Copy directly from Source |
| composition | composition | Extract comp string before " %". Ex. HexNAc(2)Hex(8) % 1702.5814 |
| glycan_mass | composition_mass | Copy directly from Source |
| source_tissue_id | source_tissue_id | Copy directly from Source |
| source_tissue | source_tissue_name | Copy directly from Source |
| source_cell_line_cellosaurus_id | source_cell_line_cellosaurus_id | Copy directly from Source |
| source_cell_line_cellosaurus_name | source_cell_line_cellosaurus_name | Copy directly from Source |
| - | n_sequon | Map using *_protein_glycosylation_motifs.csv |
| - | n_sequon_type | Map using misc/n_sequon_info.csv |
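
A rough Python sketch of a few of these mappings follows, mainly to show the gene_name fallback and the constant xref values. Several things here are assumptions rather than part of the ticket: the function names are hypothetical, the accession "P00000" and gene "GENE1" are invented sample data, and matching uniprotkb_ac against the part of uniprotkb_canonical_ac before the "-" is just one plausible way to do the lookup. The saccharide, n_sequon, and n_sequon_type mappings are left out because they depend on files not shown here.

```python
XREF_KEY = "protein_xref_doi"
XREF_ID = "10.1101/2023.09.13.557529v1"


def build_lookups(masterlist_rows):
    """Index masterlist rows by base accession and by gene name."""
    by_ac, by_gene = {}, {}
    for row in masterlist_rows:
        canonical = row["uniprotkb_canonical_ac"]
        # Assumption: the canonical ac is the base accession plus "-<isoform>".
        by_ac[canonical.split("-")[0]] = canonical
        by_gene[row["gene_name"]] = canonical
    return by_ac, by_gene


def to_output_row(src, by_ac, by_gene):
    """Apply a subset of the table's mappings to one source row."""
    # Prefer the accession lookup; fall back to gene_name if it can't be mapped.
    canonical = by_ac.get(src["uniprotkb_ac"]) or by_gene.get(src["gene_name"], "")
    site = src["glycosylation_site_uniprotkb"]
    return {
        "uniprotkb_canonical_ac": canonical,
        "xref_key": XREF_KEY,
        "xref_id": XREF_ID,
        "source_gene_name": src["gene_name"],
        "start_pos": site,
        "end_pos": site,
        "start_aa": src["amino_acid"],
        "end_aa": src["amino_acid"],
        "composition": src["composition"].partition(" %")[0].strip(),
        "composition_mass": src["glycan_mass"],
    }


# Invented sample data, for illustration only.
masterlist = [{"uniprotkb_canonical_ac": "P00000-1", "gene_name": "GENE1"}]
source_row = {
    "uniprotkb_ac": "UNKNOWN",  # not in the masterlist, so gene_name is used
    "gene_name": "GENE1",
    "glycosylation_site_uniprotkb": "42",
    "amino_acid": "Asn",
    "composition": "HexNAc(2)Hex(8) % 1702.5814",
    "glycan_mass": "1702.5814",
}

by_ac, by_gene = build_lookups(masterlist)
print(to_output_row(source_row, by_ac, by_gene))
```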

Let me know if you have any questions. When you're ready, create a new ticket for Robel with the processing instructions.

kmartinez834 commented 2 weeks ago

1287