glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

New dataset: human_proteoform_glycosylation_sites_embl.csv and mouse_proteoform_glycosylation_sites_embl.csv #1287

Open ubhuiyan opened 2 weeks ago

ubhuiyan commented 2 weeks ago

Source = glygen_upload.csv
Output = *_proteoform_glycosylation_sites_embl.csv

Mapping Files:
unreviewed/human_protein_masterlist.csv and unreviewed/mouse_protein_masterlist.csv
misc/n_sequon_info.csv
unreviewed/*_protein_glycosylation_motifs.csv
glytoucan/current/export/names.tsv

"uniprotkb_canonical_ac","glycosylation_site_uniprotkb","amino_acid","saccharide","glycosylation_type","source_glycan_type","xref_key","xref_id","source_gene_name","start_pos","end_pos","start_aa","end_aa","site_seq","composition","composition_mass","source_tissue_id","source_tissue_name","source_cell_line_cellosaurus_id","source_cell_line_cellosaurus_name","n_sequon","n_sequon_type"

Output Files:
human_proteoform_glycosylation_sites_embl.csv
mouse_proteoform_glycosylation_sites_embl.csv

The output file should have the following headers:

| Source Field | Output Field | Notes |
| --- | --- | --- |
| uniprotkb_ac (or gene_name if uniprotkb_ac can't be mapped) | uniprotkb_canonical_ac | Map to canonical ac using *_protein_masterlist.csv field "uniprotkb_canonical_ac" (or use "gene_name" if uniprotkb_ac can't be mapped) |
| glycosylation_site_uniprotkb | glycosylation_site_uniprotkb | Copy directly from Source |
| amino_acid | amino_acid | Copy directly from Source |
| composition | saccharide | Map from Byonic composition string to GlyTouCan accession using names.tsv |
| glycosylation_type | glycosylation_type | Copy directly from Source |
| glycan_type | source_glycan_type | Copy directly from Source |
| | xref_key | All rows: "protein_xref_doi" |
| | xref_id | All rows: "10.1101/2023.09.13.557529v1" |
| gene_name | source_gene_name | Copy directly from Source |
| glycosylation_site_uniprotkb | start_pos | Copy directly from Source |
| glycosylation_site_uniprotkb | end_pos | Copy directly from Source |
| amino_acid | start_aa | Copy directly from Source |
| amino_acid | end_aa | Copy directly from Source |
| peptide | site_seq | Copy directly from Source |
| composition | composition | Extract comp string before " %". Ex. HexNAc(2)Hex(8) % 1702.5814 |
| glycan_mass | composition_mass | Copy directly from Source |
| source_tissue_id | source_tissue_id | Copy directly from Source |
| source_tissue | source_tissue_name | Copy directly from Source |
| source_cell_line_cellosaurus_id | source_cell_line_cellosaurus_id | Copy directly from Source |
| source_cell_line_cellosaurus_name | source_cell_line_cellosaurus_name | Copy directly from Source |
| | n_sequon | Map using *_protein_glycosylation_motifs.csv |
| | n_sequon_type | Map using misc/n_sequon_info.csv |
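
The field mapping above could be sketched in Python roughly as follows. This is only an illustration, not the actual GlyGen pipeline code. The function names and the three lookup dictionaries (`ac_to_canon`, `gene_to_canon`, `comp_to_glytoucan`) are hypothetical. It assumes they have already been loaded from the masterlist and names.tsv, that the gene_name fallback is a lookup by gene name, and that the names.tsv lookup is keyed by the raw Byonic string as it appears in the source. The n_sequon fields are left empty because their mapping files are not described here.

```python
OUTPUT_FIELDS = [
    "uniprotkb_canonical_ac", "glycosylation_site_uniprotkb", "amino_acid",
    "saccharide", "glycosylation_type", "source_glycan_type", "xref_key",
    "xref_id", "source_gene_name", "start_pos", "end_pos", "start_aa",
    "end_aa", "site_seq", "composition", "composition_mass",
    "source_tissue_id", "source_tissue_name",
    "source_cell_line_cellosaurus_id", "source_cell_line_cellosaurus_name",
    "n_sequon", "n_sequon_type",
]


def extract_composition(raw):
    """Return the composition string that precedes ' %'."""
    return raw.split(" %", 1)[0].strip()


def transform_row(src, ac_to_canon, gene_to_canon, comp_to_glytoucan):
    """Build one output row (a dict) from one glygen_upload.csv row (a dict).

    The three lookup dicts are hypothetical stand-ins for the mapping files.
    """
    # Try the accession first, then fall back to the gene name (assumption).
    canon = ac_to_canon.get(src["uniprotkb_ac"]) or gene_to_canon.get(src["gene_name"], "")
    site = src["glycosylation_site_uniprotkb"]
    aa = src["amino_acid"]
    return {
        "uniprotkb_canonical_ac": canon,
        "glycosylation_site_uniprotkb": site,
        "amino_acid": aa,
        # Assumes names.tsv is keyed by the raw Byonic string.
        "saccharide": comp_to_glytoucan.get(src["composition"], ""),
        "glycosylation_type": src["glycosylation_type"],
        "source_glycan_type": src["glycan_type"],
        "xref_key": "protein_xref_doi",
        "xref_id": "10.1101/2023.09.13.557529v1",
        "source_gene_name": src["gene_name"],
        # Per the table, start/end come from the site, not the peptide range.
        "start_pos": site,
        "end_pos": site,
        "start_aa": aa,
        "end_aa": aa,
        "site_seq": src["peptide"],
        "composition": extract_composition(src["composition"]),
        "composition_mass": src["glycan_mass"],
        "source_tissue_id": src["source_tissue_id"],
        "source_tissue_name": src["source_tissue"],
        "source_cell_line_cellosaurus_id": src["source_cell_line_cellosaurus_id"],
        "source_cell_line_cellosaurus_name": src["source_cell_line_cellosaurus_name"],
        "n_sequon": "",       # to be filled from *_protein_glycosylation_motifs.csv
        "n_sequon_type": "",  # to be filled from misc/n_sequon_info.csv
    }
```

Rows read with `csv.DictReader` from glygen_upload.csv can be passed to `transform_row` one at a time. The returned dict keeps its keys in the same order as `OUTPUT_FIELDS`.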

Example:

Input File:

$ head -2 /data/projects/glygen/downloads/embl/current/glygen_upload.csv
"uniprotkb_ac","gene_name","glycosylation_site_uniprotkb","amino_acid","glycosylation_type","start_pos","end_pos","peptide","taxonomy_id","taxonomy_species","composition","glycan_mass","glycan_type","abundance","source_tissue_id","source_tissue","source_cell_line_cellosaurus_id","source_cell_line_cellosaurus_name","evidence"
"Q5JTV8","TOR1AIP1","399","Asn","N-linked","397","404","HLNSSHPR","9606","homo sapiens","HexNAc(2)Hex(8) % 1702.5814","1702.5814","high mannose","","UBERON:0002113","kidney","CVCL_0063","HEK293T","https://www.biorxiv.org/content/10.1101/2023.09.13.557529v1.full"

Output File:

"uniprotkb_canonical_ac","glycosylation_site_uniprotkb","amino_acid","saccharide","glycosylation_type","source_glycan_type","xref_key","xref_id","source_gene_name","start_pos","end_pos","start_aa","end_aa","site_seq","composition","composition_mass","source_tissue_id","source_tissue_name","source_cell_line_cellosaurus_id","source_cell_line_cellosaurus_name","n_sequon","n_sequon_type"
"Q5JTV8-1", "399", "Asn", "G62765YT", "N-linked", "high mannose", "protein_xref_doi", "10.1101/2023.09.13.557529v1", "TOR1AIP1", "397", "397", "Asn", "Asn", "HLNSSHPR", "HexNAc(2)Hex(8)", "1702.5814", "UBERON:0002113", "kidney", "CVCL_0063", "HEK293T", "NSS", "NXS"
rykahsay commented 2 weeks ago

@kmartinez834 Mapping glycosylation reported at the composition level to a GlyTouCan accession seems problematic to me. Many GlyTouCan accessions share the same composition, so propagating the information this way is ambiguous.

We have similar data in unreviewed/human_proteoform_glycosylation_sites_unicarbkb.csv where saccharide="" and composition has value, and I don't think we are passing this glycosylation record to any glytoucan. Please look into this carefully.

$ cat unreviewed/human_proteoform_glycosylation_sites_unicarbkb.csv | awk -F"\",\"" '{print $4","$18}'  |grep ^, |sort -u |head
,
,comp_HexNAc0Hex0dHex0NeuAc0NeuGc0Pent0S0P0KDN0HexA0
,comp_HexNAc1HexdHex0NeuAc1NeuGc0Pent0S0P0KDN0HexA0
,comp_HexNAc2Hex3dHex0NeuAc0Gc0Pent0S0P0KDN0HexA0
,comp_HexNAc2Hex3dHex1NeuAc0Gc0Pent0S0P0KDN0HexA0
,comp_HexNAc2Hex4dHex1NeuAc0Gc0Pent0S0P0KDN0HexA0
,comp_HexNAc2Hex5dHex0NeuAc0Gc0Pent0S0P0KDN0HexA0
,comp_HexNAc2Hex5dHex1NeuAc0Gc0Pent0S0P0KDN0HexA0
,comp_HexNAc2Hex6dHex0NeuAc0Gc0Pent0S0P0KDN0HexA0
,comp_HexNAc2Hex7dHex0NeuAc0Gc0Pent0S0P0KDN0HexA0
kmartinez834 commented 2 weeks ago

@rykahsay I can confirm that the GlyTouCan accession to Byonic name mapping is 1:1 in names.tsv
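
A check of this kind could look like the sketch below. It is not the check that was actually run. The helper `one_to_many` is hypothetical, and it assumes the names.tsv rows have already been reduced to (accession, Byonic name) pairs, since the file's column layout is not shown in this ticket.

```python
from collections import defaultdict


def one_to_many(pairs):
    """Return every key that is paired with more than one distinct value.

    `pairs` is an iterable of (key, value) tuples.
    """
    values_by_key = defaultdict(set)
    for key, value in pairs:
        values_by_key[key].add(value)
    return {k: sorted(v) for k, v in values_by_key.items() if len(v) > 1}


def is_one_to_one(pairs):
    """True if no accession has two names and no name has two accessions."""
    pairs = list(pairs)
    swapped = [(name, accession) for accession, name in pairs]
    return not one_to_many(pairs) and not one_to_many(swapped)
```

If `is_one_to_one` returns False, calling `one_to_many` on the swapped pairs lists the Byonic names that resolve to more than one accession, which is the case raised above.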

kmartinez834 commented 1 week ago

Update the "src_xref_key" and "src_xref_id" values for all rows:

human_proteoform_glycosylation_sites_embl.csv:

"src_xref_key","src_xref_id"
"protein_xref_glygen_ds","GLY_000888"

mouse_proteoform_glycosylation_sites_embl.csv:

"src_xref_key","src_xref_id"
"protein_xref_glygen_ds","GLY_000889"
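
As an illustration only, the update could be applied along these lines. The function names are hypothetical, and the sketch assumes each row is a dict (for example from `csv.DictReader`) and that the output should keep every field quoted, as in the example rows above.

```python
import csv

# Dataset IDs taken from this ticket, keyed by species prefix of the file name.
DATASET_IDS = {"human": "GLY_000888", "mouse": "GLY_000889"}


def add_src_xref(rows, species):
    """Return copies of `rows` with src_xref_key/src_xref_id set on every row."""
    ds_id = DATASET_IDS[species]
    updated = []
    for row in rows:
        new_row = dict(row)  # copy so the caller's rows are not modified
        new_row["src_xref_key"] = "protein_xref_glygen_ds"
        new_row["src_xref_id"] = ds_id
        updated.append(new_row)
    return updated


def write_quoted_csv(rows, fieldnames, handle):
    """Write rows to an open text handle with every field double-quoted."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)
```

When writing to a real file, open it with `newline=""` so the csv module controls line endings.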
rykahsay commented 1 week ago

Done. Check the datasets for now; the change will propagate after I build and push the JSON objects to tst.