kmartinez834 commented 5 months ago

Per Raja, UniCarbKB evidence badges should say "UniCarbKB" but point to GlyGen datasets. I updated the xref_info.csv file so there are now four UniCarbKB xref IDs: two are disabled so we can display them in the "Cross References" section without a link, and two point to datasets for evidence badges.

protein_xref_unicarbkb,UniCarbKB,DISABLED,
glycan_xref_unicarbkb,UniCarbKB,DISABLED,
protein_xref_unicarbkb_ds,UniCarbKB,http://data.glygen.org/%s,GLY_000040|GLY_000041|GLY_000221
glycan_xref_unicarbkb_ds,UniCarbKB,http://data.glygen.org/%s,GLY_000040|GLY_000041|GLY_000221
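As a rough illustration of how these four rows could be consumed, the sketch below parses them and resolves a badge link. The column meanings (xref key, badge label, URL template, pipe-separated dataset ids) are inferred from the rows themselves and are an assumption; `load_xref_info` and `badge_link` are hypothetical helpers, not part of the GlyGen code.

```python
import csv
import io

# The four rows quoted above. Column meanings are assumed, not confirmed.
XREF_INFO = """\
protein_xref_unicarbkb,UniCarbKB,DISABLED,
glycan_xref_unicarbkb,UniCarbKB,DISABLED,
protein_xref_unicarbkb_ds,UniCarbKB,http://data.glygen.org/%s,GLY_000040|GLY_000041|GLY_000221
glycan_xref_unicarbkb_ds,UniCarbKB,http://data.glygen.org/%s,GLY_000040|GLY_000041|GLY_000221
"""


def load_xref_info(text):
    """Map each xref key to its (badge label, URL template) pair."""
    info = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        key, label, template = row[0], row[1], row[2]
        info[key] = (label, template)
    return info


def badge_link(info, xref_key, xref_id):
    """Return (label, url); url is None when the entry is marked DISABLED."""
    label, template = info[xref_key]
    if template == "DISABLED":
        return label, None
    return label, template % xref_id
```

Under these assumptions, both the disabled keys and the `_ds` keys yield the label "UniCarbKB", but only the `_ds` keys produce a link into data.glygen.org.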

Please make the following changes:

1. *_proteoform_glycosylation_sites_unicarbkb.csv and *_proteoform_glycosylation_sites_unicarbkb_glycomics_study.csv:

Remove rows where both "xref_key" and "src_xref_key" are "protein_xref_unicarbkb"

$ head -4 unreviewed/rat_proteoform_glycosylation_sites_unicarbkb.csv
"uniprotkb_canonical_ac","glycosylation_site_uniprotkb","amino_acid","saccharide","glycosylation_type","xref_key","xref_id","src_xref_key","src_xref_id","start_pos","end_pos","start_aa","end_aa","site_seq","site_type","src_file_name","uckb_id","composition","curation_notes","additional_notes","pdb","swiss_model","abundance","source_tissue_id","source_tissue_name","source_cell_line_cellosaurus_id","source_cell_line_cellosaurus_name","n_sequon","n_sequon_type"
"P01830-1","42","Asn","G31852PQ","N-linked","protein_xref_unicarbkb","P01830","protein_xref_unicarbkb","P01830","42","42","Asn","Asn","N","known_site","known_34106099_glygen","G31852PQ","HexNAc2Hex7dHex0NeuAc0NeuGc0Pent0S0P0KDN0HexA0","","40","","","","OMIT:0014437","Synaptosomes","","","NNT","NXT"
"P01830-1","42","Asn","G31852PQ","N-linked","protein_xref_pubmed","34106099","protein_xref_unicarbkb","P01830","42","42","Asn","Asn","N","known_site","known_34106099_glygen","G31852PQ","HexNAc2Hex7dHex0NeuAc0NeuGc0Pent0S0P0KDN0HexA0","","40","","","","OMIT:0014437","Synaptosomes","","","NNT","NXT"
"P01830-1","42","Asn","G31852PQ","N-linked","protein_xref_doi","10.1039/D0MO00044B","protein_xref_unicarbkb","P01830","42","42","Asn","Asn","N","known_site","known_34106099_glygen","G31852PQ","HexNAc2Hex7dHex0NeuAc0NeuGc0Pent0S0P0KDN0HexA0","","40","","","","OMIT:0014437","Synaptosomes","","","NNT","NXT"

Change "src_xref_key" from "protein_xref_unicarbkb" to "protein_xref_unicarbkb_ds"
Replace "src_xref_id" with species-specific GlyGen dataset:

|species|BCO id|
|----|----|
|human|GLY_000040|
|mouse|GLY_000041|
|rat|GLY_000221|
|sarscov1|GLY_000628|
|sarscov2|GLY_000479|
|human(glycomic)|GLY_000611|
|rat(glycomic)|GLY_000733|
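The three changes in step 1 could be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: `update_unicarbkb_rows` is a hypothetical helper, rows are assumed to be dicts keyed by the CSV header, the "(glycomic)" entries are assumed to apply to the *_glycomics_study.csv files, and "src_xref_id" is assumed to be replaced only on rows whose "src_xref_key" was "protein_xref_unicarbkb".

```python
# BCO ids from the table above, keyed by the species prefix of the file name.
SITE_BCO = {
    "human": "GLY_000040",
    "mouse": "GLY_000041",
    "rat": "GLY_000221",
    "sarscov1": "GLY_000628",
    "sarscov2": "GLY_000479",
}
# Assumed to apply to *_proteoform_glycosylation_sites_unicarbkb_glycomics_study.csv.
GLYCOMIC_BCO = {"human": "GLY_000611", "rat": "GLY_000733"}

OLD_KEY = "protein_xref_unicarbkb"
NEW_KEY = "protein_xref_unicarbkb_ds"


def update_unicarbkb_rows(rows, species, glycomics_study=False):
    """Apply the step 1 changes to a list of row dicts and return new rows."""
    bco_id = (GLYCOMIC_BCO if glycomics_study else SITE_BCO)[species]
    updated_rows = []
    for row in rows:
        if row["src_xref_key"] != OLD_KEY:
            updated_rows.append(dict(row))  # not a UniCarbKB-sourced row
            continue
        if row["xref_key"] == OLD_KEY:
            continue  # both keys are protein_xref_unicarbkb: remove the row
        updated = dict(row)
        updated["src_xref_key"] = NEW_KEY
        updated["src_xref_id"] = bco_id
        updated_rows.append(updated)
    return updated_rows
```

Applied to the three rat rows shown earlier, this would drop the first row and leave the pubmed and doi rows with "src_xref_key" set to "protein_xref_unicarbkb_ds" and "src_xref_id" set to "GLY_000221".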

2. glycan_species.csv

For rows where "source" == "UniCarbKB", make "src_key" = "glycan_xref_unicarbkb_ds"

3. glycan_species_customized_neuac_neugc.csv:

Populate rows with "xref_key" and "xref_id" from source file


# if input row looks like this ---->
$ grep -i G05528SJ.*unicarbkb /data/projects/glygen/generated/datasets/unreviewed/glycan_species.csv
"G05528SJ","2697049","Severe acute respiratory syndrome coronavirus 2","Direct","UniCarbKB","comp_HexNAc11Hex11dHex0NeuAc1NeuGc0Pent0S0P0KDN0HexA0","comp_HexNAc11Hex11dHex0NeuAc1NeuGc0Pent0S0P0KDN0HexA0","glycan_xref_unicarbkb_ds","GLY_000341","False"

make output rows look like this ---->

$ grep G05528SJ.*UniCarbKB unreviewed/glycan_species_customized_neuac_neugc.csv
"G05528SJ","2697049","Severe acute respiratory syndrome coronavirus 2","Direct","UniCarbKB","comp_HexNAc11Hex11dHex0NeuAc1NeuGc0Pent0S0P0KDN0HexA0","comp_HexNAc11Hex11dHex0NeuAc1NeuGc0Pent0S0P0KDN0HexA0","glycan_xref_unicarbkb_ds","GLY_000341","False"


**4. \*_proteoform_glycosylation_sites_literature.csv:**
- For rows where `"xref_key" == "protein_xref_unicarbkb"`, make `"xref_key" = "protein_xref_unicarbkb_ds"` and replace `"xref_id"` with BCO id

|species|BCO id|
|----|----|
|human|GLY_000143|
|hcv1a|GLY_000335|
|sarscov1|GLY_000612|

make output rows look like this ---->

$ awk -F, '{print $6,$7,$8,$9}' unreviewed/hcv1a_proteoform_glycosylation_sites_literature.csv | head -3
"xref_key" "xref_id" "src_xref_key" "src_xref_id"
"protein_xref_pubmed" "18187336" "protein_xref_glygen_ds" "GLY_000335"
"protein_xref_unicarbkb_ds" "GLY_000335" "protein_xref_glygen_ds" "GLY_000335"



- In sarscov1_proteoform_glycosylation_sites_literature.csv, replace all instances of "GLY_000510" with "GLY_000612" (GLY_000510 is not a valid BCO id)
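A minimal sketch of the two step 4 rules, assuming rows are dicts keyed by the CSV header. `update_literature_rows` is a hypothetical helper, and restricting the GLY_000510 replacement to the sarscov1 file follows the bullet above.

```python
# BCO ids from the step 4 table, keyed by the species prefix of the file name.
LITERATURE_BCO = {
    "human": "GLY_000143",
    "hcv1a": "GLY_000335",
    "sarscov1": "GLY_000612",
}


def update_literature_rows(rows, species):
    """Apply the step 4 changes to a list of row dicts and return new rows."""
    bco_id = LITERATURE_BCO[species]
    updated_rows = []
    for row in rows:
        updated = dict(row)
        if updated["xref_key"] == "protein_xref_unicarbkb":
            updated["xref_key"] = "protein_xref_unicarbkb_ds"
            updated["xref_id"] = bco_id
        if species == "sarscov1":
            # GLY_000510 is not a valid BCO id; replace it in every column.
            for field, value in updated.items():
                if value == "GLY_000510":
                    updated[field] = "GLY_000612"
        updated_rows.append(updated)
    return updated_rows
```

Rows with other "xref_key" values, such as "protein_xref_pubmed", pass through unchanged apart from the sarscov1 id correction.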

**5. Reprocess all citations datasets for the above files**

kmartinez834 commented 5 months ago

@rykahsay I have finished documenting all of the UniCarbKB xref changes above

rykahsay commented 5 months ago

Why are you not using generated/misc/ds2bco.json to put the mapping between dataset filename and BCOID? Please re-write the instructions and make sure you have placed the mappings in ds2bco.json file.

$ cat generated/misc/ds2bco.json 
{
   "human_protein_biomarkers":"GLY_000625",
   "human_proteoform_glycosylation_sites_embl":"GLY_000888",
   "mouse_proteoform_glycosylation_sites_embl":"GLY_000889",
   "human_proteoform_glycosylation_sites_diabetes_glycomic":"GLY_000960",
   "human_proteoform_glycosylation_sites_literature_mining":"GLY_000481",
   "mouse_proteoform_glycosylation_sites_literature_mining":"GLY_000492",
   "rat_proteoform_glycosylation_sites_literature_mining":"GLY_000493",
   "human_proteoform_glycosylation_sites_literature_mining_manually_verified":"GLY_000481",
   "mouse_proteoform_glycosylation_sites_literature_mining_manually_verified":"GLY_000492",
   "rat_proteoform_glycosylation_sites_literature_mining_manually_verified":"GLY_000493",
   "human_proteoform_glycosylation_sites_literature":"GLY_000143",
   "hcv1a_proteoform_glycosylation_sites_literature":"GLY_000335",
   "sarscov1_proteoform_glycosylation_sites_literature":"GLY_000510",
   "human_proteoform_glycosylation_sites_o_gluc":"GLY_000716",
   "glycan_biomarkers":"GLY_000737",
   "glycan_citations_ncfg":"GLY_000538",
   "glycan_citations_glytoucan":"GLY_000285",
   "glycan_synthesized":"GLY_000309"
}
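For illustration, a lookup against this file might look like the sketch below. `bco_id_for` is a hypothetical helper, and the assumption that the JSON keys are dataset file names without the ".csv" extension is inferred from the entries shown above.

```python
import json
import os


def bco_id_for(dataset_file, ds2bco):
    """Return the BCO id mapped to a dataset file, or None if it is unmapped."""
    name = os.path.basename(dataset_file)
    # Keys in ds2bco.json appear to be file names without the .csv extension.
    stem = name[: -len(".csv")] if name.endswith(".csv") else name
    return ds2bco.get(stem)


# Two entries copied from the listing above, used here as a small excerpt.
DS2BCO_EXCERPT = json.loads(
    '{"hcv1a_proteoform_glycosylation_sites_literature": "GLY_000335",'
    ' "glycan_synthesized": "GLY_000309"}'
)
```

With the full file loaded via `json.load`, a `None` result from `bco_id_for` would flag a dataset that still needs an entry in ds2bco.json.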

kmartinez834 commented 5 months ago

We can't use ds2bco.json for UniCarbKB because Raja wants the evidence badge to say "UniCarbKB"

If we use protein_xref_glygen_ds/glycan_xref_glygen_ds as the xref_key in these datasets, the badge will say "GlyGen"

rykahsay commented 5 months ago

I want all dataset-2-bco mappings in ds2bco.json, please put them there.

I will make sure it will not say "protein_xref_glygen_ds/glycan_xref_glygen_ds" for UniCarbKB datasets

kmartinez834 commented 5 months ago

Ok great - I have updated ds2bco.json

rykahsay commented 5 months ago

I have modified these datasets now -- please check:

glycan_species.csv glycan_species_customized_neuac_neugc.csv glycan_citations_glytoucan.csv

human_proteoform_glycosylation_sites_unicarbkb.csv mouse_proteoform_glycosylation_sites_unicarbkb.csv rat_proteoform_glycosylation_sites_unicarbkb.csv sarscov1_proteoform_glycosylation_sites_unicarbkb.csv sarscov2_proteoform_glycosylation_sites_unicarbkb.csv human_proteoform_citations_glycosylation_sites_unicarbkb.csv mouse_proteoform_citations_glycosylation_sites_unicarbkb.csv rat_proteoform_citations_glycosylation_sites_unicarbkb.csv sarscov1_proteoform_citations_glycosylation_sites_unicarbkb.csv sarscov2_proteoform_citations_glycosylation_sites_unicarbkb.csv

human_proteoform_glycosylation_sites_unicarbkb_glycomics_study.csv rat_proteoform_glycosylation_sites_unicarbkb_glycomics_study.csv human_proteoform_citations_glycosylation_sites_unicarbkb_glycomics_study.csv rat_proteoform_citations_glycosylation_sites_unicarbkb_glycomics_study.csv

human_proteoform_glycosylation_sites_literature.csv hcv1a_proteoform_glycosylation_sites_literature.csv sarscov1_proteoform_glycosylation_sites_literature.csv

hcv1a_proteoform_citations_glycosylation_sites_literature.csv human_proteoform_citations_glycosylation_sites_literature.csv sarscov1_proteoform_citations_glycosylation_sites_literature.csv

kmartinez834 commented 5 months ago

All of the datasets look good. The following were missing from ds2bco.json, so I just added them: "glycan_species":"GLY_000341", "glycan_species_customized_neuac_neugc":"GLY_000341"

glygener / glygen-issues

UniCarbKB xref updates #1318
