glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

UniCarbKB xref updates #1318

Closed kmartinez834 closed 5 months ago

kmartinez834 commented 5 months ago

Per Raja, UniCarbKB evidence badges should say "UniCarbKB" but point to GlyGen datasets. I updated the xref_info.csv file so there are now four unicarbkb xref ids: two are disabled so that we can display them in the "Cross References" section without a link, and two point to datasets for evidence badges.

protein_xref_unicarbkb,UniCarbKB,DISABLED,
glycan_xref_unicarbkb,UniCarbKB,DISABLED,
protein_xref_unicarbkb_ds,UniCarbKB,http://data.glygen.org/%s,GLY_000040|GLY_000041|GLY_000221
glycan_xref_unicarbkb_ds,UniCarbKB,http://data.glygen.org/%s,GLY_000040|GLY_000041|GLY_000221
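
As a rough illustration of how these four rows could drive badge rendering, here is a minimal Python sketch. The column names (`xref_key`, `label`, `url_template`, `extra`) and both helper functions are hypothetical, since the thread does not show the xref_info.csv header or the code that consumes it. The meaning of the fourth column is not stated, so the sketch ignores it.

```python
import csv
import io

# Rows copied from the comment above. The column names below are assumptions.
XREF_INFO = """\
protein_xref_unicarbkb,UniCarbKB,DISABLED,
glycan_xref_unicarbkb,UniCarbKB,DISABLED,
protein_xref_unicarbkb_ds,UniCarbKB,http://data.glygen.org/%s,GLY_000040|GLY_000041|GLY_000221
glycan_xref_unicarbkb_ds,UniCarbKB,http://data.glygen.org/%s,GLY_000040|GLY_000041|GLY_000221
"""


def load_xref_info(text):
    """Map each xref_key to its badge label and URL template (None when disabled)."""
    fields = ["xref_key", "label", "url_template", "extra"]  # hypothetical names
    info = {}
    for row in csv.DictReader(io.StringIO(text), fieldnames=fields):
        template = None if row["url_template"] == "DISABLED" else row["url_template"]
        info[row["xref_key"]] = {"label": row["label"], "url_template": template}
    return info


def resolve_xref(info, xref_key, xref_id):
    """Return (label, url). The url is None for disabled keys, so only the label shows."""
    entry = info[xref_key]
    template = entry["url_template"]
    return entry["label"], (template % xref_id if template else None)


info = load_xref_info(XREF_INFO)
# "12345" is a placeholder id used only for this example.
print(resolve_xref(info, "protein_xref_unicarbkb", "12345"))
print(resolve_xref(info, "glycan_xref_unicarbkb_ds", "GLY_000341"))
```

Under these assumptions, both keys produce the label "UniCarbKB", but only the `_ds` key yields a link, and that link points at data.glygen.org.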

Please make the following changes:

1. *_proteoform_glycosylation_sites_unicarbkb.csv and *_proteoform_glycosylation_sites_unicarbkb_glycomics_study.csv:

|species|BCO id|
|----|----|
|human|GLY_000040|
|mouse|GLY_000041|
|rat|GLY_000221|
|sarscov1|GLY_000628|
|sarscov2|GLY_000479|
|human(glycomic)|GLY_000611|
|rat(glycomic)|GLY_000733|

2. glycan_species.csv

3. glycan_species_customized_neuac_neugc.csv:

make output rows look like this ---->

$ grep G05528SJ.*UniCarbKB unreviewed/glycan_species_customized_neuac_neugc.csv
"G05528SJ","2697049","Severe acute respiratory syndrome coronavirus 2","Direct","UniCarbKB","comp_HexNAc11Hex11dHex0NeuAc1NeuGc0Pent0S0P0KDN0HexA0","comp_HexNAc11Hex11dHex0NeuAc1NeuGc0Pent0S0P0KDN0HexA0","glycan_xref_unicarbkb_ds","GLY_000341","False"


**4. \*_proteoform_glycosylation_sites_literature.csv:**
- For rows where `"xref_key" == "protein_xref_unicarbkb"`, make `"xref_key" = "protein_xref_unicarbkb_ds"` and replace `"xref_id"` with the BCO id

|species|BCO id|
|----|----|
|human|GLY_000143|
|hcv1a|GLY_000335|
|sarscov1|GLY_000612|

make output rows look like this ---->

$ awk -F, '{print $6,$7,$8,$9}' unreviewed/hcv1a_proteoform_glycosylation_sites_literature.csv | head -3
"xref_key" "xref_id" "src_xref_key" "src_xref_id"
"protein_xref_pubmed" "18187336" "protein_xref_glygen_ds" "GLY_000335"
"protein_xref_unicarbkb_ds" "GLY_000335" "protein_xref_glygen_ds" "GLY_000335"



- In sarscov1_proteoform_glycosylation_sites_literature.csv, replace all instances of "GLY_000510" with "GLY_000612" (GLY_000510 is not a valid BCO id)
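
The row rewrite requested in item 4 could be sketched roughly as follows in Python. This is only an illustration of the rule stated above, not the pipeline's actual code: the function name and the in-memory sample are made up, the sample carries only the four columns visible in the awk output, and `"1234"` stands in for whatever UniCarbKB id the row originally held.

```python
import csv
import io

# BCO ids from the table above, keyed by the species prefix of the file name.
LITERATURE_BCO = {"human": "GLY_000143", "hcv1a": "GLY_000335", "sarscov1": "GLY_000612"}


def rewrite_unicarbkb_xrefs(rows, bco_id):
    """Yield rows with UniCarbKB xrefs repointed to the dataset's BCO id."""
    for row in rows:
        row = dict(row)  # copy so the caller's rows are left untouched
        if row["xref_key"] == "protein_xref_unicarbkb":
            row["xref_key"] = "protein_xref_unicarbkb_ds"
            row["xref_id"] = bco_id
        yield row


# Made-up sample input; "1234" is a placeholder for the original xref_id.
sample = io.StringIO(
    '"xref_key","xref_id","src_xref_key","src_xref_id"\n'
    '"protein_xref_pubmed","18187336","protein_xref_glygen_ds","GLY_000335"\n'
    '"protein_xref_unicarbkb","1234","protein_xref_glygen_ds","GLY_000335"\n'
)
fixed = list(rewrite_unicarbkb_xrefs(csv.DictReader(sample), LITERATURE_BCO["hcv1a"]))
for row in fixed:
    print(row)
```

Rows with any other `xref_key`, such as the PubMed row, pass through unchanged.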

**5. Reprocess all citations datasets for the above files**

kmartinez834 commented 5 months ago

@rykahsay I have finished documenting all of the UniCarbKB xref changes above

rykahsay commented 5 months ago

Why are you not using generated/misc/ds2bco.json to hold the mapping between dataset filename and BCO id? Please rewrite the instructions and make sure you have placed the mappings in the ds2bco.json file.

$ cat generated/misc/ds2bco.json 
{
   "human_protein_biomarkers":"GLY_000625",
   "human_proteoform_glycosylation_sites_embl":"GLY_000888",
   "mouse_proteoform_glycosylation_sites_embl":"GLY_000889",
   "human_proteoform_glycosylation_sites_diabetes_glycomic":"GLY_000960",
   "human_proteoform_glycosylation_sites_literature_mining":"GLY_000481",
   "mouse_proteoform_glycosylation_sites_literature_mining":"GLY_000492",
   "rat_proteoform_glycosylation_sites_literature_mining":"GLY_000493",
   "human_proteoform_glycosylation_sites_literature_mining_manually_verified":"GLY_000481",
   "mouse_proteoform_glycosylation_sites_literature_mining_manually_verified":"GLY_000492",
   "rat_proteoform_glycosylation_sites_literature_mining_manually_verified":"GLY_000493",
   "human_proteoform_glycosylation_sites_literature":"GLY_000143",
   "hcv1a_proteoform_glycosylation_sites_literature":"GLY_000335",
   "sarscov1_proteoform_glycosylation_sites_literature":"GLY_000510",
   "human_proteoform_glycosylation_sites_o_gluc":"GLY_000716",
   "glycan_biomarkers":"GLY_000737",
   "glycan_citations_ncfg":"GLY_000538",
   "glycan_citations_glytoucan":"GLY_000285",
   "glycan_synthesized":"GLY_000309"
}

kmartinez834 commented 5 months ago

We can't use ds2bco.json for UniCarbKB because Raja wants the evidence badge to say "UniCarbKB"

If we use protein_xref_glygen_ds/glycan_xref_glygen_ds as the xref_key in these datasets, the badge will say "GlyGen"

rykahsay commented 5 months ago

I want all dataset-2-bco mappings in ds2bco.json, please put them there.

I will make sure it will not say "protein_xref_glygen_ds/glycan_xref_glygen_ds" for UniCarbKB datasets

kmartinez834 commented 5 months ago

Ok great - I have updated ds2bco.json

rykahsay commented 5 months ago

I have modified these datasets now -- please check:

- glycan_species.csv
- glycan_species_customized_neuac_neugc.csv
- glycan_citations_glytoucan.csv

- human_proteoform_glycosylation_sites_unicarbkb.csv
- mouse_proteoform_glycosylation_sites_unicarbkb.csv
- rat_proteoform_glycosylation_sites_unicarbkb.csv
- sarscov1_proteoform_glycosylation_sites_unicarbkb.csv
- sarscov2_proteoform_glycosylation_sites_unicarbkb.csv
- human_proteoform_citations_glycosylation_sites_unicarbkb.csv
- mouse_proteoform_citations_glycosylation_sites_unicarbkb.csv
- rat_proteoform_citations_glycosylation_sites_unicarbkb.csv
- sarscov1_proteoform_citations_glycosylation_sites_unicarbkb.csv
- sarscov2_proteoform_citations_glycosylation_sites_unicarbkb.csv

- human_proteoform_glycosylation_sites_unicarbkb_glycomics_study.csv
- rat_proteoform_glycosylation_sites_unicarbkb_glycomics_study.csv
- human_proteoform_citations_glycosylation_sites_unicarbkb_glycomics_study.csv
- rat_proteoform_citations_glycosylation_sites_unicarbkb_glycomics_study.csv

- human_proteoform_glycosylation_sites_literature.csv
- hcv1a_proteoform_glycosylation_sites_literature.csv
- sarscov1_proteoform_glycosylation_sites_literature.csv

- hcv1a_proteoform_citations_glycosylation_sites_literature.csv
- human_proteoform_citations_glycosylation_sites_literature.csv
- sarscov1_proteoform_citations_glycosylation_sites_literature.csv

kmartinez834 commented 5 months ago

All of the datasets look good. The following were missing from ds2bco.json, so I just added them: "glycan_species":"GLY_000341", "glycan_species_customized_neuac_neugc":"GLY_000341"
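
A check like this one can be scripted. The Python sketch below assumes that ds2bco.json keys are dataset file names without the `.csv` extension, which matches the excerpt shown earlier in the thread. The helper name is hypothetical, and the mapping here is a small inline excerpt rather than the real file.

```python
import json


def missing_from_ds2bco(dataset_files, ds2bco):
    """Return dataset names (file name minus .csv) that have no entry in the mapping."""
    names = [f[:-4] if f.endswith(".csv") else f for f in dataset_files]
    return sorted(n for n in names if n not in ds2bco)


# Inline excerpt of the mapping. A real run would json.load() generated/misc/ds2bco.json.
ds2bco = json.loads("""{
   "glycan_citations_glytoucan":"GLY_000285",
   "hcv1a_proteoform_glycosylation_sites_literature":"GLY_000335"
}""")
files = [
    "glycan_species.csv",
    "glycan_species_customized_neuac_neugc.csv",
    "glycan_citations_glytoucan.csv",
    "hcv1a_proteoform_glycosylation_sites_literature.csv",
]
missing = missing_from_ds2bco(files, ds2bco)
print(missing)
```

With this excerpt, the two `glycan_species` datasets are reported as missing, which corresponds to the two entries added above.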