kmartinez834 closed 5 months ago
@rykahsay I have finished documenting all of the UniCarbKB xref changes above
Why are you not using generated/misc/ds2bco.json to hold the mapping between dataset filename and BCO ID? Please rewrite the instructions and make sure you have placed the mappings in the ds2bco.json file.
$ cat generated/misc/ds2bco.json
{
"human_protein_biomarkers":"GLY_000625",
"human_proteoform_glycosylation_sites_embl":"GLY_000888",
"mouse_proteoform_glycosylation_sites_embl":"GLY_000889",
"human_proteoform_glycosylation_sites_diabetes_glycomic":"GLY_000960",
"human_proteoform_glycosylation_sites_literature_mining":"GLY_000481",
"mouse_proteoform_glycosylation_sites_literature_mining":"GLY_000492",
"rat_proteoform_glycosylation_sites_literature_mining":"GLY_000493",
"human_proteoform_glycosylation_sites_literature_mining_manually_verified":"GLY_000481",
"mouse_proteoform_glycosylation_sites_literature_mining_manually_verified":"GLY_000492",
"rat_proteoform_glycosylation_sites_literature_mining_manually_verified":"GLY_000493",
"human_proteoform_glycosylation_sites_literature":"GLY_000143",
"hcv1a_proteoform_glycosylation_sites_literature":"GLY_000335",
"sarscov1_proteoform_glycosylation_sites_literature":"GLY_000510",
"human_proteoform_glycosylation_sites_o_gluc":"GLY_000716",
"glycan_biomarkers":"GLY_000737",
"glycan_citations_ncfg":"GLY_000538",
"glycan_citations_glytoucan":"GLY_000285",
"glycan_synthesized":"GLY_000309"
}
We can't use ds2bco.json for UniCarbKB because Raja wants the evidence badge to say "UniCarbKB"
If we use protein_xref_glygen_ds/glycan_xref_glygen_ds as the xref_key in these datasets, the badge will say "GlyGen"
I want all dataset-2-bco mappings in ds2bco.json, please put them there.
I will make sure it will not say "protein_xref_glygen_ds/glycan_xref_glygen_ds" for UniCarbKB datasets
Ok great - I have updated ds2bco.json
I have modified these datasets now -- please check:
glycan_species.csv glycan_species_customized_neuac_neugc.csv glycan_citations_glytoucan.csv
human_proteoform_glycosylation_sites_unicarbkb.csv mouse_proteoform_glycosylation_sites_unicarbkb.csv rat_proteoform_glycosylation_sites_unicarbkb.csv sarscov1_proteoform_glycosylation_sites_unicarbkb.csv sarscov2_proteoform_glycosylation_sites_unicarbkb.csv human_proteoform_citations_glycosylation_sites_unicarbkb.csv mouse_proteoform_citations_glycosylation_sites_unicarbkb.csv rat_proteoform_citations_glycosylation_sites_unicarbkb.csv sarscov1_proteoform_citations_glycosylation_sites_unicarbkb.csv sarscov2_proteoform_citations_glycosylation_sites_unicarbkb.csv
human_proteoform_glycosylation_sites_unicarbkb_glycomics_study.csv rat_proteoform_glycosylation_sites_unicarbkb_glycomics_study.csv human_proteoform_citations_glycosylation_sites_unicarbkb_glycomics_study.csv rat_proteoform_citations_glycosylation_sites_unicarbkb_glycomics_study.csv
human_proteoform_glycosylation_sites_literature.csv hcv1a_proteoform_glycosylation_sites_literature.csv sarscov1_proteoform_glycosylation_sites_literature.csv
hcv1a_proteoform_citations_glycosylation_sites_literature.csv human_proteoform_citations_glycosylation_sites_literature.csv sarscov1_proteoform_citations_glycosylation_sites_literature.csv
All of the datasets look good. The following were missing from ds2bco.json, so I just added them: "glycan_species":"GLY_000341", "glycan_species_customized_neuac_neugc":"GLY_000341"
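A check like the one described above could be scripted. The following is a minimal sketch, assuming that ds2bco.json keys are dataset filenames without the .csv extension (as in the listing earlier in the thread). The function name `missing_mappings` is hypothetical, and the inline JSON is a small sample; in practice you would load generated/misc/ds2bco.json instead.

```python
import json
import os


def missing_mappings(ds2bco, filenames):
    """Return dataset names (filename minus .csv) that have no entry in ds2bco."""
    missing = []
    for name in filenames:
        base = os.path.basename(name)
        dataset = base[:-4] if base.endswith(".csv") else base
        if dataset not in ds2bco and dataset not in missing:
            missing.append(dataset)
    return missing


# Inline sample for illustration only; the real mapping lives in
# generated/misc/ds2bco.json.
ds2bco = json.loads(
    '{"glycan_citations_glytoucan": "GLY_000285",'
    ' "human_proteoform_glycosylation_sites_literature": "GLY_000143"}'
)
files = [
    "glycan_species.csv",
    "glycan_species_customized_neuac_neugc.csv",
    "glycan_citations_glytoucan.csv",
]
result = missing_mappings(ds2bco, files)
print(result)
```

Running this against the full list of modified datasets would flag any file that still needs an entry before review.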
Per Raja, UniCarbKB evidence badges should say "UniCarbKB" but point to GlyGen datasets. I updated the xref_info.csv file so there are now four unicarbkb xref IDs: two are disabled so we can display them in the "Cross References" section without a link, and two point to datasets for evidence badges.
Please make the following changes:
1. *_proteoform_glycosylation_sites_unicarbkb.csv and *_proteoform_glycosylation_sites_unicarbkb_glycomics_study.csv:
   change "src_xref_key" from "protein_xref_unicarbkb" to "protein_xref_unicarbkb_ds"
   replace "src_xref_id" with species-specific GlyGen dataset
2. glycan_species.csv:
   where "source" == "UniCarbKB", make "src_key" = "glycan_xref_unicarbkb_ds"
3. glycan_species_customized_neuac_neugc.csv:
   "xref_key" and "xref_id" from source file
   make output rows look like this ---->
$ grep G05528SJ.*UniCarbKB unreviewed/glycan_species_customized_neuac_neugc.csv
"G05528SJ","2697049","Severe acute respiratory syndrome coronavirus 2","Direct","UniCarbKB","comp_HexNAc11Hex11dHex0NeuAc1NeuGc0Pent0S0P0KDN0HexA0","comp_HexNAc11Hex11dHex0NeuAc1NeuGc0Pent0S0P0KDN0HexA0","glycan_xref_unicarbkb_ds","GLY_000341","False"
make output rows look like this ---->
$ awk -F, '{print $6,$7,$8,$9}' unreviewed/hcv1a_proteoform_glycosylation_sites_literature.csv | head -3
"xref_key" "xref_id" "src_xref_key" "src_xref_id"
"protein_xref_pubmed" "18187336" "protein_xref_glygen_ds" "GLY_000335"
"protein_xref_unicarbkb_ds" "GLY_000335" "protein_xref_glygen_ds" "GLY_000335"
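Change 1 in the list above (retargeting "src_xref_key" and "src_xref_id") might be sketched as follows. This is only an illustration, not the actual dataset-generation code: the function name `retarget_unicarbkb_rows`, the sample rows, and the placeholder BCO ID "GLY_000000" are all hypothetical, since the thread does not state the species-specific IDs for the UniCarbKB datasets.

```python
import csv
import io


def retarget_unicarbkb_rows(rows, bco_id):
    """Point UniCarbKB source xrefs at the dataset-backed key and BCO ID."""
    out = []
    for row in rows:
        row = dict(row)  # copy so the input rows are not mutated
        if row.get("src_xref_key") == "protein_xref_unicarbkb":
            row["src_xref_key"] = "protein_xref_unicarbkb_ds"
            row["src_xref_id"] = bco_id
        out.append(row)
    return out


# Made-up sample rows; only the column names and xref keys come from the thread.
sample = io.StringIO(
    '"xref_key","xref_id","src_xref_key","src_xref_id"\n'
    '"protein_xref_pubmed","00000000","protein_xref_unicarbkb","PLACEHOLDER"\n'
    '"protein_xref_pubmed","00000001","protein_xref_glygen_ds","GLY_000335"\n'
)
reader = csv.DictReader(sample)
updated = retarget_unicarbkb_rows(reader, "GLY_000000")  # hypothetical BCO ID

# Write back with every field quoted, matching the style of the rows shown above.
buf = io.StringIO()
writer = csv.DictWriter(
    buf, fieldnames=reader.fieldnames, quoting=csv.QUOTE_ALL, lineterminator="\n"
)
writer.writeheader()
writer.writerows(updated)
print(buf.getvalue(), end="")
```

Rows whose "src_xref_key" is anything other than "protein_xref_unicarbkb" pass through unchanged, so rows already pointing at "protein_xref_glygen_ds" are not affected.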