monarch-initiative / monarch-gene-mapping

Code for mapping source namespaces to preferred namespacing

Switch from BGI files to GENECROSSREFERENCE file for Alliance genes #12

Closed kevinschaper closed 1 year ago

kevinschaper commented 1 year ago

Alliance publishes a cross-reference file for genes that we can use. It is a more compact format than the larger BGI files.

https://fms.alliancegenome.org/download/GENECROSSREFERENCE_COMBINED.tsv.gz

##########################################################################
#
# Data type: Gene Cross Reference
# Data format: tsv
# README:
# Source: Alliance of Genome Resources (Alliance)
# Source URL: http://alliancegenome.org/downloads
# Help Desk: help@alliancegenome.org
# Taxon IDs: NCBITaxon:9606, NCBITaxon:10116, NCBITaxon:10090, NCBITaxon:7955, NCBITaxon:8364, NCBITaxon:8355, NCBITaxon:7227, NCBITaxon:6239, NCBITaxon:559292, NCBITaxon:2697049
# Species: Homo sapiens, Rattus norvegicus, Mus musculus, Danio rerio, Xenopus tropicalis, Xenopus laevis, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, SARS-CoV-2
# Alliance Database Version: 5.3.0
# Date file generated (UTC): 2022-09-26 23:57
#
##########################################################################
GeneID  GlobalCrossReferenceID  CrossReferenceCompleteURL   ResourceDescriptorPage  TaxonID
RefSeq:YP_009725301 RefSeq:YP_009725301 https://www.ncbi.nlm.nih.gov/nuccore/YP_009725301   generic_cross_reference NCBITaxon:2697049
RefSeq:YP_009725301 RefSeq:YP_009742612 https://www.ncbi.nlm.nih.gov/nuccore/YP_009742612   generic_cross_reference NCBITaxon:2697049
RefSeq:YP_009725297 NCBI_Gene:43740578  https://www.ncbi.nlm.nih.gov/gene/43740578  generic_cross_reference NCBITaxon:2697049
RefSeq:YP_009725308 NCBI_Gene:43740578  https://www.ncbi.nlm.nih.gov/gene/43740578  generic_cross_reference NCBITaxon:2697049
RefSeq:YP_009725300 NCBI_Gene:43740578  https://www.ncbi.nlm.nih.gov/gene/43740578  generic_cross_reference NCBITaxon:2697049

We should filter TaxonID to exclude human and SARS-CoV-2 (we don't have SARS-CoV-2 genes, and we'll get our human mapping from HGNC).

We'll also only want gene mappings where the GeneID column has a prefix we're using for Alliance genes, meaning just the MOD identifiers (MGI:, RGD:, FB:, WB:, ZFIN:, Xenbase:).

We'll also want to exclude self-to-self cross references, which are in there for other reasons (CURIE expansion to a gene expression or publication page on a MOD site, for example).
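The three filters described above could be sketched roughly as follows. This is only an illustration, not the code in the repository: the function name `filter_cross_references` is hypothetical, the column names come from the file header quoted above, and the MGI and HGNC rows in `SAMPLE` are made-up placeholder identifiers.

```python
import csv
import io

# Taxa to drop: human (mappings come from HGNC) and SARS-CoV-2.
EXCLUDED_TAXA = {"NCBITaxon:9606", "NCBITaxon:2697049"}
# MOD prefixes listed in this issue.
MOD_PREFIXES = ("MGI:", "RGD:", "FB:", "WB:", "ZFIN:", "Xenbase:")


def filter_cross_references(lines):
    """Yield (gene_id, xref_id, taxon) for rows passing all three filters."""
    # Drop the '#' comment header before handing lines to the TSV reader.
    data = (line for line in lines if not line.startswith("#"))
    for row in csv.DictReader(data, delimiter="\t"):
        gene_id = row["GeneID"]
        xref_id = row["GlobalCrossReferenceID"]
        taxon = row["TaxonID"]
        if taxon in EXCLUDED_TAXA:
            continue  # human or SARS-CoV-2
        if not gene_id.startswith(MOD_PREFIXES):
            continue  # GeneID is not a MOD identifier
        if gene_id == xref_id:
            continue  # self-to-self cross reference
        yield gene_id, xref_id, taxon


# Tiny in-memory sample. Only the RefSeq row is taken from the file excerpt;
# the MGI and HGNC rows use placeholder identifiers for illustration.
SAMPLE = "\n".join([
    "# Data type: Gene Cross Reference",
    "GeneID\tGlobalCrossReferenceID\tCrossReferenceCompleteURL\tResourceDescriptorPage\tTaxonID",
    "RefSeq:YP_009725297\tNCBI_Gene:43740578\thttps://www.ncbi.nlm.nih.gov/gene/43740578\tgeneric_cross_reference\tNCBITaxon:2697049",
    "MGI:0000001\tMGI:0000001\thttps://example.org/self\tgeneric_cross_reference\tNCBITaxon:10090",
    "MGI:0000001\tNCBI_Gene:0000001\thttps://example.org/xref\tgeneric_cross_reference\tNCBITaxon:10090",
    "HGNC:0000001\tNCBI_Gene:0000002\thttps://example.org/human\tgeneric_cross_reference\tNCBITaxon:9606",
])

if __name__ == "__main__":
    for mapping in filter_cross_references(io.StringIO(SAMPLE)):
        print(mapping)
```

For the real file, the same function should work on a text-mode handle such as `gzip.open(path, "rt")`, where `path` points at a local copy of the download.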

It will be interesting, in the course of making this change, to see how the result differs from the mapping we create from BGI files. We might need to ask the Alliance folks some questions if there's a big difference.

amc-corey-cox commented 1 year ago

The current file contains ~100 coronavirus entries and ~350,000 human entries, which we can easily filter out.

The CURIE prefixes we are interested in have the record counts shown below:

Curie GCR BGI Difference
MGI: 264,550 72,543 +192,007
RGD: 184,963 66,866 +118,097
FB: 86,592 17,647 +68,945
WB: 245,377 46,926 +198,451
ZFIN: 150,526 30,551 +119,975
Xenbase: 165,656 38,492 +127,164

These are big changes and I'm not really sure what is going on. I've pushed my code to a branch and draft PR so I can spend some time thinking it through. Please share any thoughts you might have.

kevinschaper commented 1 year ago

@amc-corey-cox Thanks for checking the counts! Can you expand that table to show both CURIEs? That might tell us what the extras look like.

amc-corey-cox commented 1 year ago

We want to exclude the UniProtKB ID mappings from Alliance.

kevinschaper commented 1 year ago

I brought the TSV output in and took the BGI/JSON mappings out. I left UniProtKB IDs in, with the thought that I'd rather not pull all of the existing UniProtKB mappings out until we add them all in as a part of #3.

I like these numbers:

taxon subject_prefix object_prefix total
NCBITaxon:10090 MGI: ENSEMBL: 56685
NCBITaxon:10090 MGI: NCBI_Gene: 59640
NCBITaxon:10090 MGI: UniProtKB: 79208
NCBITaxon:10116 RGD: ENSEMBL: 43635
NCBITaxon:10116 RGD: NCBI_Gene: 53372
NCBITaxon:10116 RGD: UniProtKB: 38422
NCBITaxon:6239 WB: ENSEMBL: 46926
NCBITaxon:6239 WB: NCBI_Gene: 46886
NCBITaxon:6239 WB: UniProtKB: 26637
NCBITaxon:7227 FB: NCBI_Gene: 17632
NCBITaxon:7227 FB: UniProtKB: 27663
NCBITaxon:7955 ZFIN: ENSEMBL: 27211
NCBITaxon:7955 ZFIN: NCBI_Gene: 23260
NCBITaxon:7955 ZFIN: UniProtKB: 57307
NCBITaxon:8355 Xenbase: NCBI_Gene: 22831
NCBITaxon:8355 Xenbase: UniProtKB: 24132
NCBITaxon:8364 Xenbase: ENSEMBL: 12580
NCBITaxon:8364 Xenbase: NCBI_Gene: 15631
NCBITaxon:8364 Xenbase: UniProtKB: 28124
NCBITaxon:9031 NCBIGene: ENSEMBL: 15577
NCBITaxon:9606 HGNC: ENSEMBL: 43231
NCBITaxon:9606 HGNC: NCBIGene: 43231
NCBITaxon:9606 HGNC: OMIM: 43243
NCBITaxon:9606 HGNC: UniProtKB: 43312
NCBITaxon:9615 NCBIGene: ENSEMBL: 19927
NCBITaxon:9823 NCBIGene: ENSEMBL: 17783
NCBITaxon:9913 NCBIGene: ENSEMBL: 20403
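
A breakdown like the one above can be produced by grouping mappings on taxon, subject prefix, and object prefix. The following is a minimal sketch under assumptions, not the code that generated these numbers: `curie_prefix` and `count_prefix_pairs` are hypothetical names, and the input triples are placeholder identifiers.

```python
from collections import Counter


def curie_prefix(curie):
    """Return the prefix of a CURIE including the colon, e.g. 'MGI:'."""
    return curie.split(":", 1)[0] + ":"


def count_prefix_pairs(mappings):
    """Count mappings per (taxon, subject_prefix, object_prefix)."""
    counts = Counter()
    for subject_id, object_id, taxon in mappings:
        key = (taxon, curie_prefix(subject_id), curie_prefix(object_id))
        counts[key] += 1
    return counts


# Placeholder mappings for illustration only.
EXAMPLE_MAPPINGS = [
    ("MGI:0000001", "NCBI_Gene:0000001", "NCBITaxon:10090"),
    ("MGI:0000002", "NCBI_Gene:0000002", "NCBITaxon:10090"),
    ("MGI:0000001", "ENSEMBL:PLACEHOLDER1", "NCBITaxon:10090"),
    ("ZFIN:PLACEHOLDER-1", "UniProtKB:PLACEHOLDER2", "NCBITaxon:7955"),
]

if __name__ == "__main__":
    print("taxon subject_prefix object_prefix total")
    for (taxon, subj, obj), total in sorted(count_prefix_pairs(EXAMPLE_MAPPINGS).items()):
        print(taxon, subj, obj, total)
```

Splitting on the first colon only keeps identifiers that contain further colons intact, so the prefix is always the leading namespace.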