Closed kevinschaper closed 2 years ago
The current file contains ~100 coronavirus entries and ~350,000 human entries, which we can easily filter out.
The curie prefixes we are interested in have records shown below:

Curie | GCR | BGI | Difference |
---|---|---|---|
MGI: | 264,550 | 72,543 | + 192,007 |
RGD: | 184,963 | 66,866 | + 118,097 |
FB: | 86,592 | 17,647 | + 68,945 |
WB: | 245,377 | 46,926 | + 198,451 |
ZFIN: | 150,526 | 30,551 | + 119,975 |
Xenbase: | 165,656 | 38,492 | + 127,164 |
These are big changes and I'm not really sure what is causing them. I've pushed my code to a branch and opened a draft PR so I can spend some time working out what is going on here. Please share any thoughts you might have.
@amc-corey-cox Thanks for checking the counts! Can you expand that table to show both curie prefixes (subject and object)? That might tell us what the extras look like.
We want to exclude the UniProtKB ID mappings from Alliance.
I brought the tsv output in and took the bgi/json mappings out. I left UniProtKB IDs in, with the thought that I'd rather not pull all of the existing UniProtKB mappings out until we add them all in as a part of #3.
I like these numbers:
taxon | subject_prefix | object_prefix | total |
---|---|---|---|
NCBITaxon:10090 | MGI: | ENSEMBL: | 56685 |
NCBITaxon:10090 | MGI: | NCBI_Gene: | 59640 |
NCBITaxon:10090 | MGI: | UniProtKB: | 79208 |
NCBITaxon:10116 | RGD: | ENSEMBL: | 43635 |
NCBITaxon:10116 | RGD: | NCBI_Gene: | 53372 |
NCBITaxon:10116 | RGD: | UniProtKB: | 38422 |
NCBITaxon:6239 | WB: | ENSEMBL: | 46926 |
NCBITaxon:6239 | WB: | NCBI_Gene: | 46886 |
NCBITaxon:6239 | WB: | UniProtKB: | 26637 |
NCBITaxon:7227 | FB: | NCBI_Gene: | 17632 |
NCBITaxon:7227 | FB: | UniProtKB: | 27663 |
NCBITaxon:7955 | ZFIN: | ENSEMBL: | 27211 |
NCBITaxon:7955 | ZFIN: | NCBI_Gene: | 23260 |
NCBITaxon:7955 | ZFIN: | UniProtKB: | 57307 |
NCBITaxon:8355 | Xenbase: | NCBI_Gene: | 22831 |
NCBITaxon:8355 | Xenbase: | UniProtKB: | 24132 |
NCBITaxon:8364 | Xenbase: | ENSEMBL: | 12580 |
NCBITaxon:8364 | Xenbase: | NCBI_Gene: | 15631 |
NCBITaxon:8364 | Xenbase: | UniProtKB: | 28124 |
NCBITaxon:9031 | NCBIGene: | ENSEMBL: | 15577 |
NCBITaxon:9606 | HGNC: | ENSEMBL: | 43231 |
NCBITaxon:9606 | HGNC: | NCBIGene: | 43231 |
NCBITaxon:9606 | HGNC: | OMIM: | 43243 |
NCBITaxon:9606 | HGNC: | UniProtKB: | 43312 |
NCBITaxon:9615 | NCBIGene: | ENSEMBL: | 19927 |
NCBITaxon:9823 | NCBIGene: | ENSEMBL: | 17783 |
NCBITaxon:9913 | NCBIGene: | ENSEMBL: | 20403 |
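A table like the one above can be produced by counting mappings per taxon and prefix pair. This is a minimal sketch, not the code from the PR: the `(taxon, subject_id, object_id)` tuple layout, the helper names, and the sample identifiers are all assumptions made for illustration.

```python
from collections import Counter


def curie_prefix(curie):
    """Return the prefix of a curie including the colon, e.g. 'MGI:'."""
    return curie.split(":", 1)[0] + ":"


def count_prefix_pairs(mappings):
    """Count mappings per (taxon, subject_prefix, object_prefix).

    `mappings` is an iterable of (taxon, subject_id, object_id) tuples;
    this layout is hypothetical and may not match the real mapping file.
    """
    counts = Counter()
    for taxon, subject_id, object_id in mappings:
        key = (taxon, curie_prefix(subject_id), curie_prefix(object_id))
        counts[key] += 1
    return counts


# Made-up sample rows, only to show the shape of the result.
sample = [
    ("NCBITaxon:10090", "MGI:1", "ENSEMBL:ENSMUSG1"),
    ("NCBITaxon:10090", "MGI:2", "ENSEMBL:ENSMUSG2"),
    ("NCBITaxon:10090", "MGI:1", "UniProtKB:P00001"),
]
counts = count_prefix_pairs(sample)
for (taxon, subj, obj), total in sorted(counts.items()):
    print(f"{taxon} | {subj} | {obj} | {total} |")
```

Keeping the object prefix in the key is what separates the ENSEMBL, NCBI gene, and UniProtKB mappings for each MOD.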
Alliance has a cross reference file for genes that we can use which is a more compact format than the larger BGI files.
https://fms.alliancegenome.org/download/GENECROSSREFERENCE_COMBINED.tsv.gz
We should filter TaxonID to exclude human and SARS-CoV-2 (we don't have SARS-CoV-2 genes, and we'll get our human mapping from HGNC).
We'll also only want gene mappings when the GeneID column has a prefix we're using for Alliance genes, meaning just the MOD identifiers (MGI:, RGD:, FB:, WB:, ZFIN:, Xenbase:).
We'll also want to exclude self-to-self cross references, which are in there for other reasons (curie expansion to a gene expression or publication page on a MOD site, for example).
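The three filters above could be sketched roughly as follows. This is an illustration under assumptions, not the final ingest: `GeneID` and `TaxonID` are the column names mentioned above, but `GlobalCrossReferenceID` is an assumed name for the cross-reference column, the `NCBITaxon:` form of the taxon values is assumed, and "self to self" is read here as the cross-reference ID being identical to the gene ID. The sample rows are made up.

```python
import csv
import io

# Prefixes come from the issue text; the taxon curie format is an assumption.
MOD_PREFIXES = ("MGI:", "RGD:", "FB:", "WB:", "ZFIN:", "Xenbase:")
EXCLUDED_TAXA = {
    "NCBITaxon:9606",     # human (mappings come from HGNC instead)
    "NCBITaxon:2697049",  # SARS-CoV-2
}


def filter_xrefs(rows, xref_column="GlobalCrossReferenceID"):
    """Yield only the rows that pass the three filters described above.

    `xref_column` is an assumed column name; adjust it to the real header.
    """
    for row in rows:
        gene_id = row["GeneID"]
        if row["TaxonID"] in EXCLUDED_TAXA:
            continue  # drop human and SARS-CoV-2 rows
        if not gene_id.startswith(MOD_PREFIXES):
            continue  # keep only MOD gene identifiers
        if row[xref_column] == gene_id:
            continue  # drop self-to-self cross references
        yield row


# Tiny made-up sample standing in for the real TSV.
SAMPLE = (
    "GeneID\tGlobalCrossReferenceID\tTaxonID\n"
    "MGI:1\tENSEMBL:ENSMUSG1\tNCBITaxon:10090\n"
    "MGI:1\tMGI:1\tNCBITaxon:10090\n"
    "HGNC:5\tENSEMBL:ENSG5\tNCBITaxon:9606\n"
    "ZFIN:ZDB-GENE-1\tNCBI_Gene:99\tNCBITaxon:7955\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
kept = [(r["GeneID"], r["GlobalCrossReferenceID"]) for r in filter_xrefs(reader)]
print(kept)
```

The real file would need to be downloaded and decompressed first, and any header or comment lines it contains handled before the rows reach `filter_xrefs`.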
In the course of making this change, it will be interesting to see how the result differs from the mapping that we create from BGI files, and we might need to ask the Alliance folks about it if there's a big difference.