putmantime closed 2 years ago
Hi @kevinschaper and @putmantime, the 'EnsemblGenome' namespace is a weird one: its identifier values seem to be simply taxon-specific gene identifiers. I quickly compared one such entry (for dictyBase) and it looks like the gene identifiers, although identical in (taxon-specific) format, may be disjoint sets. I'm not sure what that means in practice... is the species-specific database properly curated, while the Ensembl Genomes entries are only predicted gene loci (which seem to have orthologs in other genomes)?
I could, as a first approximation and with some tricky coding, remap these identifiers onto their individual taxonomic namespaces. I'm just wondering how problematic that would be scientifically (i.e. from an end-user data quality perspective).
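A minimal sketch of what such a remapping could look like, assuming the five-letter organism code on each record (e.g. `DICDI`) is enough to pick the target namespace. The `TAXON_NAMESPACE` table and the `remap_ensembl_genome` function are hypothetical names for illustration; only the `DICDI` to `dictyBase` pairing comes from the entry compared above.

```python
from typing import Tuple

# Hypothetical lookup: organism code -> species-specific namespace.
# Only the DICDI entry is based on the example discussed in this thread.
TAXON_NAMESPACE = {
    "DICDI": "dictyBase",
}


def remap_ensembl_genome(taxon: str, namespace: str, local_id: str) -> Tuple[str, str]:
    """Swap 'EnsemblGenome' for a taxon-specific namespace when one is known.

    The identifier itself is passed through unchanged; nothing here checks
    that it actually exists in the target database.
    """
    if namespace == "EnsemblGenome" and taxon in TAXON_NAMESPACE:
        return TAXON_NAMESPACE[taxon], local_id
    return namespace, local_id


print(remap_ensembl_genome("DICDI", "EnsemblGenome", "DDB_G0277073"))
# -> ('dictyBase', 'DDB_G0277073')
```

Because the two identifier sets may be disjoint, a remapped identifier is not guaranteed to resolve in the species database, so a real implementation would probably want a validation step against that database.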
I'll poke around a bit more with this issue, while I await your feedback...
Here's one example:
DANRE|ZFIN=ZDB-GENE-090112-5|UniProtKB=E9QCN7 DICDI|EnsemblGenome=DDB_G0277073|UniProtKB=Q550K4 O Unikonts PTHR21324
The corresponding UniProt entries are:
https://www.uniprot.org/uniprot/E9QCN7 and https://www.uniprot.org/uniprot/Q550K4
Note that these are "predicted proteins" but, interestingly, there seems to be some expression support for at least one of them (in zebrafish: https://bgee.org/gene/ENSDARG00000069590).
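For reference, the example line above can be pulled apart with a few lines of Python. This is only a sketch: it assumes the fields are whitespace-separated with no spaces inside a field (the real file is probably tab-delimited), and the field names `ortholog_type`, `ancestor`, and `family` are guesses at what the last three columns mean. `OrthologRecord`, `parse_gene`, and `parse_line` are hypothetical names.

```python
from typing import Dict, NamedTuple


class OrthologRecord(NamedTuple):
    gene_a: Dict[str, str]
    gene_b: Dict[str, str]
    ortholog_type: str  # assumed meaning of the "O" column
    ancestor: str       # assumed meaning of the "Unikonts" column
    family: str         # e.g. PTHR21324


def parse_gene(field: str) -> Dict[str, str]:
    """Split 'TAXON|Namespace=GeneId|UniProtKB=Accession' into its parts."""
    taxon, gene, protein = field.split("|")
    # Split on the first '=' only, in case an identifier itself contains '='.
    gene_ns, gene_id = gene.split("=", 1)
    protein_ns, protein_id = protein.split("=", 1)
    return {
        "taxon": taxon,
        "gene_namespace": gene_ns,
        "gene_id": gene_id,
        "protein_namespace": protein_ns,
        "protein_id": protein_id,
    }


def parse_line(line: str) -> OrthologRecord:
    """Parse one five-column ortholog line into an OrthologRecord."""
    a, b, ortholog_type, ancestor, family = line.split()
    return OrthologRecord(parse_gene(a), parse_gene(b), ortholog_type, ancestor, family)


example = (
    "DANRE|ZFIN=ZDB-GENE-090112-5|UniProtKB=E9QCN7 "
    "DICDI|EnsemblGenome=DDB_G0277073|UniProtKB=Q550K4 O Unikonts PTHR21324"
)
record = parse_line(example)
print(record.gene_b["gene_namespace"], record.gene_b["gene_id"])
# -> EnsemblGenome DDB_G0277073
```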
The PANTHER ingest has genes with the following namespaces: {'ENSEMBL', 'EnsemblGenome', 'FB', 'Gene', 'GeneID', 'Gene_ORFName', 'Gene_OrderedLocusName', 'MGI', 'PomBase', 'RGD', 'SGD', 'WormBase', 'ZFIN', 'dictyBase'}
- "Gene" seems to be the gene symbol.
- "GeneID" seems to be the Entrez Gene ID.
- "Gene_ORFName" is the gene ORF name from a transcript in UniProt.
- "Gene_OrderedLocusName" is a gene ordered locus name.

We need to map the prefixes to their canonical prefixes where possible and identify namespaces to omit. Monarch preferred namespaces: https://docs.google.com/spreadsheets/d/1XrljI1Dk2Tg0teJSbQls5Iq_KCGXMCdyWvqAMfMgJ-M/edit#gid=136453094
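One way the mapping step might be sketched is below. The contents of `CANONICAL_PREFIX` and `OMIT` are assumptions for illustration, not decisions: the spreadsheet linked above is the authority on preferred prefixes. The name-like namespaces are omitted here only because they look like names rather than stable identifiers. `to_curie` is a hypothetical helper.

```python
from typing import Optional

# Assumed mapping from PANTHER namespaces to canonical prefixes.
# Namespaces not listed here are passed through unchanged.
CANONICAL_PREFIX = {
    "GeneID": "NCBIGene",  # assumption: Entrez Gene IDs use the NCBIGene prefix
}

# Assumed set of namespaces to drop (symbols and names, not identifiers).
OMIT = {"Gene", "Gene_ORFName", "Gene_OrderedLocusName"}


def to_curie(namespace: str, local_id: str) -> Optional[str]:
    """Return 'prefix:local_id', or None if the namespace should be omitted."""
    if namespace in OMIT:
        return None
    prefix = CANONICAL_PREFIX.get(namespace, namespace)
    return f"{prefix}:{local_id}"


print(to_curie("ZFIN", "ZDB-GENE-090112-5"))  # -> ZFIN:ZDB-GENE-090112-5
print(to_curie("Gene", "some-symbol"))        # -> None (placeholder value, omitted)
```

Keeping the mapping and the omit list as plain data makes it easy to update them as the preferred namespaces are settled, without touching the function itself.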