monarch-initiative / monarch-ingest

Data ingest application for Monarch Initiative knowledge graph using Koza
https://monarchinitiative.org

Filter out unmapped genomic entities in Panther ingest by namespace #244

Closed putmantime closed 2 years ago

putmantime commented 2 years ago

The Panther ingest has genes with the following namespaces: {'ENSEMBL', 'EnsemblGenome', 'FB', 'Gene', 'GeneID', 'Gene_ORFName', 'Gene_OrderedLocusName', 'MGI', 'PomBase', 'RGD', 'SGD', 'WormBase', 'ZFIN', 'dictyBase'}

"Gene" seems to be the gene symbol; "GeneID" seems to be the Entrez Gene ID; "Gene_ORFName" is the gene ORF name from a transcript in UniProt; "Gene_OrderedLocusName" is a gene ordered locus name.

We need to map these prefixes to their canonical prefixes where possible and identify the namespaces to omit. Monarch preferred namespaces: https://docs.google.com/spreadsheets/d/1XrljI1Dk2Tg0teJSbQls5Iq_KCGXMCdyWvqAMfMgJ-M/edit#gid=136453094
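A minimal sketch of the mapping-and-omitting step described above, in Python. The table `CANONICAL_PREFIX` and the function `to_canonical_curie` are hypothetical names introduced for illustration, and the target prefixes (for example `NCBIGene` for `GeneID` and `WB` for `WormBase`) are assumptions; the authoritative choices are in the preferred-namespace spreadsheet linked above. Namespaces left out of the table, including `EnsemblGenome`, are treated as unmapped and dropped.

```python
from typing import Optional

# Hypothetical mapping from Panther namespaces to canonical prefixes.
# The target prefixes are assumptions, not the confirmed Monarch choices.
CANONICAL_PREFIX = {
    "ENSEMBL": "ENSEMBL",
    "FB": "FB",
    "GeneID": "NCBIGene",  # assumption: GeneID is the Entrez Gene ID
    "MGI": "MGI",
    "PomBase": "PomBase",
    "RGD": "RGD",
    "SGD": "SGD",
    "WormBase": "WB",  # assumption
    "ZFIN": "ZFIN",
    "dictyBase": "dictyBase",
}


def to_canonical_curie(namespace: str, local_id: str) -> Optional[str]:
    """Return 'prefix:local_id', or None when the namespace is unmapped."""
    prefix = CANONICAL_PREFIX.get(namespace)
    if prefix is None:
        # Unmapped namespaces (e.g. 'Gene', 'Gene_ORFName') are filtered out.
        return None
    return f"{prefix}:{local_id}"
```

A caller would skip any ortholog row in which either gene resolves to `None`, so unmapped entities never reach the graph.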

RichardBruskiewich commented 2 years ago

Hi @kevinschaper and @putmantime, the 'EnsemblGenome' namespace is a weird one: its identifier values seem to be simply taxon-specific gene identifiers. I quickly compared one such entry (for dictyBase), and it looks like the gene identifiers, although identical in (taxon-specific) format, may form disjoint sets. I'm not sure what that means in practice... is the species-specific database properly curated, while the Ensembl Genomes entries are only predicted gene loci (which seem to have orthologs in other genomes)?

I could, as a first approximation and with some tricky coding, remap these identifiers onto their individual taxonomic namespaces. I'm just wondering how problematic that would be scientifically (i.e. from an end-user data quality perspective).

I'll poke around a bit more with this issue, while I await your feedback...

Here's one example:

DANRE|ZFIN=ZDB-GENE-090112-5|UniProtKB=E9QCN7   DICDI|EnsemblGenome=DDB_G0277073|UniProtKB=Q550K4       O       Unikonts        PTHR21324

The corresponding Uniprot entries are:

https://www.uniprot.org/uniprot/E9QCN7 and https://www.uniprot.org/uniprot/Q550K4

Note that these are "predicted proteins" but, interestingly, there seems to be some expression support for it (in zebrafish: https://bgee.org/gene/ENSDARG00000069590)
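For reference, a rough Python sketch of how the example row above could be parsed and how an 'EnsemblGenome' entry might be remapped onto a taxon-specific namespace. The names `PantherGene`, `parse_gene_field`, `SPECIES_NAMESPACE` and `remap_ensembl_genome` are hypothetical, and the `DICDI` to `dictyBase` entry is an assumption based only on the identifier format; whether the two identifier sets actually coincide is exactly the open question raised in this comment.

```python
from typing import NamedTuple


class PantherGene(NamedTuple):
    species: str      # Panther species code, e.g. 'DICDI'
    namespace: str    # gene identifier namespace, e.g. 'EnsemblGenome'
    gene_id: str
    uniprot_id: str


def parse_gene_field(field: str) -> PantherGene:
    """Parse 'SPECIES|Namespace=GeneId|UniProtKB=Accession'."""
    species, gene, protein = field.split("|")
    namespace, gene_id = gene.split("=", 1)
    uniprot_id = protein.split("=", 1)[1]
    return PantherGene(species, namespace, gene_id, uniprot_id)


# Hypothetical species-code -> namespace table (an assumption, not curated).
SPECIES_NAMESPACE = {"DICDI": "dictyBase"}


def remap_ensembl_genome(gene: PantherGene) -> PantherGene:
    """Swap 'EnsemblGenome' for a taxon-specific namespace when one is listed."""
    if gene.namespace == "EnsemblGenome" and gene.species in SPECIES_NAMESPACE:
        return gene._replace(namespace=SPECIES_NAMESPACE[gene.species])
    return gene


row = (
    "DANRE|ZFIN=ZDB-GENE-090112-5|UniProtKB=E9QCN7\t"
    "DICDI|EnsemblGenome=DDB_G0277073|UniProtKB=Q550K4\t"
    "O\tUnikonts\tPTHR21324"
)
# The first two tab-separated columns are the gene pair; the rest are
# the ortholog type, the common ancestor and the Panther family ID.
gene_a, gene_b = (parse_gene_field(f) for f in row.split("\t")[:2])
remapped_b = remap_ensembl_genome(gene_b)
```

Species codes missing from the table are left untouched, so such rows can still be filtered out later as unmapped.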