monarch-initiative / monarch-ingest

Data ingest application for Monarch Initiative knowledge graph using Koza
https://monarchinitiative.org

Set taxon ID for HGNC & SGD genes #347

Closed by kevinschaper 10 months ago

kevinschaper commented 2 years ago

I was looking at the ID prefix and taxon ID for all of our genes, and realized that we're missing the in_taxon field for some human and yeast genes:

| category | in_taxon | prefix |
| --- | --- | --- |
| biolink:Gene | NCBITaxon:10090 | MGI: |
| biolink:Gene | NCBITaxon:10116 | RGD: |
| biolink:Gene | NCBITaxon:227321 | NCBIGene: |
| biolink:Gene | NCBITaxon:4896 | PomBase: |
| biolink:Gene | NCBITaxon:559292 | SGD: |
| biolink:Gene | NCBITaxon:6239 | WB: |
| biolink:Gene | NCBITaxon:7227 | FB: |
| biolink:Gene | NCBITaxon:7955 | ZFIN: |
| biolink:Gene | NCBITaxon:9031 | NCBIGene: |
| biolink:Gene | NCBITaxon:9606 | HGNC: |
| biolink:Gene | NCBITaxon:9615 | NCBIGene: |
| biolink:Gene | NCBITaxon:9823 | NCBIGene: |
| biolink:Gene | NCBITaxon:9913 | NCBIGene: |
| biolink:Gene | None | SGD: |
| biolink:Gene | None | HGNC: |
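
A breakdown like the one above can be reproduced from a KGX nodes file. The following is a minimal sketch, not the query actually used in this issue: `summarize_gene_taxa`, `read_nodes`, and `MISSING_TAXON` are hypothetical names, and it assumes a tab-separated file with `id`, `category`, and `in_taxon` columns in which a missing taxon shows up as an empty value or the literal string `None`.

```python
import csv
from collections import Counter

# Assumption: values that should be treated as "no taxon" in the nodes file.
MISSING_TAXON = {"", "None"}


def summarize_gene_taxa(rows):
    """Count gene nodes per (in_taxon, ID prefix) pair."""
    counts = Counter()
    for row in rows:
        # Same exact-match filter as the queries in this thread.
        if row.get("category") != "biolink:Gene":
            continue
        taxon = row.get("in_taxon")
        if taxon is None or taxon in MISSING_TAXON:
            taxon = None
        # "HGNC:21060" -> "HGNC:"
        prefix = row["id"].split(":", 1)[0] + ":"
        counts[(taxon, prefix)] += 1
    return counts


def read_nodes(path):
    """Yield one dict per row of a tab-separated nodes file."""
    with open(path, newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle, delimiter="\t")
```

For example, `summarize_gene_taxa(read_nodes("output/monarch-kg_nodes.tsv"))` would return counts keyed by `(in_taxon, prefix)`, with `None` as the taxon for genes that lack one.
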

kevinschaper commented 2 years ago

This is strange.

In alliance_gene_nodes.tsv and hgnc_gene_nodes.tsv I don't see anything with a missing taxon.

After the merge, I see just 3 genes with no taxon, which appear to have come from the same files:

"select distinct category, in_taxon, id, provided_by from output/monarch-kg_nodes.tsv where category = 'biolink:Gene' and in_taxon not like 'NCBI%' order by 1, 2"
category        in_taxon    id              provided_by
biolink:Gene    None        SGD:S000004416  output/transform_output/alliance_gene_nodes.tsv
biolink:Gene    None        HGNC:21060      output/transform_output/hgnc_gene_nodes.tsv
biolink:Gene    None        HGNC:40992      output/transform_output/hgnc_gene_nodes.tsv
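
One way to check whether the taxon is being dropped during the merge is to compare each gene's `in_taxon` in the transform output with its value in the merged file. A rough sketch, assuming both files have already been read into lists of dicts with `id` and `in_taxon` keys; `taxon_lost_in_merge` is a hypothetical helper, and the prefix test mirrors the `NCBI%` pattern in the query above:

```python
def taxon_lost_in_merge(source_rows, merged_rows):
    """Return sorted IDs that have a taxon in source_rows but not in merged_rows."""

    def has_taxon(row):
        # Mirrors the "in_taxon like 'NCBI%'" test used in the query.
        return (row.get("in_taxon") or "").startswith("NCBI")

    source_with_taxon = {row["id"] for row in source_rows if has_taxon(row)}
    return sorted(
        {
            row["id"]
            for row in merged_rows
            if row["id"] in source_with_taxon and not has_taxon(row)
        }
    )
```

An empty result would suggest the taxon was already missing before the merge; a non-empty one points at the merge step.
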

kevinschaper commented 2 years ago

Removing my assignment, because this doesn't feel urgent...even though it's very spooky.

kevinschaper commented 1 year ago

Update: here is the current list, still spooky

sqlite3 -markdown monarch-kg.db "select distinct category, in_taxon, id, provided_by from nodes where category = 'biolink:Gene' and in_taxon not like 'NCBI%' order by 1, 2"

| category | in_taxon | id | provided_by |
| --- | --- | --- | --- |
| biolink:Gene | | HGNC:21060 | hgnc_gene_nodes |
| biolink:Gene | | HGNC:40992 | hgnc_gene_nodes |
| biolink:Gene | | WB:WBGene00006857 | alliance_gene_nodes |
| biolink:Gene | | WB:WBGene00006858 | alliance_gene_nodes |
| biolink:Gene | | WB:WBGene00006859 | alliance_gene_nodes |
| biolink:Gene | | WB:WBGene00006860 | alliance_gene_nodes |
| biolink:Gene | | SGD:S000004416 | alliance_gene_nodes |
| biolink:Gene | | RGD:1306669 | alliance_gene_nodes |
| biolink:Gene | | RGD:1308036 | alliance_gene_nodes |
| biolink:Gene | | RGD:70947 | alliance_gene_nodes |
| biolink:Gene | | RGD:1585231 | alliance_gene_nodes |
| biolink:Gene | | RGD:1307632 | alliance_gene_nodes |
| biolink:Gene | | RGD:1309993 | alliance_gene_nodes |
| biolink:Gene | | RGD:1359631 | alliance_gene_nodes |
| biolink:Gene | | MGI:1334416 | alliance_gene_nodes |
| biolink:Gene | | MGI:1334417 | alliance_gene_nodes |
| biolink:Gene | | MGI:1342270 | alliance_gene_nodes |
| biolink:Gene | | MGI:1860417 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0004598 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0015371 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0015931 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0015932 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0019928 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0019929 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0020828 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0020831 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0020850 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0023179 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0025343 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0025608 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0026616 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0027588 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0027661 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0029688 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0044423 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0044424 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0044425 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0044426 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0062518 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0070051 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0070056 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0070057 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0283652 | alliance_gene_nodes |
| biolink:Gene | | FB:FBgn0285970 | alliance_gene_nodes |
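
One caveat when running this query from code: in SQL, `NULL NOT LIKE 'NCBI%'` evaluates to NULL rather than true, so a row whose `in_taxon` is NULL (as opposed to an empty string) would not be returned. Whether the `nodes` table actually contains NULLs is not known here; the sketch below only adds an explicit `IS NULL` check, using Python's standard `sqlite3` module. `genes_without_taxon` is a hypothetical helper name.

```python
import sqlite3

# Same query as above, plus an explicit NULL check (an addition, not the original).
QUERY = """
SELECT DISTINCT category, in_taxon, id, provided_by
FROM nodes
WHERE category = 'biolink:Gene'
  AND (in_taxon IS NULL OR in_taxon NOT LIKE 'NCBI%')
ORDER BY 1, 2
"""


def genes_without_taxon(connection):
    """Return gene rows whose in_taxon is NULL or does not start with 'NCBI'."""
    return connection.execute(QUERY).fetchall()
```

It could be called as `genes_without_taxon(sqlite3.connect("monarch-kg.db"))`.
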

monicacecilia commented 10 months ago

@kevinschaper 👻 .... spooky, though fixed? or spooky and still unsolved?

kevinschaper commented 10 months ago

It looks like it's solved!

The same query now returns phenio cruft instead:

| category | in_taxon | id | provided_by |
| --- | --- | --- | --- |
| biolink:Gene | | SIO:010035 | phenio_nodes |
| biolink:Gene | | DATACOMMONS:Gene | phenio_nodes |

I'll close this and open a new issue for limiting our phenio nodes by category.