monarch-initiative / monarch-ingest

Data ingest application for Monarch Initiative knowledge graph using Koza
https://monarchinitiative.org

Resolve OMIM Nodes Issues where nodes have both disease and gene categories. #251

Closed putmantime closed 1 year ago

putmantime commented 2 years ago

In our OMIM ingest, some nodes are categorized as both gene and disease.

MATCH (n:`biolink:Disease`:`biolink:Gene`) RETURN count(n) AS count

returns 19 nodes:

MATCH (n:`biolink:Disease`:`biolink:Gene`) RETURN n.id

n.id

"OMIM:107670" "OMIM:109690" "OMIM:124080" "OMIM:126452" "OMIM:139360" "OMIM:142830" "OMIM:147545" "OMIM:158105" "OMIM:162080" "OMIM:168820" "OMIM:176797" "OMIM:600020" "OMIM:600985" "OMIM:601130" "OMIM:601410" "OMIM:602686" "OMIM:603013" "OMIM:603615" "OMIM:613733"

putmantime commented 2 years ago

A little more digging with pandas turned up some more:

OMIM:100650 {'biolink:Gene', 'biolink:Disease'}
OMIM:106150 {'biolink:Gene', 'biolink:Disease'}
OMIM:106180 {'biolink:Gene', 'biolink:Disease'}
OMIM:107670 {'biolink:Disease', 'biolink:Gene'}
OMIM:109690 {'biolink:Disease', 'biolink:Gene'}
OMIM:118425 {'biolink:Gene', 'biolink:Disease'}
OMIM:120120 {'biolink:Gene', 'biolink:Disease'}
OMIM:124080 {'biolink:Disease', 'biolink:Gene'}
OMIM:126452 {'biolink:Disease', 'biolink:Gene'}
OMIM:131240 {'biolink:Gene', 'biolink:Disease'}
OMIM:134637 {'biolink:Gene', 'biolink:Disease'}
OMIM:139320 {'biolink:Gene', 'biolink:Disease'}
OMIM:139360 {'biolink:Disease', 'biolink:Gene'}
OMIM:142830 {'biolink:Disease', 'biolink:Gene'}
OMIM:147545 {'biolink:Disease', 'biolink:Gene'}
OMIM:147575 {'biolink:Gene', 'biolink:Disease'}
OMIM:152390 {'biolink:Gene', 'biolink:Disease'}
OMIM:158105 {'biolink:Disease', 'biolink:Gene'}
OMIM:162080 {'biolink:Disease', 'biolink:Gene'}
OMIM:163729 {'biolink:Gene', 'biolink:Disease'}
OMIM:168820 {'biolink:Disease', 'biolink:Gene'}
OMIM:173360 {'biolink:Gene', 'biolink:Disease'}
OMIM:173470 {'biolink:Gene', 'biolink:Disease'}
OMIM:176797 {'biolink:Disease', 'biolink:Gene'}
OMIM:176943 {'biolink:Gene', 'biolink:Disease'}
OMIM:182100 {'biolink:Gene', 'biolink:Disease'}
OMIM:188830 {'biolink:Gene', 'biolink:Disease'}
OMIM:191160 {'biolink:Gene', 'biolink:Disease'}
OMIM:191170 {'biolink:Gene', 'biolink:Disease'}
OMIM:217050 {'biolink:Gene', 'biolink:Disease'}
OMIM:300265 {'biolink:Gene', 'biolink:Disease'}
OMIM:600020 {'biolink:Disease', 'biolink:Gene'}
OMIM:600098 {'biolink:Gene', 'biolink:Disease'}
OMIM:600700 {'biolink:Gene', 'biolink:Disease'}
OMIM:600985 {'biolink:Disease', 'biolink:Gene'}
OMIM:601130 {'biolink:Disease', 'biolink:Gene'}
OMIM:601373 {'biolink:Gene', 'biolink:Disease'}
OMIM:601410 {'biolink:Disease', 'biolink:Gene'}
OMIM:601465 {'biolink:Gene', 'biolink:Disease'}
OMIM:602421 {'biolink:Gene', 'biolink:Disease'}
OMIM:602686 {'biolink:Disease', 'biolink:Gene'}
OMIM:603013 {'biolink:Disease', 'biolink:Gene'}
OMIM:603324 {'biolink:Gene', 'biolink:Disease'}
OMIM:603372 {'biolink:Gene', 'biolink:Disease'}
OMIM:603517 {'biolink:Gene', 'biolink:Disease'}
OMIM:603615 {'biolink:Disease', 'biolink:Gene'}
OMIM:604124 {'biolink:Gene', 'biolink:Disease'}
OMIM:605204 {'biolink:Gene', 'biolink:Disease'}
OMIM:606989 {'biolink:Gene', 'biolink:Disease'}
OMIM:607093 {'biolink:Gene', 'biolink:Disease'}
OMIM:607585 {'biolink:Gene', 'biolink:Disease'}
OMIM:607759 {'biolink:Gene', 'biolink:Disease'}
OMIM:608537 {'biolink:Gene', 'biolink:Disease'}
OMIM:613733 {'biolink:Disease', 'biolink:Gene'}
OMIM:615538 {'biolink:Gene', 'biolink:Disease'}
OMIM:616902 {'biolink:Gene', 'biolink:Disease'}
OMIM:617352 {'biolink:Gene', 'biolink:Disease'}
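
The pandas code behind this listing was not posted. A minimal sketch of one way to run such a check, assuming a KGX-style nodes table with `id` and `category` columns, where a node may appear on several rows or carry several `|`-separated categories in one cell (the function name `multi_category_nodes` and the `EXAMPLE:` IDs are hypothetical):

```python
import pandas as pd


def multi_category_nodes(nodes: pd.DataFrame) -> pd.Series:
    """Return the set of categories per node id, keeping only ids with more than one."""
    # Assumption: several categories may be packed into one cell, separated by "|".
    exploded = nodes.assign(category=nodes["category"].str.split("|")).explode("category")
    # Collect every category seen for each id across all rows.
    categories = exploded.groupby("id")["category"].agg(set)
    return categories[categories.map(len) > 1]


# Toy input: one id from the thread plus made-up EXAMPLE ids.
nodes = pd.DataFrame(
    {
        "id": ["OMIM:107670", "OMIM:107670", "EXAMPLE:1", "EXAMPLE:2"],
        "category": [
            "biolink:Gene",
            "biolink:Disease",
            "biolink:Disease",
            "biolink:Gene|biolink:Disease",
        ],
    }
)

for node_id, cats in multi_category_nodes(nodes).items():
    print(node_id, cats)
```

Grouping on `id` rather than relying on graph labels would also catch nodes whose conflicting categories come from different ingests, which may be why this check found more than the 19 nodes from the Cypher query.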

RichardBruskiewich commented 2 years ago

@putmantime, I assume you want us to call a spade a spade: if an OMIM entry is classified as a Gene, then just call it a Gene, ignoring any additional classification (as a "disease")?

The ingest uses mim2gene mappings which have a column indicating when a MIM id is considered a Gene. I suppose using these mappings to assert "Gene" in an overriding fashion would likely resolve the issue. Unless I hear otherwise, I'll proceed forward with that understanding.
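
A rough sketch of that override, not the ingest's actual code. It assumes a tab-separated mim2gene file with the MIM number in the first column and the entry type in the second, and it assumes that the entry types `gene` and `gene/phenotype` should both count as Gene. The helper names and the sample rows are made up for illustration:

```python
from __future__ import annotations

# Assumption: these entry-type values mark a MIM id as a gene.
GENE_ENTRY_TYPES = {"gene", "gene/phenotype"}


def gene_mim_ids(mim2gene_text: str) -> set[str]:
    """Collect the OMIM CURIEs whose entry type marks them as genes."""
    ids: set[str] = set()
    for line in mim2gene_text.splitlines():
        # Skip blank lines and "#" comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        mim_number, entry_type = fields[0].strip(), fields[1].strip()
        if entry_type in GENE_ENTRY_TYPES:
            ids.add(f"OMIM:{mim_number}")
    return ids


def resolve_category(node_id: str, categories: set[str], gene_ids: set[str]) -> list[str]:
    """Assert biolink:Gene in an overriding fashion when mim2gene says gene."""
    if node_id in gene_ids:
        return ["biolink:Gene"]
    return sorted(categories)


# Made-up rows in the assumed layout (not real OMIM data).
SAMPLE = (
    "# illustrative rows only\n"
    "000001\tgene\t\tEXAMPLE1\t\n"
    "000002\tphenotype\t\t\t\n"
    "000003\tgene/phenotype\t\tEXAMPLE2\t\n"
)

gene_ids = gene_mim_ids(SAMPLE)
print(resolve_category("OMIM:000001", {"biolink:Gene", "biolink:Disease"}, gene_ids))
print(resolve_category("OMIM:000002", {"biolink:Disease"}, gene_ids))
```

Whether `gene/phenotype` entries should resolve to Gene is exactly the kind of call this issue is about, so that set is kept as a single constant that is easy to change.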

The slight complication here is that the OMIM gene_to_phenotype.py ingest doesn't seem to actually upload the Gene nodes themselves. I'll have to clarify where the gene nodes are being loaded into the graph (@kevinschaper... they must be loaded separately, somewhere?)

kevinschaper commented 2 years ago

I removed the gene node creation (last week I think?). For OMIM genes, we really want to have an edge-only ingest here and wire the edges to HGNC (and MONDO). I'm less sure about what we need to do to capture the nodes that were being captured as NucleicAcidEntity / heritable_phenotypic_marker, but we want real nodes for them with more than just an ID, which ideally would happen in its own ingest, assuming that we won't have other nodes for them that we can map to.

RichardBruskiewich commented 2 years ago

We can easily ensure that the OMIM ingest only outputs Gene subjects. On the HGNC mappings, OMIM mim2gene only provides the HGNC gene symbols. Do we have a map of HGNC gene symbol to HGNC ID available anywhere?

RichardBruskiewich commented 2 years ago

@putmantime, @kevinschaper thinks that this issue is a red herring. I've reassigned it to you and him for (re-)discussion and closure and/or clarification. I'll also put it in the 'Icebox' of ZenHub tracking.

kevinschaper commented 2 years ago

My best guess is that this is resolved by making OMIM an edge-only ingest. I think we likely need to add a node ingest to pick up everything that can't be mapped to HGNC, but we can take care of that in #255.

kevinschaper commented 1 year ago

This same problem manifested itself when using our new mapping strategy as well, but this case is safe to close.