Review kg-phenio ID prefix subset

kevinschaper commented 1 year ago

We're filtering out nodes and edges from kg-phenio by prefix currently:

https://github.com/monarch-initiative/monarch-ingest/blob/40020ec11b892d929632aa1b98b529b87511bd32/src/monarch_ingest/cli_utils.py#LL158C1-L161C1

    prefixes = ["MONDO", "OMIM", "HP", "ZP", "MP", "CHEBI", "FBbt",
                "FYPO", "WBPhenotype", "GO", "MESH", "XPO",
                "ZFA", "UBERON", "WBbt", "ORPHA", "EMAPA"]

I created this list initially to include ontology prefixes that we use in ingests. I realized recently that we don't have PATO terms in the graph, which makes me think that my list is filtering too aggressively and that there are probably ontologies providing connections within phenio that are important to have in the graph.
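For illustration, here is a minimal sketch of how an opt-in prefix filter of this kind drops PATO. The helper names (`curie_prefix`, `filter_opt_in`) and the dict-based node/edge shape are assumptions made for the example, not the actual code in `cli_utils.py`, and the edges are toy data.

```python
# Illustrative only: helper names and the dict-based node/edge shape are
# assumptions, not the actual monarch-ingest implementation.

def curie_prefix(curie):
    """Return the text before the first colon of a CURIE such as 'HP:0000118'."""
    return curie.split(":", 1)[0]


def filter_opt_in(nodes, edges, prefixes):
    """Keep nodes whose prefix is listed, and edges whose two ends are both kept."""
    allowed = set(prefixes)
    kept_nodes = [n for n in nodes if curie_prefix(n["id"]) in allowed]
    kept_edges = [
        e for e in edges
        if curie_prefix(e["subject"]) in allowed
        and curie_prefix(e["object"]) in allowed
    ]
    return kept_nodes, kept_edges


# Toy graph: one edge points at a PATO term, which is not on the include list.
nodes = [{"id": "HP:0000118"}, {"id": "PATO:0000389"}, {"id": "MONDO:0000001"}]
edges = [
    {"subject": "MONDO:0000001", "predicate": "biolink:related_to", "object": "PATO:0000389"},
    {"subject": "MONDO:0000001", "predicate": "biolink:related_to", "object": "HP:0000118"},
]

kept_nodes, kept_edges = filter_opt_in(nodes, edges, ["MONDO", "HP"])
print([n["id"] for n in kept_nodes])  # the PATO node is gone
print(len(kept_edges))                # and so is the edge that touched it
```

With an opt-in list, any prefix that nobody thought to add disappears silently, together with every edge that touches it.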

Do you have feedback @cmungall @matentzn?

caufieldjh commented 1 year ago

It may not make an immense impact - I see 1,661 edges in kg-phenio involving PATO which aren't subclass_of or those redundant category edges. That set includes 504 different PATO terms, most of them participating in related_to edges with UPHENO, UBERON, FBbt and MONDO. For many (most? nearly all?) PATO terms, like PATO:0000389 (acute) or PATO:0000634 (unilateral), it's not enabling a path to exist between these or other ontologies, it's just acting more like a qualifier.
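A count like this can be sketched roughly as follows, assuming edges are available as dicts with `subject`, `predicate` and `object` keys. The function name, the predicate strings and the toy edges are assumptions for the example; they are not taken from kg-phenio itself.

```python
from collections import Counter


def curie_prefix(curie):
    """Return the text before the first colon of a CURIE."""
    return curie.split(":", 1)[0]


def summarize_prefix_edges(edges, prefix, skip_predicates):
    """Count edges touching `prefix`, ignoring edges with a skipped predicate.

    Returns (number of edges, set of distinct terms with that prefix,
    Counter of the prefixes found on the other end of those edges).
    """
    n_edges = 0
    terms = set()
    partners = Counter()
    for e in edges:
        if e["predicate"] in skip_predicates:
            continue
        ends = (e["subject"], e["object"])
        hits = [c for c in ends if curie_prefix(c) == prefix]
        if not hits:
            continue
        n_edges += 1
        terms.update(hits)
        partners.update(curie_prefix(c) for c in ends if curie_prefix(c) != prefix)
    return n_edges, terms, partners


# Toy edges; the real predicate names in kg-phenio may differ.
edges = [
    {"subject": "PATO:0000389", "predicate": "biolink:subclass_of", "object": "PATO:0000001"},
    {"subject": "UPHENO:0000001", "predicate": "biolink:related_to", "object": "PATO:0000389"},
    {"subject": "MONDO:0000001", "predicate": "biolink:related_to", "object": "PATO:0000634"},
    {"subject": "MONDO:0000001", "predicate": "biolink:related_to", "object": "PATO:0000389"},
]

n_edges, terms, partners = summarize_prefix_edges(
    edges, "PATO", skip_predicates={"biolink:subclass_of"}
)
print(n_edges, len(terms), dict(partners))
```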

matentzn commented 1 year ago

@kevinschaper - In order to tell you more, I need to know why you subset at all. There are a number of problems with this, including:

Many species-specific vocabs are missing (XAO, FBcv, DPO)
Maintaining a hard-coded list of prefixes creates another point of failure that needs to be maintained if new sources are added (instead, create a phenio-monarch version that only contains the terms you want)
Maintaining closure and links across branches when you "remove" links from a KG. This is a highly complex issue.

In general, I would recommend requesting a Phenio subset on the phenio issue tracker, describing the characteristics you need and why, and integrating that instead of the whole thing.

kevinschaper commented 1 year ago

@matentzn That makes a lot of sense. I think the reason that I did this filtering initially was that I had errors because there were gene nodes coming in, and when I looked further it seemed like an opt-in list was more practical than an opt-out list.

If I remove filtering, here are the prefixes with counts:

36602 ZP
25990 MONDO
20061 XPO
19918 FBbt
19632 UPHENO
16954 HP
15719 UBERON
14049 MP
10458 GO
8745 EMAPA
7556 WBbt
6506 FMA
5158 OBA
3683 CHEBI
3382 HGNC
3218 ZFA
3072 MA
2695 WBPhenotype
1634 NCBITaxon
1622 CL
1605 XAO
 763 PR
 666 biolink
 597 PATO
 501 RO
 299 NBO
 237 FlyBase
 232 HSAPDV
 203 CHR
 115 FAO
 103 STY
 103 OBO
  80 BSPO
  61 IAO
  60 PO
  52 MPATH
  40 https
  39 SO
  38 ZFS
  35 ECO
  33 STATO
  33 OBI
  31 WD_Entity
  31 UBPROP
  31 NCIT
  30 BFO
  28 TS
  27 SIO
  21 ENVO
  17 ECTO
  10 http
  10 OIO
   9 LINKML
   9 CARO
   7 dc
   6 MFOMD
   6 GENO
   4 foaf
   4 MF
   3 rdfs
   3 owl
   3 UMLS
   3 NIF.EXT
   2 dcterms
   2 PROV
   2 OMO
   2 OGMS
   2 MESH
   2 MAXO
   1 dctypes
   1 dcat
   1 WBLS
   1 WBBT
   1 TO
   1 SNOMEDCT
   1 SEPIO
   1 RNORDV
   1 PW
   1 PHENIO
   1 PCO
   1 Orphanet
   1 NLX.SUB
   1 NLX.OEN
   1 NIF.STD
   1 FYPO
   1 FOODON
   1 FBcv
   1 DOID
   1 CLO
   1 CIO
   1 APO
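A tally of this shape can be reproduced with a few lines of Python. This is a sketch under the assumption that the prefix is simply whatever precedes the first colon in the node ID; under that rule, full URIs used as IDs would show up as `http` or `https` rows. The function name and sample IDs are made up for the example.

```python
from collections import Counter


def prefix_counts(node_ids):
    """Count node IDs by the text before their first colon."""
    return Counter(node_id.split(":", 1)[0] for node_id in node_ids)


# Toy IDs, including a URI and an rdfs property used as a node ID.
node_ids = [
    "ZP:0000001",
    "ZP:0000002",
    "MONDO:0000001",
    "https://example.org/term",
    "rdfs:label",
]

counts = prefix_counts(node_ids)

# Right-align the count, highest first, similar to `sort | uniq -c | sort -rn`.
lines = [f"{count:>5} {prefix}" for prefix, count in counts.most_common()]
print("\n".join(lines))
```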

I'm definitely getting genes from HGNC, FlyBase (used to be MGI, but they look like they're gone now). I'm getting nodes for biolink classes, predicates and even enum permissible values. Should I have 1 Orphanet ID? 2 MESH IDs? 6 GENO IDs?

Do I want singleton nodes for rdfs:isDefinedBy, rdfs:label, rdfs:seeAlso?

I don't feel confident in my strategy from either direction, but I think initially opt-in was just less of a time sink.

kevinschaper commented 1 year ago

I just noticed that I have these LinkML nodes (also singletons):

LINKML:Boolean  biolink:NamedThing
LINKML:Date     biolink:NamedThing
LINKML:Double   biolink:NamedThing
LINKML:Float    biolink:NamedThing
LINKML:Integer  biolink:NamedThing
LINKML:String   biolink:NamedThing
LINKML:Time     biolink:NamedThing
LINKML:Uriorcurie       biolink:NamedThing
LINKML:mixin    biolink:NamedThing
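Singletons like these can be listed by checking which node IDs never appear as the subject or object of an edge. The sketch below assumes dict-shaped nodes and edges and uses a hypothetical function name and toy data.

```python
def find_singletons(nodes, edges):
    """Return IDs of nodes that take part in no edge at all."""
    connected = set()
    for e in edges:
        connected.add(e["subject"])
        connected.add(e["object"])
    return [n["id"] for n in nodes if n["id"] not in connected]


# Toy graph: two connected ontology terms and two unconnected nodes.
nodes = [
    {"id": "LINKML:Boolean"},
    {"id": "rdfs:label"},
    {"id": "MONDO:0000001"},
    {"id": "HP:0000118"},
]
edges = [
    {"subject": "MONDO:0000001", "predicate": "biolink:related_to", "object": "HP:0000118"},
]

singletons = find_singletons(nodes, edges)
print(singletons)
```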

caufieldjh commented 1 year ago

Oh, those are definitely coming in through the Biolink merge, since BL imports them. They can be omitted during the Phenio build.

matentzn commented 1 year ago

@kevinschaper what you are observing can be answered only on the phenio level - any weird ID can come in through imports, we do not really control the prefix space in PHENIO at all. ODK does come with a way though to drop specific prefixes from the pipeline!

kevinschaper commented 1 year ago

I'm doing an interactive kg build and started looking at phenio filtering. I'm going to reverse my include list and go to this very short exclude list:

    exclude_prefixes = [
        "HGNC",
        "FlyBase",
        "http",
        "biolink"
    ]

I'll also write out a file in qc for what was excluded by prefix.
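A rough sketch of what an opt-out filter with a QC report could look like is below. The function names, the report columns and the toy data are assumptions for illustration; the example writes to an in-memory buffer where a real run would open a file under the qc directory.

```python
import io
from collections import Counter

exclude_prefixes = ["HGNC", "FlyBase", "http", "biolink"]


def curie_prefix(curie):
    """Return the text before the first colon of a CURIE."""
    return curie.split(":", 1)[0]


def filter_opt_out(nodes, edges, exclude):
    """Drop nodes and edges that use an excluded prefix and tally what was dropped."""
    excluded = set(exclude)
    dropped = Counter()
    kept_nodes = []
    for n in nodes:
        prefix = curie_prefix(n["id"])
        if prefix in excluded:
            dropped[("node", prefix)] += 1
        else:
            kept_nodes.append(n)
    kept_edges = []
    for e in edges:
        bad = {curie_prefix(e[key]) for key in ("subject", "object")} & excluded
        if bad:
            for prefix in sorted(bad):
                dropped[("edge", prefix)] += 1
        else:
            kept_edges.append(e)
    return kept_nodes, kept_edges, dropped


def write_exclusion_report(dropped, handle):
    """Write one tab-separated row per (kind, prefix) to an open text handle."""
    handle.write("kind\tprefix\tcount\n")
    for (kind, prefix), count in sorted(dropped.items()):
        handle.write(f"{kind}\t{prefix}\t{count}\n")


# Toy data only.
nodes = [{"id": "MONDO:0000001"}, {"id": "HP:0000118"},
         {"id": "HGNC:1100"}, {"id": "biolink:Gene"}]
edges = [
    {"subject": "MONDO:0000001", "predicate": "biolink:related_to", "object": "HGNC:1100"},
    {"subject": "MONDO:0000001", "predicate": "biolink:related_to", "object": "HP:0000118"},
]

kept_nodes, kept_edges, dropped = filter_opt_out(nodes, edges, exclude_prefixes)
report = io.StringIO()  # stand-in for a file opened under the qc directory
write_exclusion_report(dropped, report)
print(report.getvalue())
```

Counting dropped edges as well as dropped nodes makes it visible when an excluded prefix takes cross-ontology links with it. Note that if prefixes are derived by splitting on the first colon, as assumed here, the entry "http" would not match IDs whose prefix is "https".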

matentzn commented 1 year ago

Why are you excluding HGNC, for example? Are you not losing some Mondo->HGNC links this way?

monarch-initiative / monarch-ingest

Review kg-phenio ID prefix subset #468