Open kevinschaper opened 1 year ago
It may not make an immense impact - I see 1,661 edges in kg-phenio involving PATO which aren't subclass_of or those redundant category edges. That set includes 504 different PATO terms, most of them participating in related_to edges with UPHENO, UBERON, FBbt and MONDO. For many (most? nearly all?) PATO terms, like PATO:0000389 (acute) or PATO:0000634 (unilateral), it's not enabling a path to exist between these or other ontologies, it's just acting more like a qualifier.
@kevinschaper - In order to tell you more, I need to know why you subset at all. There are a number of problems with this, including
In general, I would recommend requesting a Phenio subset on the phenio issue tracker, describing the characteristics you need and why, and integrating that instead of the whole thing.
@matentzn That makes a lot of sense. I think the reason that I did this filtering initially was that I had errors because there were gene nodes coming in, and when I looked further it seemed like an opt-in list was more practical than an opt-out list.
If I remove filtering, here are the prefixes with node counts:
36602 ZP
25990 MONDO
20061 XPO
19918 FBbt
19632 UPHENO
16954 HP
15719 UBERON
14049 MP
10458 GO
8745 EMAPA
7556 WBbt
6506 FMA
5158 OBA
3683 CHEBI
3382 HGNC
3218 ZFA
3072 MA
2695 WBPhenotype
1634 NCBITaxon
1622 CL
1605 XAO
763 PR
666 biolink
597 PATO
501 RO
299 NBO
237 FlyBase
232 HSAPDV
203 CHR
115 FAO
103 STY
103 OBO
80 BSPO
61 IAO
60 PO
52 MPATH
40 https
39 SO
38 ZFS
35 ECO
33 STATO
33 OBI
31 WD_Entity
31 UBPROP
31 NCIT
30 BFO
28 TS
27 SIO
21 ENVO
17 ECTO
10 http
10 OIO
9 LINKML
9 CARO
7 dc
6 MFOMD
6 GENO
4 foaf
4 MF
3 rdfs
3 owl
3 UMLS
3 NIF.EXT
2 dcterms
2 PROV
2 OMO
2 OGMS
2 MESH
2 MAXO
1 dctypes
1 dcat
1 WBLS
1 WBBT
1 TO
1 SNOMEDCT
1 SEPIO
1 RNORDV
1 PW
1 PHENIO
1 PCO
1 Orphanet
1 NLX.SUB
1 NLX.OEN
1 NIF.STD
1 FYPO
1 FOODON
1 FBcv
1 DOID
1 CLO
1 CIO
1 APO
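For reference, a tally like the one above can be produced with a few lines of Python. This is only a minimal sketch, assuming the prefix is whatever precedes the first colon in a node ID (which would also explain the `http` and `https` rows); `prefix_of`, `count_prefixes`, and the sample IDs are illustrative names and data, not taken from monarch-ingest.

```python
from collections import Counter


def prefix_of(node_id: str) -> str:
    """Return the text before the first colon (hypothetical helper)."""
    return node_id.split(":", 1)[0]


def count_prefixes(node_ids):
    """Count how many node IDs fall under each prefix."""
    return Counter(prefix_of(node_id) for node_id in node_ids)


# Illustrative IDs only; a real run would read the id column of the nodes file.
sample_ids = [
    "PATO:0000389",
    "PATO:0000634",
    "HGNC:5",
    "rdfs:label",
    "https://example.org/some/iri",
]

prefix_counts = count_prefixes(sample_ids)

# Print in the same "count prefix" layout as the listing above.
for prefix, count in prefix_counts.most_common():
    print(f"{count} {prefix}")
```

Full IRIs that were never contracted to CURIEs end up under `http` or `https` with this approach, so those rows are worth inspecting separately.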
I'm definitely getting genes from HGNC, FlyBase (used to be MGI, but they look like they're gone now). I'm getting nodes for biolink classes, predicates and even enum permissible values. Should I have 1 Orphanet ID? 2 MESH IDs? 6 GENO IDs?
Do I want singleton nodes for rdfs:isDefinedBy, rdfs:label, rdfs:seeAlso?
I don't feel confident in my strategy from either direction, but I think initially opt-in was just less of a time sink.
I just noticed that I have these LinkML nodes (also singletons):
LINKML:Boolean biolink:NamedThing
LINKML:Date biolink:NamedThing
LINKML:Double biolink:NamedThing
LINKML:Float biolink:NamedThing
LINKML:Integer biolink:NamedThing
LINKML:String biolink:NamedThing
LINKML:Time biolink:NamedThing
LINKML:Uriorcurie biolink:NamedThing
LINKML:mixin biolink:NamedThing
Oh, those are definitely coming in through the Biolink merge, since BL imports them. They can be omitted during the Phenio build.
@kevinschaper what you are observing can be answered only on the phenio level - any weird ID can come in through imports, we do not really control the prefix space in PHENIO at all. ODK does come with a way though to drop specific prefixes from the pipeline!
I'm doing an interactive kg build and started looking at phenio filtering. I'm going to reverse my include list and switch to this very short exclude list:
exclude_prefixes = [
    "HGNC",
    "FlyBase",
    "http",
    "biolink",
]
I'll also write out a file in qc for what was excluded by prefix.
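Roughly, the exclude-list filtering plus QC output could look like the sketch below. It is not the actual monarch-ingest code: the function names, the dict-based node and edge records, and the sample data are all assumptions made for illustration, and the report is written to an in-memory buffer rather than a real file under qc.

```python
import csv
import io
from collections import Counter

exclude_prefixes = ["HGNC", "FlyBase", "http", "biolink"]


def prefix_of(node_id):
    """Return the text before the first colon (hypothetical helper)."""
    return node_id.split(":", 1)[0]


def filter_by_prefix(nodes, edges, excluded):
    """Drop nodes with an excluded prefix and any edge touching one."""
    excluded = set(excluded)
    kept_nodes = [n for n in nodes if prefix_of(n["id"]) not in excluded]
    dropped = Counter(
        prefix_of(n["id"]) for n in nodes if prefix_of(n["id"]) in excluded
    )
    kept_edges = [
        e
        for e in edges
        if prefix_of(e["subject"]) not in excluded
        and prefix_of(e["object"]) not in excluded
    ]
    return kept_nodes, kept_edges, dropped


def write_qc_report(dropped, handle):
    """Write one tab-separated row per excluded prefix."""
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(["prefix", "excluded_nodes"])
    for prefix, count in dropped.most_common():
        writer.writerow([prefix, count])


# Illustrative records only, not real kg-phenio content.
nodes = [
    {"id": "MONDO:0000001"},
    {"id": "PATO:0000389"},
    {"id": "HGNC:5"},
    {"id": "biolink:NamedThing"},
]
edges = [
    {"subject": "MONDO:0000001", "predicate": "biolink:related_to", "object": "PATO:0000389"},
    {"subject": "MONDO:0000001", "predicate": "biolink:related_to", "object": "HGNC:5"},
]

kept_nodes, kept_edges, dropped = filter_by_prefix(nodes, edges, exclude_prefixes)

report = io.StringIO()  # stand-in for a file in the qc directory
write_qc_report(dropped, report)
print(report.getvalue())
```

Two things stand out in this sketch: the match is exact, so excluding `http` does not remove IDs whose prefix is `https`, and any edge pointing at an excluded node is dropped along with it.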
Why are you excluding HGNC, for example? Are you not losing some Mondo->HGNC links this way?
We're filtering out nodes and edges from kg-phenio by prefix currently:
https://github.com/monarch-initiative/monarch-ingest/blob/40020ec11b892d929632aa1b98b529b87511bd32/src/monarch_ingest/cli_utils.py#LL158C1-L161C1
I created this list initially to include the ontology prefixes that we use in ingests. I realized recently that we don't have PATO terms in the graph, which makes me think that my list is filtering too aggressively and that there are probably ontologies providing connections within phenio that are important to have in the graph.
Do you have feedback @cmungall @matentzn?