monarch-initiative / monarch-app

Monarch Initiative website and API
https://monarchinitiative.org/
BSD 3-Clause "New" or "Revised" License

Filter phenio import by category #550

Open kevinschaper opened 5 months ago

kevinschaper commented 5 months ago

While checking on an older issue, I found that we're getting "gene" nodes from phenio:

| category | in_taxon | id | provided_by |
|---|---|---|---|
| biolink:Gene | | SIO:010035 | phenio_nodes |
| biolink:Gene | | DATACOMMONS:Gene | phenio_nodes |

We likely want an include or exclude filter by category, applied either to the contents of kg-phenio itself or to which parts of kg-phenio are imported into monarch-kg.
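
As a rough illustration of the import-side option, here is a minimal sketch of a category include/exclude filter over KGX-style node rows. The function name `filter_nodes`, the `EXCLUDED_CATEGORIES` set, and the sample rows are all hypothetical; this is not how monarch-ingest or kg-phenio currently handle it.

```python
import csv
import io

# Hypothetical exclude list; the real policy would be decided by the maintainers.
EXCLUDED_CATEGORIES = {"biolink:Gene"}


def filter_nodes(rows, include=None, exclude=None):
    """Yield node rows whose category passes the include/exclude filter.

    include: if given, only rows with these categories are kept.
    exclude: if given, rows with these categories are dropped.
    """
    for row in rows:
        category = row["category"]
        if include is not None and category not in include:
            continue
        if exclude is not None and category in exclude:
            continue
        yield row


# Small in-memory stand-in for a nodes TSV (the third row is made up for illustration).
sample_tsv = (
    "id\tcategory\tprovided_by\n"
    "SIO:010035\tbiolink:Gene\tphenio_nodes\n"
    "DATACOMMONS:Gene\tbiolink:Gene\tphenio_nodes\n"
    "MONDO:0000001\tbiolink:Disease\tphenio_nodes\n"
)
rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))

kept_ids = [row["id"] for row in filter_nodes(rows, exclude=EXCLUDED_CATEGORIES)]
print(kept_ids)
```

An include list is the stricter choice, since any category that newly appears upstream is dropped until someone adds it; an exclude list keeps current behavior and only removes the known-bad categories.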

kevinschaper commented 5 months ago

Purely looking at category counts:

sqlite3 -markdown monarch-kg.db "select category, count(*) from nodes where provided_by = 'phenio_nodes' group by 1 order by 2 desc"
| category | count(*) |
|---|---|
| biolink:PhenotypicFeature | 123272 |
| biolink:BiologicalProcessOrActivity | 38774 |
| biolink:GrossAnatomicalStructure | 28388 |
| biolink:Disease | 27697 |
| biolink:Cell | 18762 |
| biolink:NamedThing | 18613 |
| biolink:AnatomicalEntity | 11665 |
| biolink:CellularComponent | 5283 |
| biolink:MolecularEntity | 4582 |
| biolink:BiologicalProcess | 3133 |
| biolink:MacromolecularComplex | 2131 |
| biolink:MolecularActivity | 1311 |
| biolink:Protein | 1117 |
| biolink:CellularOrganism | 955 |
| biolink:PhenotypicQuality | 744 |
| biolink:Pathway | 662 |
| biolink:Vertebrate | 391 |
| biolink:Virus | 320 |
| biolink:BehavioralFeature | 297 |
| biolink:LifeStage | 238 |
| biolink:PathologicalProcess | 231 |
| biolink:ChemicalEntity | 221 |
| biolink:Drug | 101 |
| biolink:OrganismTaxon | 93 |
| biolink:SmallMolecule | 70 |
| biolink:SequenceVariant | 70 |
| biolink:InformationContentEntity | 23 |
| biolink:NucleicAcidEntity | 18 |
| biolink:EvidenceType | 16 |
| biolink:GeographicExposure | 14 |
| biolink:RNAProduct | 12 |
| biolink:Transcript | 6 |
| biolink:Plant | 4 |
| biolink:Fungus | 4 |
| biolink:ProteinFamily | 3 |
| biolink:PopulationOfIndividualOrganisms | 3 |
| biolink:Invertebrate | 3 |
| biolink:Dataset | 3 |
| biolink:WebPage | 2 |
| biolink:Treatment | 2 |
| biolink:Study | 2 |
| biolink:RegulatoryRegion | 2 |
| biolink:Publication | 2 |
| biolink:ProteinDomain | 2 |
| biolink:Patent | 2 |
| biolink:MicroRNA | 2 |
| biolink:MaterialSample | 2 |
| biolink:Mammal | 2 |
| biolink:IndividualOrganism | 2 |
| biolink:Human | 2 |
| biolink:Haplotype | 2 |
| biolink:Genotype | 2 |
| biolink:Genome | 2 |
| biolink:GeneticInheritance | 2 |
| biolink:Gene | 2 |
| biolink:Exon | 2 |
| biolink:EnvironmentalFeature | 2 |
| biolink:ConfidenceLevel | 2 |
| biolink:ChemicalExposure | 2 |
| biolink:Agent | 2 |
| biolink:Activity | 2 |
| biolink:Zygosity | 1 |
| biolink:TranscriptionFactorBindingSite | 1 |
| biolink:StudyVariable | 1 |
| biolink:Snv | 1 |
| biolink:SiRNA | 1 |
| biolink:ReagentTargetedGene | 1 |
| biolink:ProcessedMaterial | 1 |
| biolink:Procedure | 1 |
| biolink:Polypeptide | 1 |
| biolink:PhenotypicSex | 1 |
| biolink:OrganismalEntity | 1 |
| biolink:NoncodingRNAProduct | 1 |
| biolink:GenotypicSex | 1 |
| biolink:Event | 1 |
| biolink:EnvironmentalProcess | 1 |
| biolink:DrugExposure | 1 |
| biolink:DiagnosticAid | 1 |
| biolink:DatasetDistribution | 1 |
| biolink:CodingSequence | 1 |
| biolink:ChemicalMixture | 1 |
| biolink:CellLine | 1 |
| biolink:BiologicalSex | 1 |
| biolink:BiologicalEntity | 1 |
| biolink:Bacterium | 1 |
| biolink:Attribute | 1 |
| biolink:Article | 1 |
| biolink:AccessibleDnaRegion | 1 |

kevinschaper commented 5 months ago

Breaking it down further by ID prefix:

| category | prefix | count(*) |
|---|---|---|
| biolink:PhenotypicFeature | ZP | 39373 |
| biolink:BiologicalProcessOrActivity | GO | 38059 |
| biolink:Disease | MONDO | 27691 |
| biolink:PhenotypicFeature | UPHENO | 21361 |
| biolink:PhenotypicFeature | XPO | 20340 |
| biolink:PhenotypicFeature | HP | 17881 |
| biolink:PhenotypicFeature | MP | 13796 |
| biolink:Cell | FBbt | 12488 |
| biolink:GrossAnatomicalStructure | UBERON | 10333 |
| biolink:PhenotypicFeature | FYPO | 7880 |
| biolink:NamedThing | OBA | 5984 |
| biolink:AnatomicalEntity | UBERON | 5444 |
| biolink:GrossAnatomicalStructure | EMAPA | 4639 |
| biolink:MolecularEntity | CHEBI | 4581 |
| biolink:GrossAnatomicalStructure | FMA | 4242 |
| biolink:GrossAnatomicalStructure | FBbt | 4198 |
| biolink:NamedThing | GO | 3646 |
| biolink:Cell | WBbt | 3388 |
| biolink:BiologicalProcess | GO | 3132 |
| biolink:NamedThing | EMAPA | 2992 |
| biolink:CellularComponent | WBbt | 2908 |
| biolink:PhenotypicFeature | WBPhenotype | 2636 |
| biolink:GrossAnatomicalStructure | MA | 2449 |
| biolink:CellularComponent | GO | 2368 |
| biolink:AnatomicalEntity | FBbt | 2311 |
| biolink:GrossAnatomicalStructure | ZFA | 2201 |
| biolink:MacromolecularComplex | GO | 2127 |
| biolink:AnatomicalEntity | FMA | 1769 |
| biolink:Cell | CL | 1761 |
| biolink:NamedThing | XAO | 1611 |
| biolink:MolecularActivity | GO | 1311 |
| biolink:AnatomicalEntity | EMAPA | 1129 |
| biolink:NamedThing | FBbt | 1113 |
| biolink:Protein | PR | 1095 |
| biolink:CellularOrganism | NCBITaxon | 955 |
| biolink:NamedThing | WBbt | 793 |
| biolink:Pathway | GO | 660 |
| biolink:PhenotypicQuality | PATO | 655 |
| biolink:Cell | ZFA | 641 |
| biolink:BiologicalProcessOrActivity | NBO | 635 |
| biolink:AnatomicalEntity | MA | 611 |
| biolink:NamedThing | RO | 529 |
| biolink:Cell | FMA | 474 |
| biolink:NamedThing | MP | 448 |
| biolink:Vertebrate | NCBITaxon | 390 |
| biolink:NamedThing | FYPO | 329 |
| biolink:GrossAnatomicalStructure | WBbt | 326 |
| biolink:Virus | NCBITaxon | 320 |
| biolink:BehavioralFeature | NBO | 297 |
| biolink:AnatomicalEntity | ZFA | 257 |
| biolink:LifeStage | HSAPDV | 238 |
| biolink:PathologicalProcess | MPATH | 228 |
| biolink:ChemicalEntity | CHEBI | 220 |
| biolink:NamedThing | CHR | 203 |
| biolink:AnatomicalEntity | WBbt | 140 |
| biolink:NamedThing | ZFA | 118 |
| biolink:NamedThing | FAO | 115 |
| biolink:Drug | CHEBI | 100 |
| biolink:NamedThing | OBO | 94 |
| biolink:OrganismTaxon | NCBITaxon | 92 |
| biolink:NamedThing | BSPO | 75 |
| biolink:NamedThing | IAO | 70 |
| biolink:SequenceVariant | SO | 70 |
| biolink:SmallMolecule | CHEBI | 70 |
| biolink:NamedThing | WBPhenotype | 62 |
| biolink:BiologicalProcessOrActivity | UBERON | 60 |
| biolink:NamedThing | PO | 58 |
| biolink:NamedThing | MPATH | 52 |
| biolink:PhenotypicQuality | CHEBI | 47 |
| biolink:PhenotypicQuality | MONDO | 41 |
| biolink:NamedThing | ZFS | 38 |
| biolink:NamedThing | UBPROP | 31 |
| biolink:NamedThing | BFO | 29 |
| biolink:NamedThing | TS | 28 |
| biolink:NamedThing | HSAPDV | 21 |
| biolink:NamedThing | NCIT | 21 |
| biolink:Protein | CHEBI | 21 |
| biolink:NamedThing | ENVO | 20 |
| biolink:NamedThing | UPHENO | 20 |
| biolink:NucleicAcidEntity | SO | 18 |
| biolink:EvidenceType | ECO | 16 |
| biolink:NamedThing | FMA | 16 |
| biolink:InformationContentEntity | ECO | 15 |
| biolink:NamedThing | ECTO | 15 |
| biolink:GeographicExposure | DATACOMMONS | 14 |
| biolink:NamedThing | OMO | 13 |
| biolink:NamedThing | OIO | 12 |
| biolink:BiologicalProcessOrActivity | BFO | 9 |
| biolink:NamedThing | LINKML | 9 |
| biolink:RNAProduct | SO | 9 |
| biolink:InformationContentEntity | IAO | 8 |
| biolink:NamedThing | OBI | 8 |
| biolink:BiologicalProcessOrActivity | PO | 7 |
| biolink:CellularComponent | CL | 6 |
| biolink:NamedThing | MFOMD | 6 |
| biolink:Cell | EMAPA | 5 |
| biolink:NamedThing | ECO | 4 |
| biolink:NamedThing | MF | 4 |
| biolink:NamedThing | dc | 4 |
| biolink:Transcript | SO | 4 |
| biolink:BiologicalProcessOrActivity | FMA | 3 |
| biolink:Cell | MA | 3 |
| biolink:MacromolecularComplex | PR | 3 |
| biolink:NamedThing | CARO | 3 |
| biolink:NamedThing | owl | 3 |
| biolink:RNAProduct | CHEBI | 3 |
| biolink:EnvironmentalFeature | ENVO | 2 |
| biolink:Fungus | FOODON | 2 |
| biolink:NamedThing | GOREL | 2 |
| biolink:NamedThing | MAXO | 2 |
| biolink:NamedThing | NBO | 2 |
| biolink:NamedThing | rdfs | 2 |
| biolink:PathologicalProcess | NCIT | 2 |
| biolink:Plant | NCIT | 2 |
| biolink:Plant | PO | 2 |
| biolink:ProteinFamily | NCIT | 2 |
| biolink:Publication | IAO | 2 |
| biolink:AccessibleDnaRegion | SO | 1 |
| biolink:Activity | NCIT | 1 |
| biolink:Activity | PROV | 1 |
| biolink:Agent | PROV | 1 |
| biolink:Agent | dcterms | 1 |
| biolink:AnatomicalEntity | CARO | 1 |
| biolink:AnatomicalEntity | NCIT | 1 |
| biolink:AnatomicalEntity | SIO | 1 |
| biolink:AnatomicalEntity | XAO | 1 |
| biolink:Article | SIO | 1 |
| biolink:Attribute | SIO | 1 |
| biolink:Bacterium | NCBITaxon | 1 |
| biolink:BiologicalEntity | SIO | 1 |
| biolink:BiologicalProcess | SIO | 1 |
| biolink:BiologicalProcessOrActivity | RO | 1 |
| biolink:BiologicalSex | PATO | 1 |
| biolink:Cell | GO | 1 |
| biolink:Cell | SIO | 1 |
| biolink:CellLine | CLO | 1 |
| biolink:CellularComponent | SIO | 1 |
| biolink:ChemicalEntity | SIO | 1 |
| biolink:ChemicalExposure | ECTO | 1 |
| biolink:ChemicalExposure | SIO | 1 |
| biolink:ChemicalMixture | NCIT | 1 |
| biolink:CodingSequence | SIO | 1 |
| biolink:ConfidenceLevel | CIO | 1 |
| biolink:ConfidenceLevel | SEPIO | 1 |
| biolink:Dataset | DATACOMMONS | 1 |
| biolink:Dataset | IAO | 1 |
| biolink:Dataset | dctypes | 1 |
| biolink:DatasetDistribution | dcat | 1 |
| biolink:DiagnosticAid | SNOMED | 1 |
| biolink:Disease | DATACOMMONS | 1 |
| biolink:Disease | DOID | 1 |
| biolink:Disease | NCIT | 1 |
| biolink:Disease | Orphanet | 1 |
| biolink:Disease | SIO | 1 |
| biolink:Disease | UMLS | 1 |
| biolink:Drug | DATACOMMONS | 1 |
| biolink:DrugExposure | ECTO | 1 |
| biolink:EnvironmentalProcess | ENVO | 1 |
| biolink:Event | NCIT | 1 |
| biolink:Exon | SIO | 1 |
| biolink:Exon | SO | 1 |
| biolink:Fungus | NCBITaxon | 1 |
| biolink:Fungus | NCIT | 1 |
| biolink:Gene | DATACOMMONS | 1 |
| biolink:Gene | SIO | 1 |
| biolink:GeneticInheritance | GENO | 1 |
| biolink:GeneticInheritance | NCIT | 1 |
| biolink:Genome | SIO | 1 |
| biolink:Genome | SO | 1 |
| biolink:Genotype | GENO | 1 |
| biolink:Genotype | SIO | 1 |
| biolink:GenotypicSex | PATO | 1 |
| biolink:Haplotype | GENO | 1 |
| biolink:Haplotype | SO | 1 |
| biolink:Human | NCIT | 1 |
| biolink:Human | SIO | 1 |
| biolink:IndividualOrganism | SIO | 1 |
| biolink:IndividualOrganism | foaf | 1 |
| biolink:Invertebrate | FOODON | 1 |
| biolink:Invertebrate | NCIT | 1 |
| biolink:Invertebrate | OMIT | 1 |
| biolink:MacromolecularComplex | CL | 1 |
| biolink:Mammal | FOODON | 1 |
| biolink:Mammal | NCIT | 1 |
| biolink:MaterialSample | OBI | 1 |
| biolink:MaterialSample | SIO | 1 |
| biolink:MicroRNA | SIO | 1 |
| biolink:MicroRNA | SO | 1 |
| biolink:MolecularEntity | PR | 1 |
| biolink:NamedThing | BTO | 1 |
| biolink:NamedThing | DATACOMMONS | 1 |
| biolink:NamedThing | MOD | 1 |
| biolink:NamedThing | OGMS | 1 |
| biolink:NamedThing | PHENIO | 1 |
| biolink:NamedThing | RNORDV | 1 |
| biolink:NamedThing | dcterms | 1 |
| biolink:NamedThing | foaf | 1 |
| biolink:NoncodingRNAProduct | SIO | 1 |
| biolink:OrganismTaxon | DATACOMMONS | 1 |
| biolink:OrganismalEntity | CARO | 1 |
| biolink:Patent | IAO | 1 |
| biolink:Patent | SIO | 1 |
| biolink:PathologicalProcess | OBI | 1 |
| biolink:Pathway | PW | 1 |
| biolink:Pathway | SIO | 1 |
| biolink:PhenotypicFeature | APO | 1 |
| biolink:PhenotypicFeature | FBcv | 1 |
| biolink:PhenotypicFeature | NCIT | 1 |
| biolink:PhenotypicFeature | SIO | 1 |
| biolink:PhenotypicFeature | TO | 1 |
| biolink:PhenotypicQuality | BFO | 1 |
| biolink:PhenotypicSex | PATO | 1 |
| biolink:Polypeptide | SO | 1 |
| biolink:PopulationOfIndividualOrganisms | OBI | 1 |
| biolink:PopulationOfIndividualOrganisms | PCO | 1 |
| biolink:PopulationOfIndividualOrganisms | SIO | 1 |
| biolink:Procedure | DATACOMMONS | 1 |
| biolink:ProcessedMaterial | OBI | 1 |
| biolink:Protein | SIO | 1 |
| biolink:ProteinDomain | NCIT | 1 |
| biolink:ProteinDomain | SIO | 1 |
| biolink:ProteinFamily | SIO | 1 |
| biolink:ReagentTargetedGene | GENO | 1 |
| biolink:RegulatoryRegion | SIO | 1 |
| biolink:RegulatoryRegion | SO | 1 |
| biolink:SiRNA | SO | 1 |
| biolink:Snv | SO | 1 |
| biolink:Study | NCIT | 1 |
| biolink:Study | SIO | 1 |
| biolink:StudyVariable | NCIT | 1 |
| biolink:Transcript | DATACOMMONS | 1 |
| biolink:Transcript | SIO | 1 |
| biolink:TranscriptionFactorBindingSite | SO | 1 |
| biolink:Treatment | OGMS | 1 |
| biolink:Treatment | SIO | 1 |
| biolink:Vertebrate | OMIT | 1 |
| biolink:WebPage | OBO | 1 |
| biolink:WebPage | SIO | 1 |
| biolink:Zygosity | GENO | 1 |
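
The query behind the prefix breakdown isn't shown in the thread. A sketch of one way to produce a table of this shape, assuming the `nodes` table has `id`, `category`, and `provided_by` columns and that the prefix is everything before the first colon in `id`, might look like the following. The in-memory rows are made up for illustration, and IDs without a colon are not handled specially.

```python
import sqlite3

# Count nodes per (category, CURIE prefix) for rows provided by phenio.
PREFIX_QUERY = """
SELECT category,
       substr(id, 1, instr(id, ':') - 1) AS prefix,
       count(*) AS n
FROM nodes
WHERE provided_by = 'phenio_nodes'
GROUP BY category, prefix
ORDER BY n DESC, category, prefix
"""


def prefix_counts(conn):
    """Return (category, prefix, count) tuples, largest counts first."""
    return conn.execute(PREFIX_QUERY).fetchall()


# Tiny in-memory stand-in for monarch-kg.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id TEXT, category TEXT, provided_by TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [
        ("SIO:010035", "biolink:Gene", "phenio_nodes"),
        ("DATACOMMONS:Gene", "biolink:Gene", "phenio_nodes"),
        ("HP:0000001", "biolink:PhenotypicFeature", "phenio_nodes"),
        ("HP:0000002", "biolink:PhenotypicFeature", "phenio_nodes"),
        ("HGNC:1", "biolink:Gene", "other_nodes"),  # ignored: different provided_by
    ],
)

counts = prefix_counts(conn)
for category, prefix, n in counts:
    print(category, prefix, n)
```

Against the real database, the same query string could be passed to `sqlite3 -markdown monarch-kg.db` in place of the category-only query above.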

caufieldjh commented 5 months ago

Those two Gene nodes will be removed upstream. The category assignments are coming in directly from Biolink Model, e.g. https://github.com/biolink/biolink-model/blob/569ecf63ae59bfd200dda8dd871ed50c2dff4345/biolink-model.yaml#L8283 (although Biolink uses the dcid prefix, which then becomes DATACOMMONS in PHENIO). So they're technically correct, in that everything that is a DATACOMMONS:Gene should be a biolink:Gene and so on, but at present we don't have anything in that first category, and it's not a Gene in the sense of most other classes.

caufieldjh commented 3 months ago

Are those extra Gene nodes still showing up? Are there other priority areas to handle upstream?