Open kevinschaper opened 5 months ago
Purely looking at category counts:
sqlite3 -markdown monarch-kg.db "select category, count(*) from nodes where provided_by = 'phenio_nodes' group by 1 order by 2 desc"
category | count(*) |
---|---|
biolink:PhenotypicFeature | 123272 |
biolink:BiologicalProcessOrActivity | 38774 |
biolink:GrossAnatomicalStructure | 28388 |
biolink:Disease | 27697 |
biolink:Cell | 18762 |
biolink:NamedThing | 18613 |
biolink:AnatomicalEntity | 11665 |
biolink:CellularComponent | 5283 |
biolink:MolecularEntity | 4582 |
biolink:BiologicalProcess | 3133 |
biolink:MacromolecularComplex | 2131 |
biolink:MolecularActivity | 1311 |
biolink:Protein | 1117 |
biolink:CellularOrganism | 955 |
biolink:PhenotypicQuality | 744 |
biolink:Pathway | 662 |
biolink:Vertebrate | 391 |
biolink:Virus | 320 |
biolink:BehavioralFeature | 297 |
biolink:LifeStage | 238 |
biolink:PathologicalProcess | 231 |
biolink:ChemicalEntity | 221 |
biolink:Drug | 101 |
biolink:OrganismTaxon | 93 |
biolink:SmallMolecule | 70 |
biolink:SequenceVariant | 70 |
biolink:InformationContentEntity | 23 |
biolink:NucleicAcidEntity | 18 |
biolink:EvidenceType | 16 |
biolink:GeographicExposure | 14 |
biolink:RNAProduct | 12 |
biolink:Transcript | 6 |
biolink:Plant | 4 |
biolink:Fungus | 4 |
biolink:ProteinFamily | 3 |
biolink:PopulationOfIndividualOrganisms | 3 |
biolink:Invertebrate | 3 |
biolink:Dataset | 3 |
biolink:WebPage | 2 |
biolink:Treatment | 2 |
biolink:Study | 2 |
biolink:RegulatoryRegion | 2 |
biolink:Publication | 2 |
biolink:ProteinDomain | 2 |
biolink:Patent | 2 |
biolink:MicroRNA | 2 |
biolink:MaterialSample | 2 |
biolink:Mammal | 2 |
biolink:IndividualOrganism | 2 |
biolink:Human | 2 |
biolink:Haplotype | 2 |
biolink:Genotype | 2 |
biolink:Genome | 2 |
biolink:GeneticInheritance | 2 |
biolink:Gene | 2 |
biolink:Exon | 2 |
biolink:EnvironmentalFeature | 2 |
biolink:ConfidenceLevel | 2 |
biolink:ChemicalExposure | 2 |
biolink:Agent | 2 |
biolink:Activity | 2 |
biolink:Zygosity | 1 |
biolink:TranscriptionFactorBindingSite | 1 |
biolink:StudyVariable | 1 |
biolink:Snv | 1 |
biolink:SiRNA | 1 |
biolink:ReagentTargetedGene | 1 |
biolink:ProcessedMaterial | 1 |
biolink:Procedure | 1 |
biolink:Polypeptide | 1 |
biolink:PhenotypicSex | 1 |
biolink:OrganismalEntity | 1 |
biolink:NoncodingRNAProduct | 1 |
biolink:GenotypicSex | 1 |
biolink:Event | 1 |
biolink:EnvironmentalProcess | 1 |
biolink:DrugExposure | 1 |
biolink:DiagnosticAid | 1 |
biolink:DatasetDistribution | 1 |
biolink:CodingSequence | 1 |
biolink:ChemicalMixture | 1 |
biolink:CellLine | 1 |
biolink:BiologicalSex | 1 |
biolink:BiologicalEntity | 1 |
biolink:Bacterium | 1 |
biolink:Attribute | 1 |
biolink:Article | 1 |
biolink:AccessibleDnaRegion | 1 |
Breaking it down further by ID prefix:
category | prefix | count(*) |
---|---|---|
biolink:PhenotypicFeature | ZP | 39373 |
biolink:BiologicalProcessOrActivity | GO | 38059 |
biolink:Disease | MONDO | 27691 |
biolink:PhenotypicFeature | UPHENO | 21361 |
biolink:PhenotypicFeature | XPO | 20340 |
biolink:PhenotypicFeature | HP | 17881 |
biolink:PhenotypicFeature | MP | 13796 |
biolink:Cell | FBbt | 12488 |
biolink:GrossAnatomicalStructure | UBERON | 10333 |
biolink:PhenotypicFeature | FYPO | 7880 |
biolink:NamedThing | OBA | 5984 |
biolink:AnatomicalEntity | UBERON | 5444 |
biolink:GrossAnatomicalStructure | EMAPA | 4639 |
biolink:MolecularEntity | CHEBI | 4581 |
biolink:GrossAnatomicalStructure | FMA | 4242 |
biolink:GrossAnatomicalStructure | FBbt | 4198 |
biolink:NamedThing | GO | 3646 |
biolink:Cell | WBbt | 3388 |
biolink:BiologicalProcess | GO | 3132 |
biolink:NamedThing | EMAPA | 2992 |
biolink:CellularComponent | WBbt | 2908 |
biolink:PhenotypicFeature | WBPhenotype | 2636 |
biolink:GrossAnatomicalStructure | MA | 2449 |
biolink:CellularComponent | GO | 2368 |
biolink:AnatomicalEntity | FBbt | 2311 |
biolink:GrossAnatomicalStructure | ZFA | 2201 |
biolink:MacromolecularComplex | GO | 2127 |
biolink:AnatomicalEntity | FMA | 1769 |
biolink:Cell | CL | 1761 |
biolink:NamedThing | XAO | 1611 |
biolink:MolecularActivity | GO | 1311 |
biolink:AnatomicalEntity | EMAPA | 1129 |
biolink:NamedThing | FBbt | 1113 |
biolink:Protein | PR | 1095 |
biolink:CellularOrganism | NCBITaxon | 955 |
biolink:NamedThing | WBbt | 793 |
biolink:Pathway | GO | 660 |
biolink:PhenotypicQuality | PATO | 655 |
biolink:Cell | ZFA | 641 |
biolink:BiologicalProcessOrActivity | NBO | 635 |
biolink:AnatomicalEntity | MA | 611 |
biolink:NamedThing | RO | 529 |
biolink:Cell | FMA | 474 |
biolink:NamedThing | MP | 448 |
biolink:Vertebrate | NCBITaxon | 390 |
biolink:NamedThing | FYPO | 329 |
biolink:GrossAnatomicalStructure | WBbt | 326 |
biolink:Virus | NCBITaxon | 320 |
biolink:BehavioralFeature | NBO | 297 |
biolink:AnatomicalEntity | ZFA | 257 |
biolink:LifeStage | HSAPDV | 238 |
biolink:PathologicalProcess | MPATH | 228 |
biolink:ChemicalEntity | CHEBI | 220 |
biolink:NamedThing | CHR | 203 |
biolink:AnatomicalEntity | WBbt | 140 |
biolink:NamedThing | ZFA | 118 |
biolink:NamedThing | FAO | 115 |
biolink:Drug | CHEBI | 100 |
biolink:NamedThing | OBO | 94 |
biolink:OrganismTaxon | NCBITaxon | 92 |
biolink:NamedThing | BSPO | 75 |
biolink:NamedThing | IAO | 70 |
biolink:SequenceVariant | SO | 70 |
biolink:SmallMolecule | CHEBI | 70 |
biolink:NamedThing | WBPhenotype | 62 |
biolink:BiologicalProcessOrActivity | UBERON | 60 |
biolink:NamedThing | PO | 58 |
biolink:NamedThing | MPATH | 52 |
biolink:PhenotypicQuality | CHEBI | 47 |
biolink:PhenotypicQuality | MONDO | 41 |
biolink:NamedThing | ZFS | 38 |
biolink:NamedThing | UBPROP | 31 |
biolink:NamedThing | BFO | 29 |
biolink:NamedThing | TS | 28 |
biolink:NamedThing | HSAPDV | 21 |
biolink:NamedThing | NCIT | 21 |
biolink:Protein | CHEBI | 21 |
biolink:NamedThing | ENVO | 20 |
biolink:NamedThing | UPHENO | 20 |
biolink:NucleicAcidEntity | SO | 18 |
biolink:EvidenceType | ECO | 16 |
biolink:NamedThing | FMA | 16 |
biolink:InformationContentEntity | ECO | 15 |
biolink:NamedThing | ECTO | 15 |
biolink:GeographicExposure | DATACOMMONS | 14 |
biolink:NamedThing | OMO | 13 |
biolink:NamedThing | OIO | 12 |
biolink:BiologicalProcessOrActivity | BFO | 9 |
biolink:NamedThing | LINKML | 9 |
biolink:RNAProduct | SO | 9 |
biolink:InformationContentEntity | IAO | 8 |
biolink:NamedThing | OBI | 8 |
biolink:BiologicalProcessOrActivity | PO | 7 |
biolink:CellularComponent | CL | 6 |
biolink:NamedThing | MFOMD | 6 |
biolink:Cell | EMAPA | 5 |
biolink:NamedThing | ECO | 4 |
biolink:NamedThing | MF | 4 |
biolink:NamedThing | dc | 4 |
biolink:Transcript | SO | 4 |
biolink:BiologicalProcessOrActivity | FMA | 3 |
biolink:Cell | MA | 3 |
biolink:MacromolecularComplex | PR | 3 |
biolink:NamedThing | CARO | 3 |
biolink:NamedThing | owl | 3 |
biolink:RNAProduct | CHEBI | 3 |
biolink:EnvironmentalFeature | ENVO | 2 |
biolink:Fungus | FOODON | 2 |
biolink:NamedThing | GOREL | 2 |
biolink:NamedThing | MAXO | 2 |
biolink:NamedThing | NBO | 2 |
biolink:NamedThing | rdfs | 2 |
biolink:PathologicalProcess | NCIT | 2 |
biolink:Plant | NCIT | 2 |
biolink:Plant | PO | 2 |
biolink:ProteinFamily | NCIT | 2 |
biolink:Publication | IAO | 2 |
biolink:AccessibleDnaRegion | SO | 1 |
biolink:Activity | NCIT | 1 |
biolink:Activity | PROV | 1 |
biolink:Agent | PROV | 1 |
biolink:Agent | dcterms | 1 |
biolink:AnatomicalEntity | CARO | 1 |
biolink:AnatomicalEntity | NCIT | 1 |
biolink:AnatomicalEntity | SIO | 1 |
biolink:AnatomicalEntity | XAO | 1 |
biolink:Article | SIO | 1 |
biolink:Attribute | SIO | 1 |
biolink:Bacterium | NCBITaxon | 1 |
biolink:BiologicalEntity | SIO | 1 |
biolink:BiologicalProcess | SIO | 1 |
biolink:BiologicalProcessOrActivity | RO | 1 |
biolink:BiologicalSex | PATO | 1 |
biolink:Cell | GO | 1 |
biolink:Cell | SIO | 1 |
biolink:CellLine | CLO | 1 |
biolink:CellularComponent | SIO | 1 |
biolink:ChemicalEntity | SIO | 1 |
biolink:ChemicalExposure | ECTO | 1 |
biolink:ChemicalExposure | SIO | 1 |
biolink:ChemicalMixture | NCIT | 1 |
biolink:CodingSequence | SIO | 1 |
biolink:ConfidenceLevel | CIO | 1 |
biolink:ConfidenceLevel | SEPIO | 1 |
biolink:Dataset | DATACOMMONS | 1 |
biolink:Dataset | IAO | 1 |
biolink:Dataset | dctypes | 1 |
biolink:DatasetDistribution | dcat | 1 |
biolink:DiagnosticAid | SNOMED | 1 |
biolink:Disease | DATACOMMONS | 1 |
biolink:Disease | DOID | 1 |
biolink:Disease | NCIT | 1 |
biolink:Disease | Orphanet | 1 |
biolink:Disease | SIO | 1 |
biolink:Disease | UMLS | 1 |
biolink:Drug | DATACOMMONS | 1 |
biolink:DrugExposure | ECTO | 1 |
biolink:EnvironmentalProcess | ENVO | 1 |
biolink:Event | NCIT | 1 |
biolink:Exon | SIO | 1 |
biolink:Exon | SO | 1 |
biolink:Fungus | NCBITaxon | 1 |
biolink:Fungus | NCIT | 1 |
biolink:Gene | DATACOMMONS | 1 |
biolink:Gene | SIO | 1 |
biolink:GeneticInheritance | GENO | 1 |
biolink:GeneticInheritance | NCIT | 1 |
biolink:Genome | SIO | 1 |
biolink:Genome | SO | 1 |
biolink:Genotype | GENO | 1 |
biolink:Genotype | SIO | 1 |
biolink:GenotypicSex | PATO | 1 |
biolink:Haplotype | GENO | 1 |
biolink:Haplotype | SO | 1 |
biolink:Human | NCIT | 1 |
biolink:Human | SIO | 1 |
biolink:IndividualOrganism | SIO | 1 |
biolink:IndividualOrganism | foaf | 1 |
biolink:Invertebrate | FOODON | 1 |
biolink:Invertebrate | NCIT | 1 |
biolink:Invertebrate | OMIT | 1 |
biolink:MacromolecularComplex | CL | 1 |
biolink:Mammal | FOODON | 1 |
biolink:Mammal | NCIT | 1 |
biolink:MaterialSample | OBI | 1 |
biolink:MaterialSample | SIO | 1 |
biolink:MicroRNA | SIO | 1 |
biolink:MicroRNA | SO | 1 |
biolink:MolecularEntity | PR | 1 |
biolink:NamedThing | BTO | 1 |
biolink:NamedThing | DATACOMMONS | 1 |
biolink:NamedThing | MOD | 1 |
biolink:NamedThing | OGMS | 1 |
biolink:NamedThing | PHENIO | 1 |
biolink:NamedThing | RNORDV | 1 |
biolink:NamedThing | dcterms | 1 |
biolink:NamedThing | foaf | 1 |
biolink:NoncodingRNAProduct | SIO | 1 |
biolink:OrganismTaxon | DATACOMMONS | 1 |
biolink:OrganismalEntity | CARO | 1 |
biolink:Patent | IAO | 1 |
biolink:Patent | SIO | 1 |
biolink:PathologicalProcess | OBI | 1 |
biolink:Pathway | PW | 1 |
biolink:Pathway | SIO | 1 |
biolink:PhenotypicFeature | APO | 1 |
biolink:PhenotypicFeature | FBcv | 1 |
biolink:PhenotypicFeature | NCIT | 1 |
biolink:PhenotypicFeature | SIO | 1 |
biolink:PhenotypicFeature | TO | 1 |
biolink:PhenotypicQuality | BFO | 1 |
biolink:PhenotypicSex | PATO | 1 |
biolink:Polypeptide | SO | 1 |
biolink:PopulationOfIndividualOrganisms | OBI | 1 |
biolink:PopulationOfIndividualOrganisms | PCO | 1 |
biolink:PopulationOfIndividualOrganisms | SIO | 1 |
biolink:Procedure | DATACOMMONS | 1 |
biolink:ProcessedMaterial | OBI | 1 |
biolink:Protein | SIO | 1 |
biolink:ProteinDomain | NCIT | 1 |
biolink:ProteinDomain | SIO | 1 |
biolink:ProteinFamily | SIO | 1 |
biolink:ReagentTargetedGene | GENO | 1 |
biolink:RegulatoryRegion | SIO | 1 |
biolink:RegulatoryRegion | SO | 1 |
biolink:SiRNA | SO | 1 |
biolink:Snv | SO | 1 |
biolink:Study | NCIT | 1 |
biolink:Study | SIO | 1 |
biolink:StudyVariable | NCIT | 1 |
biolink:Transcript | DATACOMMONS | 1 |
biolink:Transcript | SIO | 1 |
biolink:TranscriptionFactorBindingSite | SO | 1 |
biolink:Treatment | OGMS | 1 |
biolink:Treatment | SIO | 1 |
biolink:Vertebrate | OMIT | 1 |
biolink:WebPage | OBO | 1 |
biolink:WebPage | SIO | 1 |
biolink:Zygosity | GENO | 1 |
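The query behind the prefix breakdown isn't shown above. A minimal sketch of how it could be reproduced, assuming the `nodes` table has an `id` column holding CURIEs (the `category` and `provided_by` columns come from the query earlier in this comment; `id` and the toy rows are assumptions for illustration):

```python
import sqlite3

# Split each CURIE at the first ':' to get its prefix, then count
# nodes per (category, prefix) pair, largest groups first.
PREFIX_QUERY = """
SELECT category,
       substr(id, 1, instr(id, ':') - 1) AS prefix,
       count(*) AS n
FROM nodes
WHERE provided_by = 'phenio_nodes'
GROUP BY 1, 2
ORDER BY 3 DESC, 1, 2
"""


def category_prefix_counts(conn):
    """Return (category, prefix, count) rows for phenio-provided nodes."""
    return conn.execute(PREFIX_QUERY).fetchall()


def build_demo_db():
    """Build a tiny in-memory stand-in for monarch-kg.db (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (id TEXT, category TEXT, provided_by TEXT)")
    conn.executemany(
        "INSERT INTO nodes VALUES (?, ?, ?)",
        [
            ("HP:0000001", "biolink:PhenotypicFeature", "phenio_nodes"),
            ("HP:0000002", "biolink:PhenotypicFeature", "phenio_nodes"),
            ("MONDO:0000001", "biolink:Disease", "phenio_nodes"),
            ("DATACOMMONS:Gene", "biolink:Gene", "phenio_nodes"),
            # Not from phenio, so the WHERE clause should exclude it.
            ("HGNC:1", "biolink:Gene", "hgnc_gene_nodes"),
        ],
    )
    return conn


if __name__ == "__main__":
    for row in category_prefix_counts(build_demo_db()):
        print(" | ".join(str(col) for col in row))
```

Against the real database, the same SQL string should also work when passed to `sqlite3 -markdown monarch-kg.db`, provided the column assumption holds.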
Those two Gene nodes will be removed upstream.
The category assignments are coming in directly from Biolink Model, e.g. https://github.com/biolink/biolink-model/blob/569ecf63ae59bfd200dda8dd871ed50c2dff4345/biolink-model.yaml#L8283 (although Biolink uses the `dcid` prefix, which then becomes `DATACOMMONS` in PHENIO).
So they're technically correct, in that everything that is a `DATACOMMONS:Gene` should be a `biolink:Gene`, and so on, but at present we don't have anything in that first category, and it's not a Gene in the sense of most other classes.
Are those extra `Gene` nodes still showing up?
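One way to check this against a fresh build is to list the phenio-provided node IDs in a given category. This is only a sketch: the `id` column, the helper name `phenio_nodes_in_category`, and the toy rows are assumptions, not part of monarch-kg.

```python
import sqlite3


def phenio_nodes_in_category(conn, category, provided_by="phenio_nodes"):
    """Return sorted node IDs with the given category from one provider."""
    rows = conn.execute(
        "SELECT id FROM nodes WHERE provided_by = ? AND category = ? ORDER BY id",
        (provided_by, category),
    ).fetchall()
    return [row[0] for row in rows]


def build_toy_db():
    """In-memory stand-in for monarch-kg.db with illustrative rows only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (id TEXT, category TEXT, provided_by TEXT)")
    conn.executemany(
        "INSERT INTO nodes VALUES (?, ?, ?)",
        [
            ("DATACOMMONS:Gene", "biolink:Gene", "phenio_nodes"),
            ("SIO:0000000", "biolink:Gene", "phenio_nodes"),  # placeholder ID
            ("HGNC:5", "biolink:Gene", "hgnc_gene_nodes"),
            ("HP:0000118", "biolink:PhenotypicFeature", "phenio_nodes"),
        ],
    )
    return conn


if __name__ == "__main__":
    # An empty list here would mean the extra Gene nodes are gone.
    print(phenio_nodes_in_category(build_toy_db(), "biolink:Gene"))
```

On the real file, `sqlite3.connect("monarch-kg.db")` would replace `build_toy_db()`.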
Are there other priority areas to handle upstream?
While checking on an older issue, I found that we're getting "gene" nodes from phenio:
We likely want to have either an include or exclude filter on either the contents of kg-phenio or what parts of kg-phenio are imported into monarch-kg.
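As a rough illustration of what such a filter might look like on the import side, here is a sketch that supports both styles: an optional allow-list of ID prefixes and a deny-list of categories. The function name `filter_nodes`, its parameters, and the node dict shape (`id`, `category`) are hypothetical and not taken from the kg-phenio or monarch-kg code.

```python
from typing import Dict, Iterable, Iterator, Optional, Set


def filter_nodes(
    nodes: Iterable[Dict[str, str]],
    include_prefixes: Optional[Set[str]] = None,
    exclude_categories: Optional[Set[str]] = None,
) -> Iterator[Dict[str, str]]:
    """Yield nodes that pass the include and exclude rules.

    include_prefixes: if given, keep only nodes whose CURIE prefix is listed.
    exclude_categories: if given, drop nodes whose category is listed.
    """
    excluded = exclude_categories or set()
    for node in nodes:
        prefix = node["id"].split(":", 1)[0]
        if include_prefixes is not None and prefix not in include_prefixes:
            continue
        if node["category"] in excluded:
            continue
        yield node


# Toy input; IDs are illustrative only.
TOY_NODES = [
    {"id": "HP:0000118", "category": "biolink:PhenotypicFeature"},
    {"id": "MONDO:0000001", "category": "biolink:Disease"},
    {"id": "DATACOMMONS:Gene", "category": "biolink:Gene"},
    {"id": "SIO:0000000", "category": "biolink:Gene"},
]

if __name__ == "__main__":
    # Exclude style: drop everything phenio labels as a Gene.
    for node in filter_nodes(TOY_NODES, exclude_categories={"biolink:Gene"}):
        print(node["id"])
```

An allow-list of prefixes is the stricter option, since anything unexpected is dropped by default. A category deny-list is easier to maintain but lets new stray prefixes through.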