Open oneilsh opened 1 month ago
Making some progress on this one with a summary() method for engines; it prints some summary info (unless quiet = TRUE is passed) and returns a list of the summary info invisibly:
> res <- monarch_engine() |> summary()
A Neo4j-backed knowledge graph engine.
Gathering statistics, please wait...
Total nodes: 1032366
Total edges: 8626674
Node category counts:
category count
biolink:Entity 1032366
biolink:NamedThing 1032366
biolink:ThingWithTaxon 999620
biolink:BiologicalEntity 999602
biolink:PhysicalEssenceOrOccurrent 845454
biolink:OntologyClass 783717
biolink:PhysicalEssence 779441
biolink:GenomicEntity 717557
biolink:ChemicalEntityOrGeneOrGeneProduct 577339
biolink:MacromolecularMachineMixin 574363
biolink:GeneOrGeneProduct 572248
biolink:Gene 571124
biolink:DiseaseOrPhenotypicFeature 153897
biolink:Genotype 133380
biolink:PhenotypicFeature 124544
biolink:Occurrent 66013
biolink:BiologicalProcessOrActivity 66009
biolink:SubjectOfInvestigation 58906
biolink:OrganismalEntity 58905
biolink:AnatomicalEntity 56807
biolink:Disease 29353
biolink:BiologicalProcess 26268
biolink:Cell 26159
biolink:Pathway 22331
biolink:GrossAnatomicalStructure 17122
biolink:SequenceVariant 13028
biolink:ChemicalEntityOrProteinOrPolypeptide 6209
biolink:CellularComponent 5319
biolink:ChemicalOrDrugOrTreatment 5095
biolink:ChemicalEntity 5094
biolink:MolecularEntity 4723
biolink:MacromolecularComplex 2115
biolink:MolecularActivity 1560
biolink:CellularOrganism 1532
biolink:GeneProductMixin 1124
biolink:Polypeptide 1115
biolink:Protein 1114
biolink:Vertebrate 561
biolink:Virus 322
biolink:BehavioralFeature 297
biolink:LifeStage 238
biolink:PathologicalProcess 231
biolink:PathologicalEntityMixin 231
biolink:ChemicalMixture 104
biolink:Drug 100
biolink:MolecularMixture 100
biolink:SmallMolecule 72
biolink:Attribute 50
biolink:InformationContentEntity 49
biolink:ClinicalAttribute 44
biolink:Onset 44
biolink:ClinicalCourse 44
biolink:OrganismTaxon 41
biolink:NucleicAcidEntity 18
biolink:Transcript 16
biolink:EvidenceType 16
biolink:RNAProduct 10
biolink:Plant 4
biolink:Publication 4
biolink:Fungus 4
biolink:ExposureEvent 3
biolink:PlanetaryEntity 3
biolink:ProcessedMaterial 3
biolink:Activity 3
biolink:ActivityAndBehavior 3
biolink:Mammal 3
biolink:GeneGroupingMixin 3
biolink:BiologicalSex 3
biolink:RegulatoryRegion 3
biolink:ChemicalExposure 2
biolink:EnvironmentalFeature 2
biolink:ConfidenceLevel 2
biolink:Dataset 2
biolink:AdministrativeEntity 2
biolink:Agent 2
biolink:Invertebrate 2
biolink:ProteinFamily 2
biolink:GeneticInheritance 2
biolink:Haplotype 2
biolink:PopulationOfIndividualOrganisms 2
biolink:NoncodingRNAProduct 2
biolink:DrugExposure 1
biolink:Patent 1
biolink:CellLine 1
biolink:EnvironmentalProcess 1
biolink:WebPage 1
biolink:DatasetDistribution 1
biolink:IndividualOrganism 1
biolink:Bacterium 1
biolink:ProteinDomain 1
biolink:StudyVariable 1
biolink:Human 1
biolink:Event 1
biolink:Study 1
biolink:Zygosity 1
biolink:ReagentTargetedGene 1
biolink:MaterialSample 1
biolink:PhysicalEntity 1
biolink:Treatment 1
biolink:PhenotypicSex 1
biolink:GenotypicSex 1
biolink:Exon 1
biolink:TranscriptionFactorBindingSite 1
biolink:MicroRNA 1
biolink:SiRNA 1
biolink:Genome 1
biolink:Snv 1
biolink:AccessibleDnaRegion 1
biolink:DiagnosticAid 1
Edge type counts:
predicate count
biolink:interacts_with 2483429
biolink:has_phenotype 1460300
biolink:expressed_in 1229152
biolink:actively_involved_in 632984
biolink:orthologous_to 551358
biolink:subclass_of 522698
biolink:enables 431854
biolink:located_in 312598
biolink:related_to 298633
biolink:participates_in 268302
biolink:acts_upstream_of_or_within 145303
biolink:active_in 144810
biolink:part_of 63174
biolink:causes 15628
biolink:is_sequence_variant_of 13050
biolink:model_of 9242
biolink:has_mode_of_inheritance 8577
biolink:gene_associated_with_condition 7971
biolink:acts_upstream_of 7095
biolink:contributes_to 6334
biolink:treats_or_applied_or_studied_to_treat 5945
biolink:colocalizes_with 2653
biolink:associated_with_increased_likelihood_of 2199
biolink:genetically_associated_with 2155
biolink:acts_upstream_of_positive_effect 473
biolink:acts_upstream_of_or_within_positive_effect 440
biolink:acts_upstream_of_negative_effect 165
biolink:acts_upstream_of_or_within_negative_effect 152
> str(res)
List of 4
$ node_summary:'data.frame': 109 obs. of 2 variables:
..$ category: chr [1:109] "biolink:Entity" "biolink:NamedThing" "biolink:ThingWithTaxon" "biolink:BiologicalEntity" ...
..$ count : int [1:109] 1032366 1032366 999620 999602 845454 783717 779441 717557 577339 574363 ...
$ edge_summary:'data.frame': 28 obs. of 2 variables:
..$ predicate: chr [1:28] "biolink:interacts_with" "biolink:has_phenotype" "biolink:expressed_in" "biolink:actively_involved_in" ...
..$ count : int [1:28] 2483429 1460300 1229152 632984 551358 522698 431854 312598 298633 268302 ...
$ total_nodes : int 1032366
$ total_edges : int 8626674
Not 100% happy, but it's a start.
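For anyone following along, here is a minimal sketch of the print-unless-quiet, return-invisibly pattern described above. It is not the actual engine code: `toy_engine`, `count_values()` and the in-memory data frames are hypothetical stand-ins (a real engine would gather its counts from the database), and only the shape of the returned list mirrors the `str(res)` output.

```r
# Hypothetical toy engine: a classed list holding node and edge data frames.
toy_engine <- structure(
  list(
    nodes = data.frame(
      id = c("A", "B", "C"),
      category = c("biolink:Gene", "biolink:Gene", "biolink:Disease"),
      stringsAsFactors = FALSE
    ),
    edges = data.frame(
      subject = c("A", "B"),
      predicate = c("biolink:causes", "biolink:causes"),
      object = c("C", "C"),
      stringsAsFactors = FALSE
    )
  ),
  class = "toy_engine"
)

# Hypothetical helper: count each distinct value, sorted by descending count.
count_values <- function(x, name) {
  tab <- sort(table(x), decreasing = TRUE)
  out <- data.frame(names(tab), as.integer(tab), stringsAsFactors = FALSE)
  names(out) <- c(name, "count")
  out
}

# S3 summary method: prints unless quiet = TRUE, always returns invisibly.
summary.toy_engine <- function(object, ..., quiet = FALSE) {
  res <- list(
    node_summary = count_values(object$nodes$category, "category"),
    edge_summary = count_values(object$edges$predicate, "predicate"),
    total_nodes = nrow(object$nodes),
    total_edges = nrow(object$edges)
  )
  if (!quiet) {
    cat("Total nodes:", res$total_nodes, "\n")
    cat("Total edges:", res$total_edges, "\n")
    cat("Node category counts:\n")
    print(res$node_summary, row.names = FALSE)
    cat("Edge type counts:\n")
    print(res$edge_summary, row.names = FALSE)
  }
  invisible(res)
}

# Nothing is printed here; the summary list is still captured.
res <- summary(toy_engine, quiet = TRUE)
```

Returning with `invisible()` means the list does not get auto-printed on top of the formatted summary at the console, but it is still there when assigned or piped onward.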
On another note, I'm prototyping a sampling method that grabs a diverse set of predicates and node categories, satisfying two conditions: A) every predicate is represented, and B) every category is represented.
1) sample one edge of every predicate (and the connected nodes)
2) identify the set of categories not yet represented so far
3) additionally sample one node each of those missing categories
4) join all of the above into a sample graph and return it
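The four steps above can be sketched roughly as follows on plain data frames. This is an illustration, not the prototype itself: `sample_diverse()`, the toy `nodes`/`edges` tables and the list-column layout for categories are all assumptions made for the example, and a real implementation would sample through the engine rather than in memory.

```r
# Hypothetical toy graph; each node may carry several categories (list column).
nodes <- data.frame(id = c("g1", "g2", "d1", "p1", "c1"), stringsAsFactors = FALSE)
nodes$category <- list(
  c("biolink:Gene", "biolink:GeneOrGeneProduct"),
  c("biolink:Gene", "biolink:GeneOrGeneProduct"),
  "biolink:Disease",
  "biolink:PhenotypicFeature",
  "biolink:Cell"
)
edges <- data.frame(
  subject   = c("g1", "g2", "g2", "d1"),
  predicate = c("biolink:interacts_with", "biolink:interacts_with",
                "biolink:causes", "biolink:has_phenotype"),
  object    = c("g2", "g1", "d1", "p1"),
  stringsAsFactors = FALSE
)

sample_diverse <- function(nodes, edges) {
  # 1) sample one edge of every predicate (and note the connected nodes)
  idx_by_pred <- split(seq_len(nrow(edges)), edges$predicate)
  edge_idx <- vapply(idx_by_pred, function(i) i[sample.int(length(i), 1)], integer(1))
  sampled_edges <- edges[sort(unname(edge_idx)), , drop = FALSE]
  node_ids <- unique(c(sampled_edges$subject, sampled_edges$object))

  # 2) identify the categories not yet represented
  all_cats <- unique(unlist(nodes$category))
  covered <- unique(unlist(nodes$category[nodes$id %in% node_ids]))
  missing_cats <- setdiff(all_cats, covered)

  # 3) sample one node for each missing category
  for (missing_cat in missing_cats) {
    candidates <- which(vapply(nodes$category, function(cs) missing_cat %in% cs, logical(1)))
    pick <- candidates[sample.int(length(candidates), 1)]
    node_ids <- unique(c(node_ids, nodes$id[pick]))
  }

  # 4) join everything into a sample graph
  list(nodes = nodes[nodes$id %in% node_ids, , drop = FALSE], edges = sampled_edges)
}

set.seed(42)
smp <- sample_diverse(nodes, edges)
```

In this toy graph the sampled edges never touch a `biolink:Cell` node, so step 3 is what pulls in `c1`; the resulting sample can contain nodes with no edges.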
This isn't perfect:
It would be nice to have a small set of functions to help users orient themselves to a KG's contents (i.e., the KG handled by an engine) for effective querying. Some ideas:
summarize_nodes - count nodes in the KG, broken out by category. This is somewhat complicated by overlapping categories; e.g. the count for GeneOrGeneProduct would subsume the count for Gene, and it's hard to know whether a category is fully subsumed by another or merely overlaps it. Counting by pcategory would be nice, but may require some fancy Cypher since those are defined post-fetch.
summarize_edges - summarize the relationships in a KG: counts of the different kinds, and maybe also counts of the categories of nodes they connect (e.g. there are N "biolink:has_phenotype" edges, N1 of which connect Disease -> Phenotype, N2 of which connect Gene -> Phenotype, etc.)
sample - sample a set of nodes and edges. Perhaps filtered to specific categories or relationships of interest
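To make the summarize_edges idea concrete, here is a rough sketch of the per-predicate breakdown by connected categories, done in memory with base R. Everything here is hypothetical: the function signature, the long-format `nodes_long` table (one row per node/category pair) and the toy data are assumptions for illustration, and a real version would more likely push the counting into the database.

```r
# Hypothetical toy data: one row per (node, category) pair.
nodes_long <- data.frame(
  id = c("g1", "g1", "d1", "p1", "p2"),
  category = c("biolink:Gene", "biolink:GeneOrGeneProduct", "biolink:Disease",
               "biolink:PhenotypicFeature", "biolink:PhenotypicFeature"),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  subject   = c("g1", "d1", "d1"),
  predicate = rep("biolink:has_phenotype", 3),
  object    = c("p1", "p1", "p2"),
  stringsAsFactors = FALSE
)

summarize_edges <- function(nodes_long, edges) {
  # Attach every category of the subject and of the object to each edge.
  subj <- setNames(nodes_long, c("subject", "subject_category"))
  obj  <- setNames(nodes_long, c("object", "object_category"))
  expanded <- merge(merge(edges, subj, by = "subject"), obj, by = "object")

  # Count edges per (predicate, subject category, object category).
  counts <- aggregate(
    count ~ predicate + subject_category + object_category,
    data = transform(expanded, count = 1L),
    FUN = sum
  )
  counts[order(-counts$count), ]
}

edge_counts <- summarize_edges(nodes_long, edges)
print(edge_counts, row.names = FALSE)
```

Because a node can carry several overlapping categories, one edge is counted once per category pair, so these counts can add up to more than the number of edges. That is the same subsumption issue noted for summarize_nodes.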