monarch-initiative / monarchr

R package for easy access, manipulation, and analysis of Monarch KG data

KG engine exploration #23

Open oneilsh opened 1 month ago

oneilsh commented 1 month ago

It would be nice to have a small set of functions that help users orient themselves to the contents of a KG (i.e., one handled by an engine) so they can query it effectively. Some ideas:

summarize_nodes - count nodes in the KG, broken out by category. This is somewhat complicated by overlapping categories; e.g., the count for GeneOrGeneProduct would subsume the count for Gene, and it's hard to tell whether a category is fully subsumed by another or related in some other way. Counting by pcategory would be nice, but may require some fancy Cypher since pcategory values are defined post-fetch.

summarize_edges - summarize the relationships in a KG: counts of each predicate, and maybe counts of the categories of nodes they connect (e.g., there are N "biolink:has_phenotype" edges, N1 of which connect Disease -> Phenotype, N2 of which connect Gene -> Phenotype, etc.).

sample - sample a set of nodes and edges, perhaps filtered to specific categories or relationships of interest.
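
To make the first two ideas concrete, here is a minimal base R sketch on toy data. It does not use monarchr or an engine; the column names (id, pcategory, category, subject, predicate, object) and the tiny graph are assumptions made for illustration, and a real implementation would presumably push the counting into Cypher.

```r
# Toy graph; all column names and values here are illustrative assumptions.
nodes <- data.frame(
  id = c("G1", "G2", "D1", "P1"),
  pcategory = c("biolink:Gene", "biolink:Gene",
                "biolink:Disease", "biolink:PhenotypicFeature"),
  stringsAsFactors = FALSE
)
# Each node carries several overlapping categories (a list column).
nodes$category <- list(
  c("biolink:Gene", "biolink:GeneOrGeneProduct"),
  c("biolink:Gene", "biolink:GeneOrGeneProduct"),
  c("biolink:Disease", "biolink:DiseaseOrPhenotypicFeature"),
  c("biolink:PhenotypicFeature", "biolink:DiseaseOrPhenotypicFeature")
)
edges <- data.frame(
  subject = c("G1", "G2", "D1"),
  predicate = rep("biolink:has_phenotype", 3),
  object = c("P1", "P1", "P1"),
  stringsAsFactors = FALSE
)

# summarize_nodes idea: a node is counted once for every category it has,
# so a parent category's count subsumes the counts of its children.
cat_counts <- table(unlist(nodes$category))
node_summary <- data.frame(
  category = names(cat_counts),
  count = as.integer(cat_counts),
  stringsAsFactors = FALSE
)
node_summary <- node_summary[order(-node_summary$count, node_summary$category), ]

# summarize_edges idea: count edges per predicate, broken out by the
# primary category of the subject and object nodes.
pcat <- setNames(nodes$pcategory, nodes$id)
edges$subject_pcategory <- unname(pcat[edges$subject])
edges$object_pcategory <- unname(pcat[edges$object])
edge_summary <- aggregate(
  count ~ predicate + subject_pcategory + object_pcategory,
  data = transform(edges, count = 1L),
  FUN = sum
)

print(node_summary)
print(edge_summary)
```

Counting by pcategory sidesteps the overlap problem because each node has exactly one primary category, which is why the edge breakdown above uses it rather than the full category list.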

oneilsh commented 1 week ago

Making some progress on this one with a summary() method for engines; it prints some summary info (unless quiet = TRUE is passed) and returns a list of the summary info invisibly:

> res <- monarch_engine() |> summary()

A Neo4j-backed knowledge graph engine.
Gathering statistics, please wait...
Total nodes:  1032366 
Total edges:  8626674 

Node category counts:
                                     category   count
                               biolink:Entity 1032366
                           biolink:NamedThing 1032366
                       biolink:ThingWithTaxon  999620
                     biolink:BiologicalEntity  999602
           biolink:PhysicalEssenceOrOccurrent  845454
                        biolink:OntologyClass  783717
                      biolink:PhysicalEssence  779441
                        biolink:GenomicEntity  717557
    biolink:ChemicalEntityOrGeneOrGeneProduct  577339
           biolink:MacromolecularMachineMixin  574363
                    biolink:GeneOrGeneProduct  572248
                                 biolink:Gene  571124
           biolink:DiseaseOrPhenotypicFeature  153897
                             biolink:Genotype  133380
                    biolink:PhenotypicFeature  124544
                            biolink:Occurrent   66013
          biolink:BiologicalProcessOrActivity   66009
               biolink:SubjectOfInvestigation   58906
                     biolink:OrganismalEntity   58905
                     biolink:AnatomicalEntity   56807
                              biolink:Disease   29353
                    biolink:BiologicalProcess   26268
                                 biolink:Cell   26159
                              biolink:Pathway   22331
             biolink:GrossAnatomicalStructure   17122
                      biolink:SequenceVariant   13028
 biolink:ChemicalEntityOrProteinOrPolypeptide    6209
                    biolink:CellularComponent    5319
            biolink:ChemicalOrDrugOrTreatment    5095
                       biolink:ChemicalEntity    5094
                      biolink:MolecularEntity    4723
                biolink:MacromolecularComplex    2115
                    biolink:MolecularActivity    1560
                     biolink:CellularOrganism    1532
                     biolink:GeneProductMixin    1124
                          biolink:Polypeptide    1115
                              biolink:Protein    1114
                           biolink:Vertebrate     561
                                biolink:Virus     322
                    biolink:BehavioralFeature     297
                            biolink:LifeStage     238
                  biolink:PathologicalProcess     231
              biolink:PathologicalEntityMixin     231
                      biolink:ChemicalMixture     104
                                 biolink:Drug     100
                     biolink:MolecularMixture     100
                        biolink:SmallMolecule      72
                            biolink:Attribute      50
             biolink:InformationContentEntity      49
                    biolink:ClinicalAttribute      44
                                biolink:Onset      44
                       biolink:ClinicalCourse      44
                        biolink:OrganismTaxon      41
                    biolink:NucleicAcidEntity      18
                           biolink:Transcript      16
                         biolink:EvidenceType      16
                           biolink:RNAProduct      10
                                biolink:Plant       4
                          biolink:Publication       4
                               biolink:Fungus       4
                        biolink:ExposureEvent       3
                      biolink:PlanetaryEntity       3
                    biolink:ProcessedMaterial       3
                             biolink:Activity       3
                  biolink:ActivityAndBehavior       3
                               biolink:Mammal       3
                    biolink:GeneGroupingMixin       3
                        biolink:BiologicalSex       3
                     biolink:RegulatoryRegion       3
                     biolink:ChemicalExposure       2
                 biolink:EnvironmentalFeature       2
                      biolink:ConfidenceLevel       2
                              biolink:Dataset       2
                 biolink:AdministrativeEntity       2
                                biolink:Agent       2
                         biolink:Invertebrate       2
                        biolink:ProteinFamily       2
                   biolink:GeneticInheritance       2
                            biolink:Haplotype       2
      biolink:PopulationOfIndividualOrganisms       2
                  biolink:NoncodingRNAProduct       2
                         biolink:DrugExposure       1
                               biolink:Patent       1
                             biolink:CellLine       1
                 biolink:EnvironmentalProcess       1
                              biolink:WebPage       1
                  biolink:DatasetDistribution       1
                   biolink:IndividualOrganism       1
                            biolink:Bacterium       1
                        biolink:ProteinDomain       1
                        biolink:StudyVariable       1
                                biolink:Human       1
                                biolink:Event       1
                                biolink:Study       1
                             biolink:Zygosity       1
                  biolink:ReagentTargetedGene       1
                       biolink:MaterialSample       1
                       biolink:PhysicalEntity       1
                            biolink:Treatment       1
                        biolink:PhenotypicSex       1
                         biolink:GenotypicSex       1
                                 biolink:Exon       1
       biolink:TranscriptionFactorBindingSite       1
                             biolink:MicroRNA       1
                                biolink:SiRNA       1
                               biolink:Genome       1
                                  biolink:Snv       1
                  biolink:AccessibleDnaRegion       1
                        biolink:DiagnosticAid       1

Edge type counts:
                                          predicate   count
                             biolink:interacts_with 2483429
                              biolink:has_phenotype 1460300
                               biolink:expressed_in 1229152
                       biolink:actively_involved_in  632984
                             biolink:orthologous_to  551358
                                biolink:subclass_of  522698
                                    biolink:enables  431854
                                 biolink:located_in  312598
                                 biolink:related_to  298633
                            biolink:participates_in  268302
                 biolink:acts_upstream_of_or_within  145303
                                  biolink:active_in  144810
                                    biolink:part_of   63174
                                     biolink:causes   15628
                     biolink:is_sequence_variant_of   13050
                                   biolink:model_of    9242
                    biolink:has_mode_of_inheritance    8577
             biolink:gene_associated_with_condition    7971
                           biolink:acts_upstream_of    7095
                             biolink:contributes_to    6334
      biolink:treats_or_applied_or_studied_to_treat    5945
                           biolink:colocalizes_with    2653
    biolink:associated_with_increased_likelihood_of    2199
                biolink:genetically_associated_with    2155
           biolink:acts_upstream_of_positive_effect     473
 biolink:acts_upstream_of_or_within_positive_effect     440
           biolink:acts_upstream_of_negative_effect     165
 biolink:acts_upstream_of_or_within_negative_effect     152
> str(res)
List of 4
 $ node_summary:'data.frame':   109 obs. of  2 variables:
  ..$ category: chr [1:109] "biolink:Entity" "biolink:NamedThing" "biolink:ThingWithTaxon" "biolink:BiologicalEntity" ...
  ..$ count   : int [1:109] 1032366 1032366 999620 999602 845454 783717 779441 717557 577339 574363 ...
 $ edge_summary:'data.frame':   28 obs. of  2 variables:
  ..$ predicate: chr [1:28] "biolink:interacts_with" "biolink:has_phenotype" "biolink:expressed_in" "biolink:actively_involved_in" ...
  ..$ count    : int [1:28] 2483429 1460300 1229152 632984 551358 522698 431854 312598 298633 268302 ...
 $ total_nodes : int 1032366
 $ total_edges : int 8626674

Not 100% happy, but it's a start.
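
Since the summary is returned as a plain list of data frames, it can be post-processed with ordinary R. The sketch below builds a small mock list with the same four elements shown by str(res) above (using a few of the counts from that output) so it runs without a database connection; the fraction column and the filters are illustrative additions, not part of the method.

```r
# Mock of the returned list: same element names as str(res) above,
# but with only a handful of rows copied from the printed output.
res <- list(
  node_summary = data.frame(
    category = c("biolink:Entity", "biolink:Gene", "biolink:Disease"),
    count = c(1032366L, 571124L, 29353L),
    stringsAsFactors = FALSE
  ),
  edge_summary = data.frame(
    predicate = c("biolink:interacts_with", "biolink:has_phenotype"),
    count = c(2483429L, 1460300L),
    stringsAsFactors = FALSE
  ),
  total_nodes = 1032366L,
  total_edges = 8626674L
)

# Express each category count as a fraction of all nodes.
ns <- res$node_summary
ns$fraction <- ns$count / res$total_nodes

# Look up a few categories of interest.
of_interest <- ns[ns$category %in% c("biolink:Gene", "biolink:Disease"), ]
print(of_interest)

# Predicates with fewer than two million edges.
smaller <- res$edge_summary[res$edge_summary$count < 2e6, ]
print(smaller)
```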

oneilsh commented 1 week ago

On another idea: I'm prototyping a sampling method that grabs a diversity of predicates and node categories, satisfying two conditions: A) every predicate is represented, and B) every category is represented.

1) sample one edge of every predicate (and the connected nodes)
2) identify the set of categories not yet represented so far
3) additionally sample one node each of those missing categories
4) join all of the above into a sample graph and return it
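
The four steps above can be sketched in base R on in-memory data frames. This is not the prototype itself: the function name sample_diverse, the column names, and the toy graph are all hypothetical, and the real method would presumably sample through the engine rather than from local tables.

```r
# Hypothetical helper illustrating the four steps on local data frames.
sample_diverse <- function(nodes, edges) {
  # 1) Sample one edge of every predicate, plus the nodes it connects.
  by_pred <- split(seq_len(nrow(edges)), edges$predicate)
  edge_idx <- vapply(by_pred, function(i) i[sample.int(length(i), 1)], integer(1))
  s_edges <- edges[edge_idx, , drop = FALSE]
  s_ids <- unique(c(s_edges$subject, s_edges$object))

  # 2) Identify the categories not yet represented.
  covered <- unique(unlist(nodes$category[nodes$id %in% s_ids]))
  missing <- setdiff(unique(unlist(nodes$category)), covered)

  # 3) Sample one node for each missing category, skipping categories
  #    that an earlier pick in this loop has already covered.
  for (ct in missing) {
    covered <- unlist(nodes$category[nodes$id %in% s_ids])
    if (ct %in% covered) next
    has_ct <- vapply(nodes$category, function(x) ct %in% x, logical(1))
    cand <- nodes$id[has_ct]
    s_ids <- union(s_ids, cand[sample.int(length(cand), 1)])
  }

  # 4) Join everything into a sample graph and return it.
  list(nodes = nodes[nodes$id %in% s_ids, , drop = FALSE], edges = s_edges)
}

# Toy graph (illustrative values only).
nodes <- data.frame(id = c("G1", "G2", "D1", "P1", "C1"), stringsAsFactors = FALSE)
nodes$category <- list(
  c("Gene", "GeneOrGeneProduct"),
  c("Gene", "GeneOrGeneProduct"),
  c("Disease", "DiseaseOrPhenotypicFeature"),
  c("PhenotypicFeature", "DiseaseOrPhenotypicFeature"),
  c("Cell", "AnatomicalEntity")
)
edges <- data.frame(
  subject = c("G1", "D1", "G1"),
  predicate = c("has_phenotype", "has_phenotype", "interacts_with"),
  object = c("P1", "P1", "G2"),
  stringsAsFactors = FALSE
)

set.seed(1)
out <- sample_diverse(nodes, edges)
print(out$edges)
print(out$nodes$id)
```

Nodes added in step 3 may be disconnected from the sampled edges, since nothing in the procedure requires them to share an edge with the rest of the sample.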

This isn't perfect: