monarch-initiative / monarchr

R package for easy access, manipulation, and analysis of Monarch KG data
Other
6 stars 1 forks source link

Support file-based graphs from kghub #16

Open oneilsh opened 3 weeks ago

oneilsh commented 3 weeks ago

It would be nice to support both Neo4j-backed KGs and .tar.gz files hosted at https://kghub.io/ using a similar interface, but with different "engines" providing common features. For example:

g <- file_engine("mondo_kgx_tsv.tar.gz") %>%
        fetch_nodes(query_ids = c("MONDO:0007525", "MONDO:0007526"))

g <- neo4j_engine("https://neo4j.monarchinitiative.org")  %>%
        fetch_nodes(query_ids = c("MONDO:0007525", "MONDO:0007526"))

However, functions such as fetch_nodes() should return graphs, not engines, meaning subsequent queries would need to be provided an engine to use:

e <- file_engine("mondo_kgx_tsv.tar.gz")
g2 <- g %>% fetch_edges(e, predicates = "biolink:has_phenotype")

For a data-first approach, graphs can additionally carry with them the engine that generated them (or the most recent engine, in a multi-engine setting which should be feasible):

g2 <- g %>% fetch_edges(predicates = "biolink:has_phenotype")

By abstracting engines, we can also define a monarch_engine as a special case of neo4j_engine, and provide different implementations of features like search via the Monarch API.

I've made progress on this over in the engines branch, and am working through an MVP set of functions for the different engines:

constructors:

search_kg(engine) Keyword/string search

fetch_nodes(engine) Fetch nodes by ID list or property combination

fetch_edges(graph) Given a query graph, fetch more edges from the KG neighborhood (expand)

summarize_neighborhood(graph) Summarize the surrounding neighborhood in the KG without fetching it

Edit: I'll make a separate item for documenting as it's currently lagging the code:

oneilsh commented 3 weeks ago

@bschilder - while going through this higher-level API adjustment, I am also wondering if it's worth doing some function renaming to help with clarity.

Not sure how I feel about these yet, but just some ideas: *_engine could be *_kg, seach_kg could just be search (I didn't want to override the default non-generic in utils, but then I see other packages do so maybe not a big deal? Then again, if _engine becomes _kg then search_kg fits well). Meanwhile, fetch_nodes is only for engines (there's no query graph we are expanding outward from), while fetch_edges requires a graph to expand on, but the naming doesn't make that super clear. Maybe like fetch_kg_nodes and fetch_more?

bschilder commented 1 week ago

Awesome to see progress on this!

i like the idea of renaming to the suffix *_kg, seems more intuitive to me and less characters to write.

Regarding search, people do often overwrite these but I find this confusing (there's ambiguity as to which one you're calling when reading a script) and the messages that come up every time you load the package are annoying.

Another option might be to include an argument within fetch_nodes to expand the fetch to additional nodes that are a certain graph distance away: eg. 2-hop, 3-hop etc.

In the meantime, I'm going to merge your changes into my fork to test it out and give more specific feedback.