monarch-initiative / monarchr

R package for easy access, manipulation, and analysis of the Monarch Initiative or other KGX-formatted knowledge graphs.
https://monarch-initiative.github.io/monarchr/articles/monarchr
Other
9 stars 1 forks source link

Add utility function to compute single-graph semantic similarity #41

Open bschilder opened 2 months ago

bschilder commented 2 months ago

A very common thing users might want to do is to compute the semantic similarity between nodes in a graph and then store that data back in the edges of the graph (to use as edge weights later).

I've created the following function to automate this.

#' Add semantic similarity
#'
#' First computes semantic similarity between all pairs of nodes in a graph.
#' Then adds the continuous similarity score as an edge attribute.
#' @param sim_fun The similarity function to use.
#'  Default is link[igraph]{similarity}.
#' @param sim_col Name of the new edge attribute to store the similarity score.
#' @param ... Additional arguments passed to the similarity function
#'  (\code{sim_fun}).
#' @returns Graph object with similarity added as a new edge attribute.
#' @export
#' @examples
#' filename <- system.file("extdata", "eds_marfan_kg.tar.gz", package = "monarchr")
#' g <- file_engine(filename) |>
#'           fetch_nodes(query_ids = "MONDO:0007525") |>
#'           expand(predicates = "biolink:has_phenotype",
#'                  categories = "biolink:PhenotypicFeature")|>
#'           expand(categories = "biolink:Gene")
#' g <- graph_semsim(g)
#' edges(g)$similarity
graph_semsim <- function(graph,
                        sim_fun=igraph::similarity,
                        sim_col="similarity",
                        ...){
    from <- to <- NULL;
    message("Computing pairwise node similarity.")
    X <- sim_fun(graph, ...)
    rownames(X) <- colnames(X) <- igraph::V(graph)$name
    graph <- graph|>
        activate(edges)|>
        dplyr::mutate(!!sim_col:=purrr::map2_dbl(from, to, ~ X[.y, .x]))
    return(graph)
}
bschilder commented 2 months ago

Of course, this won't yield the "true" semantic similarity between the nodes unless your graph is sufficiently large and complete to accurately characterise the relationships between the nodes. But still, can come in handy.

bschilder commented 2 months ago

monarch_semsim does something a bit like this, but instead compares two graph to each other. Is it possible to query the semantic similarity API for a single graph instead (so we can get accurate similarity scores between all edge-connected nodes within the graph)? Might be a nice complement to my graph_semsim function which only considers the local graph instance.

oneilsh commented 2 months ago

The Monarch semsim API does take two sets of node IDs, and computes the best match from each in set A to those in set B, and vice-versa (essentially designed to support https://monarchinitiative.org/explore#phenotype-explorer).

I suppose for a single graph we could just pass the nodes as both set A and set B, but the functionality giving just the best match means that each node will just be reported to match itself (I think). Perhaps this is a feature request for the API - all-vs-all semantic similarity queries. Could be intensive though given the O(n^2) nature of the result. Tagging @kevinschaper

bschilder commented 1 month ago

The Monarch semsim API does take two sets of node IDs, and computes the best match from each in set A to those in set B, and vice-versa (essentially designed to support https://monarchinitiative.org/explore#phenotype-explorer).

I suppose for a single graph we could just pass the nodes as both set A and set B, but the functionality giving just the best match means that each node will just be reported to match itself (I think). Perhaps this is a feature request for the API - all-vs-all semantic similarity queries. Could be intensive though given the O(n^2) nature of the result. Tagging @kevinschaper

Yeah, instead of returning only the top 1 similar node I'd want to return each node's similarities with each other node. I agree, for large graphs this would be a massive computation. Perhaps precomputing this and storing it as a separate database (version-controlled and regenerated for each KG release). Is this something that would be feasible @kevinschaper ?