Assessing path and intermediate node contributions to a node-pair search

@ben-heil and I are meeting presently to discuss potential projects for his rotation in the @greenelab. In the search engine we're building, users select a node pair and we identify paths that occur more frequently than expected by chance. For that to be useful, we need to identify not just metapaths, but also the specific paths and intermediate nodes that are relevant.

For example, the hetmech-backend database, which is still populated, returns the following most-significant metapaths between the gene FTO and disease obesity:

metapath_id  path_count      dwpc       p_value  source_degree  target_degree  n_dwpcs  n_nonzero_dwpcs  nonzero_mean    nonzero_sd
  DaGpBPpG         435  2.814122  3.932283e-08            373             32    29000            29000      2.095634  1.214048e-01
   DaGeAeG        6204  2.002286  7.400247e-08            373             28    53000            53000      1.868725  2.485515e-02
   DpSpDaG          25  4.434438  1.337052e-04             17              6   101000           100994      2.438776  4.514481e-01
     DrDaG           3  5.138905  2.442112e-03              5              6   181800            32414      3.920578  5.135883e-01
   DlAlDaG          42  3.744022  7.010120e-03             33              6    20200            20200      2.726263  3.786078e-01

The challenge is to further decompose these DWPCs into individual path scores. Once that is accomplished, and path scores can be compared across metapaths, we can also aggregate scores by intermediate node, as we briefly explored in "Decomposing the DWPC to assess intermediate node or edge contributions".
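As a rough illustration of the decomposition, here is a minimal Python sketch. It assumes the untransformed DWPC is the sum of per-path scores, where each path's score is the product of the degrees along its edges raised to the power of minus a damping exponent (0.5 here, which is an assumption). The DWPC values in the table above may be transformed and rescaled, so this does not reproduce them. The node names, degrees, and the helper `path_degree_product` are all hypothetical.

```python
from collections import defaultdict
from math import prod

DAMPING = 0.5  # assumed damping exponent; not taken from the table above


def path_degree_product(degrees, damping=DAMPING):
    """Score one path: (product of the degrees along its edges) ** -damping."""
    return prod(degrees) ** -damping


# Hypothetical 3-edge paths: (nodes along the path, degrees for each edge end).
toy_paths = [
    (("source", "x", "z", "target"), [2, 1, 2, 1, 1, 2]),
    (("source", "x", "w", "target"), [2, 1, 2, 2, 1, 2]),
    (("source", "y", "w", "target"), [2, 1, 1, 2, 1, 2]),
]

path_scores = {nodes: path_degree_product(degrees) for nodes, degrees in toy_paths}
dwpc = sum(path_scores.values())  # untransformed DWPC for this metapath

# Credit each intermediate node with the scores of all paths through it.
node_totals = defaultdict(float)
for nodes, score in path_scores.items():
    for node in nodes[1:-1]:
        node_totals[node] += score

for node, total in sorted(node_totals.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {total / dwpc:.2f} of the DWPC passes through this node")
```

Because every path here has two intermediate nodes, the per-node shares add up to more than 1. Whether to split a path's score among its intermediates instead is a design choice this sketch leaves open.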

So the main tasks here would seem to be:

1. Going from DWPC to path score, which can be accomplished using a Neo4j Cypher query and is largely already implemented in some form.
2. Finding a method to assign an overall weight to each metapath, so that paths from different metapaths can be ranked by a common score.
3. Identifying how this approach can fit within the hetmech search engine, which likely needs an implementation that feels near-immediate to a human user.
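To make the second task concrete, here is one candidate weighting, sketched in Python. It is an assumption for discussion, not an established method: each metapath gets a weight of -log10(p_value), and each path receives that weight multiplied by its share of the metapath's total path score. The two p-values are taken from the table above. The path names and per-path scores are hypothetical placeholders, as are the function names.

```python
from math import log10

# p-values copied from the table above for two of the metapaths.
table_p_values = {"DaGpBPpG": 3.932283e-08, "DrDaG": 2.442112e-03}

# Hypothetical per-path scores within each metapath (placeholders, not real data).
toy_scores = {
    "DaGpBPpG": {"path_1": 0.02, "path_2": 0.01},
    "DrDaG": {"path_3": 0.5, "path_4": 0.3, "path_5": 0.2},
}


def metapath_weight(p_value):
    """Assumed weight: more significant metapaths get a larger weight."""
    return -log10(p_value)


def rank_paths(scores_by_metapath, p_values):
    """Return (common_score, metapath, path) tuples, best first."""
    ranked = []
    for metapath, scores in scores_by_metapath.items():
        total = sum(scores.values())
        weight = metapath_weight(p_values[metapath])
        for path, score in scores.items():
            # Each path gets its share of the metapath's weight.
            ranked.append((weight * score / total, metapath, path))
    return sorted(ranked, reverse=True)


ranked = rank_paths(toy_scores, table_p_values)
for common_score, metapath, path in ranked:
    print(f"{common_score:.3f}  {metapath}  {path}")
```

One caveat with this choice: dividing by the metapath total dilutes individual paths in metapaths with many paths (435 for DaGpBPpG versus 3 for DrDaG in the table), so a different normalization may turn out to be more appropriate.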

greenelab/connectivity-search-analyses, issue #150