greenelab / connectivity-search-analyses

hetnet connectivity search research notebooks (previously hetmech)
BSD 3-Clause "New" or "Revised" License

Assessing path and intermediate node contributions to a node-pair search #150

Open dhimmel opened 5 years ago

dhimmel commented 5 years ago

@ben-heil and I are meeting presently to discuss potential projects for his rotation in the @greenelab. In the search engine we're building, users select a node pair and we identify paths that occur more frequently than expected by chance. For that search engine, it will be important to identify not just metapaths, but also the specific paths and intermediate nodes that are relevant.

For example, the hetmech-backend database, which is still populated, returns the following most-significant metapaths between the gene FTO and disease obesity:

metapath_id  path_count      dwpc       p_value  source_degree  target_degree  n_dwpcs  n_nonzero_dwpcs  nonzero_mean    nonzero_sd
  DaGpBPpG         435  2.814122  3.932283e-08            373             32    29000            29000      2.095634  1.214048e-01
   DaGeAeG        6204  2.002286  7.400247e-08            373             28    53000            53000      1.868725  2.485515e-02
   DpSpDaG          25  4.434438  1.337052e-04             17              6   101000           100994      2.438776  4.514481e-01
     DrDaG           3  5.138905  2.442112e-03              5              6   181800            32414      3.920578  5.135883e-01
   DlAlDaG          42  3.744022  7.010120e-03             33              6    20200            20200      2.726263  3.786078e-01

The challenge is to further decompose these DWPCs into individual path scores. Once that is accomplished, and path scores can be compared across metapaths, we can even aggregate scores by intermediate nodes, as we have briefly explored previously in Decomposing the DWPC to assess intermediate node or edge contributions.
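As a rough illustration of that decomposition, the sketch below follows the DWPC definition from the first background paper: each path gets a path degree product (PDP), the product of every edge's two metaedge-specific degrees raised to the negative damping exponent, and the DWPC is the sum of PDPs over all paths. The function names, the input layout, the damping value of 0.5, and the toy node names are all assumptions made for illustration, not the hetmech implementation. The dwpc column in the table above may also be scaled or transformed, so only the per-path fractions are meant to be comparable.

```python
from collections import defaultdict
from math import prod


def path_degree_product(edge_degrees, damping=0.5):
    """PDP for one path.

    edge_degrees: list of (source_degree, target_degree) pairs, one per edge.
    damping: damping exponent w (0.5 is an assumed default).
    """
    return prod((s * t) ** -damping for s, t in edge_degrees)


def decompose_dwpc(paths, damping=0.5):
    """Return the DWPC (sum of PDPs) and each path's fraction of it."""
    pdps = [path_degree_product(p["edge_degrees"], damping) for p in paths]
    dwpc = sum(pdps)
    return dwpc, [pdp / dwpc for pdp in pdps]


def node_contributions(paths, fractions):
    """Sum path fractions over the intermediate nodes each path visits."""
    totals = defaultdict(float)
    for path, fraction in zip(paths, fractions):
        # Exclude the source and target, which every path shares.
        for node in set(path["nodes"][1:-1]):
            totals[node] += fraction
    return dict(totals)


# Hypothetical toy paths for a single metapath of length two.
toy_paths = [
    {"nodes": ["source", "mid_A", "target"], "edge_degrees": [(1, 4), (4, 1)]},
    {"nodes": ["source", "mid_B", "target"], "edge_degrees": [(1, 4), (4, 4)]},
]
toy_dwpc, toy_fractions = decompose_dwpc(toy_paths)
toy_node_scores = node_contributions(toy_paths, toy_fractions)
```

Because the fractions sum to one within a metapath, they still need a metapath-level weight before paths from different metapaths can be compared.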

So the main tasks here would seem to be:

  1. Going from DWPC to path score, which can be accomplished using a neo4j cypher query and is largely already implemented in some form or another.
  2. Finding a method to assign an overall weight to a metapath, such that paths from different metapaths can be ranked according to a common score.
  3. Identifying how this approach can fit within the hetmech search engine, which likely needs an implementation that is near immediate in human time.
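
One possible reading of task 2, offered only as a sketch: weight each metapath by the magnitude of its p-value, `-log10(p_value)`, and multiply that weight by the fraction of the metapath's DWPC that a path accounts for. The weighting scheme, the function names, and the `dwpc_fraction` values below are assumptions for illustration. The two p-values are taken from the table above.

```python
import math


def metapath_weight(p_value):
    """Assumed metapath weight: larger for more significant metapaths.

    Requires p_value > 0; a p-value of exactly zero would need special handling.
    """
    return -math.log10(p_value)


def rank_paths(paths):
    """Attach a common score to each path and sort from highest to lowest."""
    scored = [
        {**p, "score": p["dwpc_fraction"] * metapath_weight(p["p_value"])}
        for p in paths
    ]
    return sorted(scored, key=lambda p: p["score"], reverse=True)


# p-values come from the table above; dwpc_fraction values are hypothetical.
example_paths = [
    {"metapath": "DaGpBPpG", "p_value": 3.932283e-08, "dwpc_fraction": 0.10},
    {"metapath": "DrDaG", "p_value": 2.442112e-03, "dwpc_fraction": 0.50},
]
ranked = rank_paths(example_paths)
```

Under this scheme, a path that dominates a moderately significant metapath can outrank a minor path from a highly significant one. Whether that behavior is desirable is part of what this issue needs to settle.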

dhimmel commented 5 years ago

For background reading:

  1. Heterogeneous Network Edge Prediction: A Data Integration Approach to Prioritize Disease-Associated Genes
    Daniel S. Himmelstein, Sergio E. Baranzini
    PLOS Computational Biology (2015-07-09) https://doi.org/98q
    DOI: 10.1371/journal.pcbi.1004259 · PMID: 26158728 · PMCID: PMC4497619

  2. Systematic integration of biomedical knowledge prioritizes drugs for repurposing
    Daniel Scott Himmelstein, Antoine Lizee, Christine Hessler, Leo Brueggeman, Sabrina L Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, Sergio E Baranzini
    eLife (2017-09-22) https://doi.org/cdfk
    DOI: 10.7554/elife.26726 · PMID: 28936969 · PMCID: PMC5640425