greenelab / connectivity-search-analyses

hetnet connectivity search research notebooks (previously hetmech)
BSD 3-Clause "New" or "Revised" License
9 stars 5 forks source link

Coordinating matrix DWPC implementation with Scripps folks #58

Open dhimmel opened 7 years ago

dhimmel commented 7 years ago

Several individuals in the Su Lab at Scripps Research Institute, specifically @veleritas, @NuriaQueralt and @mmayers12, are working on extensions to Rephetio and Hetionet. We're going to catch up on a call Wednesday July 12, 2017 from 1-2 PM. As an aside, @cgreene @zietzm @danich1 @kkloste, feel free to join this call, even if just as background audio. Since we want to focus the call on more high level planning, @mmayers12 suggested we take a look at his work on implementing matrix DWPC beforehand.

You can see the current progress in the mmayers12/hetnet-ml repo and specifically in matrix_tools.py. It looks like @mmayers12 has arrived to several of the same decisions and algorithms we've implemented here. For example, the DWWC and DWPC distinction.

For reference, the work in hetmech currently consists of three primary contributors:

It would be great if @mmayers12, @kkloste, @zietzm and I could all get on the same page regarding how the mmayers12/hetnet-ml and greenelab/hetmech implementation differ. My hope is that @mmayers12 will be able to contribute his advances to this repo. Specifically, we should compare our matrix-DWPC algorithms which are discussed specifically in #52, #47, #45 (merged), and #20 (outdated). Our current implementation is in degree_weight.py. In short, @kkloste has figured out how to exclude duplicate-node paths for some metapaths but not all.

There is also the larger question of whether DWPC is even needed, or whether DWWC is sufficient as comparison to permuted-DWWCs could correct for duplicate-node paths. Finally, Thinklab is shutting down and transitioning to a read-only state :disappointed:, hence the new URL of https://think-lab.github.io.

cgreene commented 7 years ago

Are times EST? If EST I will do my best to join but I am likely to be a few minutes late. Still, the call is long enough that I should hit most of it :smile:

dhimmel commented 7 years ago

Are times EST? If EST I will do my best to join but I am likely to be a few minutes late.

Ah yes, enter the call via https://zoom.us/j/936884443. The call is scheduled from 1:00-1:50 PM EST, however, I'm free afterwards if anyone wants to have break out discussions.

kkloste commented 7 years ago

I can talk for most of the 1:00-1:50pm window. Nice to meet yall!

mmayers12 commented 7 years ago

Hi everyone!

I've had a chance to look through the discussions and algorithms posted and it looks like we have both come to similar conclusions independently. It seems an exact DWPC solution can be found via matrix multiplication when only one metanode is repeated in a metapath, up to 3 times (are you able to accurately calculate CrCrCrCtD?), or when two metanodes are repeated in either a non-overlapping or entirely-overlapping fashion.

The focus of the hetnet-ml repository is to provide fast and efficient extraction of DWPC (and potentially other) features for a machine learning pipeline like that of Rephetio. This would allow for experimentation that may require many multiple runs through such a pipeline like testing new data-sources, k-fold cross validation, or testing predictions utilizing other gold standards.

To this end, things that are quite expensive in terms of computation time, like building the adjacency matrices and weighting by degree, are performed only one time. This way, only the matrix multiplication needs performed to retrieve any one DWPC. This results in a computation time of about 1-2 seconds for the densest matrices, with the majority computing in tens of milliseconds (scipy.sparse is really a godsend in this respect).

Finally, I've devised a method to quickly estimate a solution when the DWPC for a metapath cannot be calculated exactly via matrices. I have tested this estimation method in comparison the features originally extracted by @dhimmel using neo4j, and the results produced are almost identical (I just added a few old test notebooks to our repo so you can take a look). You can also see that using the DWWC for the same metapaths completely over-fits the model, with coefficients for metapaths containing a CtD edge being overvalued.

I'm happy to talk and answer any questions you may have about our implementation ahead of our meeting on Wednesday, otherwise we can discuss it at that time.

dhimmel commented 7 years ago

are you able to accurately calculate CrCrCrCtD

Our current implementation will return results, i.e. this wouldn't raise a NotImplementedError. The CrCrCrC portion of the metapath would be computed separately. After each multiplication, the diagonal would be subtracted. @zietzm did you check whether the path counts equal the Rephetio path counts for this metapath?

when two metanodes are repeated in either a non-overlapping or entirely-overlapping fashion.

We'll set up a chat for @mmayers12 and @kkloste to come to consensus here.

This would allow for experimentation that may require many multiple runs through such a pipeline like testing new data-sources, k-fold cross validation, or testing predictions utilizing other gold standards.

Those are important use cases! We're mostly pursuing unsupervised approaches like given a query set of nodes of the same type, can we detect a metapath-target_node combination that best arrives at the query set. We're also interested in translating node sets to different types, e.g. symptoms to pathways.

To this end, things that are quite expensive in terms of computation time, like building the adjacency matrices and weighting by degree, are performed only one time.

@zietzm is building the adjancy matrices and degree normalization expensive in our implementation? I thought they were pretty quick. But yes, one exciting aspect of the matrix approaches is the ability to cache lot's of the intermediate computations.

I'm happy to talk and answer any questions you may have about our implementation ahead of our meeting on Wednesday

@mmayers12 we'll make sure to include you in relevant discussions here. Since I think we have a pretty good idea of the next steps regarding matrix DWPC, we'll probably devote most of our wednesday meeting to less certain or more general discussion.

kkloste commented 7 years ago

are you able to accurately calculate CrCrCrCtD

Yes, the algorithm we implement provably works for any number of repetitions (of a single node), even with other metanodes between, e.g. CrXeYbCrCtCeCa...eC (I'm making up the connecting letters like 'e')

when two metanodes are repeated in either a non-overlapping or entirely-overlapping fashion.

Just to make sure we're on the same page --

Our algorithmic framework can handle all three, but currently the only implementation that's finished is for the non-overlapping case.

mmayers12 commented 7 years ago

The CrCrCrC portion of the metapath would be computed separately. After each multiplication, the diagonal would be subtracted.

This is the same algorithm we've used, however, it is not perfect. Take for instance the example of CrCrCrCtD, with nodes a, b, c, d all of type C. By subtracting the diagonal at each step, you effectively remove any paths that start and end on the same node (a-x-y-a, b-x-y-b, etc.). However as the path gets longer, paths that start on a, but have other cycles, like b-c-b would not be eliminated. Because CrC does not have any self-references in it (the diagonal is zero), you will be free from a-b-b or a-c-c cycles, but as the number of consecutive visits to the C metanode increases, a-b-c-b and a-c-b-c paths show up, resulting in over-counting.

Just to make sure we're on the same page --

  • entirely-overlapping: CrBrBrC
  • non-overlapping: CrCtBrB
  • partially-overlapping case: CrBtCrB

Yes, that's exactly what I meant. Through my testing, I have not been able to find an exact solution to the partially-overlapping case. Looking at CrBtCrB you can eliminate the redundant paths due to vising C twice, or visiting B twice. However, eliminating both simultaneously is problematic.

For example, if you look at the over-counting due to vising C: multiply CrB * BtC, then take only the diagonal from the result and multiply by CrB (diagonal * CrB here, since multiplication order is important). Similarly you can get the can get the over-counting due to visiting B twice: take the diagonal from BtC * CrB then multiply CrB * diagonal2. Now if you subtract both of these results from the simple walk-count (Crb * BtC * CrB) you've eliminated all paths with redundant nodes, but you've doubly subtracted paths with redundant nodes of both types (e.g. given a, b of type C and x, y of type B, paths a-y-a-x and b-y-a-y have been properly removed, but paths a-x-a-x and b-y-b-y have been doubly removed). The other option, instead of subtracting both over-counting results from the walk count, is to take an element-wise maximum of the two over-counting matrices, and subtract that from the walk-count. While this will not over-subtract for paths a-x-a-x and b-y-b-y, it will under-remove for a to x when paths like a-y-a-x and a-x-b-x both exist.

So now we have a method that overestimates the error and a method that underestimates it. In order to get a fast approximation for DWPC in these cases, we take an element-wise average of the two errors matrices (sum and element-wise max) to get something that's fairly close to the true DWPC while still calculating the result in under 5 seconds. If you have a way that comes to an exact answer every time, I'd be really excited to see it!

Eventually, I would like to be able to look at metapaths longer than 4 steps, however, with that comes more complicated cases of repeated and overlapping metanodes.

is building the adjancy matrices and degree normalization expensive in our implementation? I thought they were pretty quick.

In my experience, the larger ones take 10-20 seconds to build. When your multiplication finishes on the ms time-scale, you don't want to have to build them every time you extract DWPC for a new metapath.

@mmayers12 we'll make sure to include you in relevant discussions here. Since I think we have a pretty good idea of the next steps regarding matrix DWPC, we'll probably devote most of our wednesday meeting to less certain or more general discussion.

Sounds good to me! The rest of the lab is working more on the data side of things, and I know it can be a bit tedious to go over the subtilies matrix operations with those who aren't working on it daily.

kkloste commented 7 years ago

Regarding metapaths that repeat the same node more than three times --- Ach, I didn't see it before in our analysis, thank you for pointing this out to me. I had believed I proved our method would work for any number of repetitions. We had handled other problems, like the x-a-y-a type repetitions, but do not properly handle x-x-x-x. @zietzm -- this explains the problem you just posted about here.

Looking at CrBtCrB you can eliminate the redundant paths due to vising C twice, or visiting B twice. However, eliminating both simultaneously is problematic.

This is exactly the problem we ran into a couple weeks ago. Accounting for loops boils down to the inclusion-exclusion principle: we want to subtract the set of walks with loops in C and the set of wwalks with loops in B but we end up subtracting twice the set of walks with loops in both B and C. Our solution is to:

  1. subtract the walks visiting C twice (which inadvertently subtracts walks that visit both C and B)
  2. subtract the walks visiting B twice (which inadvertently subtracts walks that visit both C and B)
  3. compute just the walks that visit both B and C twice, and add those one time (since we double subtracted them in steps 1 and 2, and really we only want to subtract them once).

This approach can generalize to more repeated nodes, but it gets messier. We haven't performed the generalization yet.