greenelab / connectivity-search-analyses

hetnet connectivity search research notebooks (previously hetmech)
BSD 3-Clause "New" or "Revised" License
8 stars 5 forks source link

Simultaneous query of multiple nodes #55

Open zietzm opened 7 years ago

zietzm commented 7 years ago

Add functionality in hetmech to query a set of nodes in order to return a ranked list of connection predictions along with the corresponding metapaths.

For example, if a set of genes were queried, we would want output of the form:

Predictions

Rank End Node Metapath DWPC p-DWPC r-DWPC
1 Compound A CbGpPpG ... ... ...
2 Disease B DtCcCcCbG ... ... ...
3 Anatomy C AlDtC<rG ... ... ...
... ... ... ... ... ...

In order to reduce computation time, it may be useful to cache DWPC matrices so that a set of query nodes (given as a vector) can be queried almost instantly (order ~ 100 microseconds).

It should be noted that after work done in #54 and #43 to add sparse matrix functionality, the computation time for DWPC over all 752 compatible Rephetio metapaths has been reduced from a total of 6.5 hours to 48 minutes! In fact, the longest computation time for DWPC over a single metapath is now around 35 seconds, while the average time is about 3.9 seconds (see below).

Caching the DWPC matrices for the 752 Rephetio metapaths using scipy.io.savemat saves 752 sparse matrices as a .mat file. The file size for all these matrices is 461 MB.


The histogram below shows the distribution of DWPC times over the 752 metapaths. sparse_rephetio_v2

zietzm commented 7 years ago

Numpy has built-in functionality for using multiple processor cores when performing the multiplications involved in DWPC. Scipy, however, does not natively support multi-processing for sparse matrix multiplication.

Following a suggestion by @dhimmel, I used a multi-core approach with sparse matrices to re-run the 752 hetmech-compatible Rephetio metapaths.

This cut the computation time down even more, to a total of just under 20 minutes.

The individual metapath computation times did not significantly change, and the maximum path computation time was 31 seconds (see below).


multicore_dwpc_times


Following yet another suggestion by @dhimmel, I used the concurrent.futures.ProcessPoolExecutor to run all paths in around 14 minutes. This method did show an increase in the maximum DWPC time to 70 seconds, but the distribution looks very promising (see below). Of the 752 paths, only 9 took longer than 32 seconds to run, and the vast majority (644/752) took less than five seconds.

I also attempted to use the concurrent.futures.ThreadPoolExecutor as a way to conserve memory while running, but this method actually at times used just as much memory, and was considerably slower than ProcessPoolExecutor.

concurrent_futures

kkloste commented 7 years ago

Nice work! Good to see our "efficient algorithm prototype" tested in a more high throughput setting with some positive results.

I still think we can bring down the runtime on the longest-running instances by using optimal matrix-chain-ordering.

zietzm commented 6 years ago
DWPC P/SD-DWPC Z-DWPC Information
0 0 NaN No information either way
Nonzero 0 Inf Some evidence of a connection. Need more info
0 Nonzero 0 Some evidence of no connection
Nonzero Nonzero Nonzero Evidence depends on Z-DWPC
dhimmel commented 6 years ago

We can probably set Z-DWPC in the first row (where all DPWCs are zero) to 0. The Inf situations will be harder to deal with because they are potentially of interest, but it will be difficult to rank them.

zietzm commented 6 years ago

p-dwpc-heatmap

Regarding the interpolation of remaining Z-DWPC infinities, I'm not sure we could interpolate based on nearby source and target degrees. We could use the maximum P-DWPC, and maybe the average SD-DWPC, in which case we may be overly conservative. Alternatively, we could do mean P-DWPC as well as mean SD-DWPC, in which case the results seem less conservative but still within the range of normal possibility.

For reference, here are the DWPC values for those paths having Z-DWPC = Infinity: image

Here are resulting Z-DWPCs for those paths having Z-DWPC = Infinity originally: