greenelab / connectivity-search-analyses

hetnet connectivity search research notebooks (previously hetmech)
BSD 3-Clause "New" or "Revised" License

DWPC implementation coverage of the Project Rephetio metapaths #47

Closed: dhimmel closed this issue 6 years ago

dhimmel commented 7 years ago

We have currently implemented the DWPC (see #45) for certain metapaths. The question is: of the 1,206 metapaths in Project Rephetio, how many will our current implementation succeed on?

You can get the metapath info here or here. An interactive metapath browser for Project Rephetio is also available at http://het.io/repurpose/metapaths.html

zietzm commented 7 years ago

454 / 1206 metapaths will throw an error when running dwpc. The error comes from get_segments, which raises an error when a metapath contains overlapping segments of duplicate metanodes.

I am currently running the full dwpc function, between all compounds and diseases, for the 752 metapaths that run in our current implementation. From this, I hope to compare our predictions to Rephetio's predictions, at least for this subset of metapaths. I think it would be really interesting to see where the metapaths we don't support fall in the ranking of metapaths by AUROC and coefficient. It may be useful to know whether the metapaths we don't support in hetmech are informative or not.

dhimmel commented 7 years ago

454 / 1206 metapaths will throw an error when running dwpc

@zietzm can you submit a pull request with this notebook?

zietzm commented 7 years ago

Sure thing! The notebook is running something at the moment, but I will open a pull request as soon as it is complete.

kkloste commented 7 years ago

I've made some progress this week on the "two repeated metanodes, interleaved" case, specifically the "ABAB" sub-case ("BAAB" requires a different solution, I believe). Hopefully this will bring that "454" down to a lower number.
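To make the "ABAB" versus "BAAB" distinction concrete, here is a minimal sketch that records where each repeated metanode first and last appears and compares the two spans. The helper names repeat_spans and span_relation are hypothetical and are not part of hetmech. This is not how get_segments works, and it only looks at the first and last occurrence of each metanode.

```python
def repeat_spans(metanodes):
    """Map each repeated metanode to its (first_index, last_index) span."""
    spans = {}
    for i, node in enumerate(metanodes):
        first, _ = spans.get(node, (i, i))
        spans[node] = (first, i)
    # Keep only metanodes that occur more than once.
    return {node: span for node, span in spans.items() if span[0] != span[1]}


def span_relation(span_a, span_b):
    """Classify two spans as 'disjoint', 'nested', or 'interleaved'."""
    (a0, a1), (b0, b1) = sorted([span_a, span_b])
    if a1 < b0:
        return "disjoint"      # e.g. A A B B
    if b1 < a1:
        return "nested"        # e.g. B A A B
    return "interleaved"       # e.g. A B A B


for name in ("ABAB", "BAAB", "AABB"):
    spans = repeat_spans(list(name))
    print(name, span_relation(spans["A"], spans["B"]))
```

In this framing, the interleaved case is the one where neither repeated metanode can be isolated in its own contiguous segment.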

zietzm commented 7 years ago

[Image: adjacency_matrix_size]

I wanted to reference the relative sizes here, so I included the table above. I hope to have another one soon that shows the multiplication time for all of the relevant metaedge combinations.

Edit: Here it is (top 20 times).

zietzm commented 7 years ago

Another random note: within the auto_convert function, it may be best to add "and not sparse.issparse(matrix)" to the condition, as this reduces dwpc runtime with sparse matrices by about 16 percent. It looks like some automatic conversion somewhere in the function converts csc to csr. This conversion was the largest share of dwpc runtime when running with sparse_threshold=1 or when using only sparse matrices.

[Image: add_issparse_line_auto_convert]

Edit: As mentioned in #53, the problem traces back to degree_weight.dwwc_step, where degree-weighting the adjacency matrix outputs a sparse.coo_matrix; when we multiply that by a csc_matrix, the output is a csr_matrix. We can either ignore this by allowing csr to pass (as in the code above), or change the function so that it correctly outputs csc.

Edit 2: This issue is the purpose of #54.
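The format change described above can be checked on a toy matrix. This is a minimal sketch, assuming SciPy is installed; the coo_matrix here is only a stand-in for the output of degree-weighting and does not call degree_weight.dwwc_step. It prints the format of the product and shows one way to get back to csc with an explicit conversion.

```python
import numpy as np
from scipy import sparse

# Toy symmetric adjacency matrix (hypothetical data, not from the hetnet).
dense = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]], dtype=float)

weighted = sparse.coo_matrix(dense)   # stand-in for a degree-weighted matrix
adjacency = sparse.csc_matrix(dense)

# Multiply coo by csc and inspect which format SciPy returns.
product = weighted @ adjacency
print(weighted.format, adjacency.format, product.format)

# Explicit conversion guarantees csc regardless of the product's format.
normalized = product.tocsc()
print(normalized.format)
```

Whether to convert explicitly or to accept csr downstream is the trade-off raised in the edit above: the conversion costs time on every step, while accepting csr avoids it.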

zietzm commented 7 years ago

In comparing the Rephetio DWPC results with hetmech's results, I discovered that the two methods disagree for about 13.4 percent of the metapaths. I haven't been able to trace the disagreement back to its source. However, I have been working with an example path with specific nodes and edges that I think could help us arrive at a solution.

I was looking at the metapath 'CrCrCrCtD', as suggested by the hetnet folks at Scripps. I noticed a point of disagreement between the nodes "DB00352" and "DOID:2531". I loaded a Neo4j instance and confirmed by hand that there should not be a 'CrCrCrCtD' path between them. Rephetio correctly gave a path count of 0 for this pair. Hetmech returned 1.

Here is the graph: [Image: neo0]

My hypothesis is that the traversal is somehow matching metanodes CCCCD by passing through nodes abcbd, which revisits node b. I am not sure about this analysis, but because we stayed within the same metanode, the segments created were CrCrCrC and CtD, and I think we still used a walk count within CrCrCrC: nothing in the code ensures that the same node is not repeated within a segment made up of a single metanode.

Another example that I think strengthens this hypothesis is a second PC=0 prediction from Rephetio, shown in the graph below. For this one, hetmech returned PC=2. I think the one-edge difference between the graph below and the graph above fits this picture.

[Image: neo4j-1]
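The walk-versus-path distinction behind this hypothesis can be illustrated on a toy graph. This is a minimal sketch with hypothetical nodes a, b, c, d; it is not the Neo4j graph shown above and does not reproduce hetmech's counts. It enumerates every CrCrCrCtD walk from a to d, then keeps only those that never revisit a compound.

```python
def compound_walks(resembles, source, length):
    """Enumerate all walks of `length` resembles-edges starting at `source`."""
    walks = [[source]]
    for _ in range(length):
        walks = [walk + [nxt]
                 for walk in walks
                 for nxt in sorted(resembles.get(walk[-1], ()))]
    return walks


# Hypothetical toy graph: a-b and b-c resemble each other; b treats d.
resembles = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
treats = {"b": {"d"}}

# Walks along CrCrCrC from "a" whose last compound treats "d".
walks = [w for w in compound_walks(resembles, "a", 3)
         if "d" in treats.get(w[-1], ())]
# Paths are walks that never repeat a node.
paths = [w for w in walks if len(set(w)) == len(w)]

print("walks:", walks)
print("paths:", paths)
```

Here both walks (a-b-a-b and a-b-c-b, each followed by the treats edge to d) revisit b, so the walk count is nonzero while the path count is 0, which is the kind of disagreement described above.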

I'm not sure where this leaves us. @kkloste, is there a potential solution for this, given that we appear to be dealing with repeated nodes rather than repeated metanodes? Perhaps we could do something like a series of subsegmentations for a repeated metanode sequence?

dhimmel commented 6 years ago

See the following:

Here's the breakdown of how Rephetio metapaths were classified for path counting:

short_repeat    599
BABA            278
BAAB            144
disjoint        131
other            32
no_repeats       18
long_repeat       4
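As a quick arithmetic check, the category counts above can be summed and turned into shares. The numbers are copied from the table; the variable names are just for illustration.

```python
# Counts copied from the classification table above.
counts = {
    "short_repeat": 599,
    "BABA": 278,
    "BAAB": 144,
    "disjoint": 131,
    "other": 32,
    "no_repeats": 18,
    "long_repeat": 4,
}

total = sum(counts.values())
for category, n in counts.items():
    # Left-align the name, right-align the count and its share of the total.
    print(f"{category:<14}{n:>5}{n / total:>8.1%}")
print(f"{'total':<14}{total:>5}")
```

The total matches the 1,206 Project Rephetio metapaths mentioned at the top of the thread.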