greenelab / connectivity-search-analyses

hetnet connectivity search research notebooks (previously hetmech)
BSD 3-Clause "New" or "Revised" License
9 stars 5 forks source link

Add sparse matrix option in metaedge_to_adjacency_matrix #36

Closed dhimmel closed 7 years ago

dhimmel commented 7 years ago

@kkloste meet @zietzm (undergrad at Penn starting to work in the Greene Lab). From now until the summer, I'm planning to deploy @zietzm on several small (potentially random) tasks to build his development skills.

In https://github.com/greenelab/hetmech/issues/13 we discussed strategically using scipy.sparse matrices for efficiency. A good first step will be enhancing metaedge_to_adjacency_matrix to optionally return a scipy.sparse matrix.

I think this will be a good task for @zietzm. We probably want to allow the user to choose which matrix backend to use, e.g. numpy.array, numpy.matrix (potentially), or any of the scipy.sparse matrices.

@zietzm would be nice if you add tests to the non-existent test_matrix.py.

dhimmel commented 7 years ago

@zietzm let's aim to keep backwards compatability although if you think a backwards-incompatible API change is warranted that is also fine. Just note that it's a good habit to consider whether any change set is a breaking change or not, i.e. if existing codebases upgrade to the new version will they potentially break?

dhimmel commented 7 years ago

Regarding tests, I'd use the example hetnet used here. @zietzm perhaps the best way to start is to write tests for the current metaedge_to_adjacency_matrix in an initial pull request. Once that's merged, we can move to scipy.sparse support. FYI, we use pytest for testing.

Slightly related protip: write tests first, then add feature. This way you verify tests fail before feature is added. Otherwise, if all tests pass you don't know if its because the feature worked or just that the tests never fail. We haven't followed this convention up to this point, but it's something to keep in mind.

kkloste commented 7 years ago

Nice to meet you, @zietzm

That's a great test-writing tip, thanks Daniel!

Let me know if I can be of any help here?

kkloste commented 7 years ago

Now that I'm back on task in this project, implementing an option in metaedge_to_adjacency_matrix is a priority for me -- it should make a lot of the matrix methods we've been exploring, including the current implementation of dwpc in path_county.py,more efficient.

As a general rule, @dhimmel , now that I've seen the data more up-close I can say that these metaedge adjacency matrices absolutely should be handled using sparse data structures: with the current implementation of metaedge_to_adjacency_matrix using dense matrices, attempting the matrix multiplies in dwpc used GBs of RAM, and eventually caused the computer to run out of memory.

Switching to sparse matrices enables computation using far less -- I don't know exactly, but my guess is < 50 MB used for the matrix operations involved compared to the 10+ GB used otherwise.

dhimmel commented 7 years ago

@zietzm created tests for the current metaedge_to_adjacency_matrix function which returns a in numpy.array in https://github.com/greenelab/hetmech/pull/39.

The next step will be make metaedge_to_adjacency_matrix more flexible so it can return a variety of array backends including the scipy.sparse options. See the line that creates the numpy.array. This is what we need to make modular.

@zietzm does the task make sense?

zietzm commented 7 years ago

@dhimmel Got it. Should there be an option of what type of sparse matrix to use (csr, csc, etc.)? I see we had been wanting to use csc in #13

dhimmel commented 7 years ago

Should there be an option of what type of sparse matrix to use (csr, csc, etc.)?

Yes. Ideally, the APIs are similar enough that we can support all scipy.sparse options. If the APIs differ, we'll have to decide whether the added complexity is worth it.

dhimmel commented 7 years ago

@zietzm I'm thinking the best design will be to allow the user to pass the type. For example, the default will look like:

metaedge_to_adjacency_matrix(graph, metaedge, dtype=numpy.bool_, matrix_type=numpy.array)

So you could change the matrix type like:

metaedge_to_adjacency_matrix(graph, metaedge, matrix_type=scipy.sparse.csr_matrix)

You can check if something is a scipy.sparse by seeing whether it's a subclass of scipy.sparse.spmatrix.