Closed dhimmel closed 7 years ago
@zietzm let's aim to keep backwards compatability although if you think a backwards-incompatible API change is warranted that is also fine. Just note that it's a good habit to consider whether any change set is a breaking change or not, i.e. if existing codebases upgrade to the new version will they potentially break?
Regarding tests, I'd use the example hetnet used here. @zietzm perhaps the best way to start is to write tests for the current metaedge_to_adjacency_matrix
in an initial pull request. Once that's merged, we can move to scipy.sparse support. FYI, we use pytest for testing.
Slightly related protip: write tests first, then add feature. This way you verify tests fail before feature is added. Otherwise, if all tests pass you don't know if its because the feature worked or just that the tests never fail. We haven't followed this convention up to this point, but it's something to keep in mind.
Nice to meet you, @zietzm
That's a great test-writing tip, thanks Daniel!
Let me know if I can be of any help here?
Now that I'm back on task in this project, implementing an option in metaedge_to_adjacency_matrix
is a priority for me -- it should make a lot of the matrix methods we've been exploring, including the current implementation of dwpc
in path_county.py
,more efficient.
As a general rule, @dhimmel , now that I've seen the data more up-close I can say that these metaedge adjacency matrices absolutely should be handled using sparse data structures:
with the current implementation of metaedge_to_adjacency_matrix
using dense matrices, attempting the matrix multiplies in dwpc
used GBs of RAM, and eventually caused the computer to run out of memory.
Switching to sparse matrices enables computation using far less -- I don't know exactly, but my guess is < 50 MB used for the matrix operations involved compared to the 10+ GB used otherwise.
@zietzm created tests for the current metaedge_to_adjacency_matrix
function which returns a in numpy.array
in https://github.com/greenelab/hetmech/pull/39.
The next step will be make metaedge_to_adjacency_matrix
more flexible so it can return a variety of array backends including the scipy.sparse
options. See the line that creates the numpy.array
. This is what we need to make modular.
@zietzm does the task make sense?
@dhimmel Got it. Should there be an option of what type of sparse matrix to use (csr, csc, etc.)? I see we had been wanting to use csc in #13
Should there be an option of what type of sparse matrix to use (csr, csc, etc.)?
Yes. Ideally, the APIs are similar enough that we can support all scipy.sparse options. If the APIs differ, we'll have to decide whether the added complexity is worth it.
@zietzm I'm thinking the best design will be to allow the user to pass the type. For example, the default will look like:
metaedge_to_adjacency_matrix(graph, metaedge, dtype=numpy.bool_, matrix_type=numpy.array)
So you could change the matrix type like:
metaedge_to_adjacency_matrix(graph, metaedge, matrix_type=scipy.sparse.csr_matrix)
You can check if something is a scipy.sparse by seeing whether it's a subclass of scipy.sparse.spmatrix
.
@kkloste meet @zietzm (undergrad at Penn starting to work in the Greene Lab). From now until the summer, I'm planning to deploy @zietzm on several small (potentially random) tasks to build his development skills.
In https://github.com/greenelab/hetmech/issues/13 we discussed strategically using scipy.sparse matrices for efficiency. A good first step will be enhancing
metaedge_to_adjacency_matrix
to optionally return a scipy.sparse matrix.I think this will be a good task for @zietzm. We probably want to allow the user to choose which matrix backend to use, e.g. numpy.array, numpy.matrix (potentially), or any of the scipy.sparse matrices.
@zietzm would be nice if you add tests to the non-existent
test_matrix.py
.