greenelab / connectivity-search-analyses

hetnet connectivity search research notebooks (previously hetmech)
BSD 3-Clause "New" or "Revised" License

Optimize matrix chain multiplication #53

Closed dhimmel closed 6 years ago

dhimmel commented 7 years ago

@kkloste brought up that the order in which chained matrix multiplications are executed affects efficiency: https://github.com/greenelab/hetmech/pull/49#issuecomment-312283251. It turns out there's a Wikipedia article on this topic.

Numpy has an implementation in numpy.linalg.multi_dot. If we dig through the source code, there's a function _multi_dot_matrix_chain_order, which we could use to compute the optimal ordering.

We have to contend with the presence of mixed numpy and scipy.sparse arrays. Also, we may want to generate the order just from the shapes of the matrices, so that we don't need all matrices in memory at the same time.
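Computing the order from shapes alone is exactly what the classic matrix-chain dynamic program does (and what numpy's private `_multi_dot_matrix_chain_order` implements internally). A minimal sketch, where the function name `matrix_chain_order` is hypothetical, not part of hetmech:

```python
import numpy as np

def matrix_chain_order(dims):
    """Classic matrix-chain-order dynamic program.

    dims has length n + 1; matrix i has shape (dims[i], dims[i + 1]).
    Only the shapes are needed, so the order can be computed before any
    matrix is loaded into memory. Returns the minimal scalar
    multiplication counts and the split points for reconstructing the order.
    """
    n = len(dims) - 1
    cost = np.zeros((n, n))
    split = np.zeros((n, n), dtype=int)
    for length in range(2, n + 1):  # length of the sub-chain
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i, j] = np.inf
            for k in range(i, j):  # try every split point
                q = cost[i, k] + cost[k + 1, j] + dims[i] * dims[k + 1] * dims[j + 1]
                if q < cost[i, j]:
                    cost[i, j] = q
                    split[i, j] = k
    return cost, split

# Shapes 10x100, 100x5, 5x50: (AB)C costs 7500 scalar multiplications,
# while A(BC) costs 75000, so the optimal split is after the second matrix.
cost, split = matrix_chain_order([10, 100, 5, 50])
```

For the full chain, `split` is then read recursively to decide where to parenthesize.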

kkloste commented 7 years ago

I agree, generate the order using only the shapes of the matrices first.

zietzm commented 7 years ago

I have been running some profiles of various functions within degree_weight, and I have noticed that for dense matrices, the biggest time sink in many functions (e.g. degree_weight.dwpc) is functools.reduce. This makes sense, given that an inefficient matrix multiplication ordering could be seriously slowing down our functions. I presume this is the line (among others) that we want to address with this issue.

However, matrix multiplication does not appear to be a significant time sink when it comes to sparse matrices. Instead, several odd sparse-only operations appear to be the biggest time sinks. For example, I just traced back through several functions and found that the biggest time use in degree_weight.dwpc with sparse_threshold=1 (i.e. sparse-only throughout) was the conversion from csr to csc matrices. This seemed odd, because we haven't been using csr at all (intentionally, at least). It turns out that in our current dwwc_step, performing degree-weighting on a sparse.csc adjacency matrix returns a sparse.coo_matrix. Subsequent multiplication of a csc and a coo returns a csr matrix.

If I add a branch to matrix.auto_convert so that it reads

    elif not sparse.issparse(matrix) and not above_thresh:
        return sparse.csc_matrix(matrix)

then we avoid converting from csr to csc. As I mentioned in #47 earlier today, adding this branch makes dwpc about 16% faster when using sparse matrices. However, I understand that csc may have been chosen for good reason, and perhaps we actually want to use csc all the time. If so, we would need to edit something in degree_weight.dwwc_step to ensure we output csc.
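For concreteness, here is a hypothetical sketch of what an auto_convert with that extra branch could look like; the density-threshold logic here is an assumption, and only the elif branch comes from the snippet above:

```python
import numpy as np
from scipy import sparse

def auto_convert(matrix, threshold=0.5):
    # Hypothetical sketch: hetmech's real matrix.auto_convert may differ.
    # Compute the fraction of nonzero entries.
    if sparse.issparse(matrix):
        density = matrix.nnz / (matrix.shape[0] * matrix.shape[1])
    else:
        density = np.count_nonzero(matrix) / matrix.size
    above_thresh = density >= threshold

    if sparse.issparse(matrix) and above_thresh:
        # Dense enough that a dense array is cheaper to work with.
        return matrix.toarray()
    elif not sparse.issparse(matrix) and not above_thresh:
        # The added branch: sparsify low-density dense matrices as csc.
        return sparse.csc_matrix(matrix)
    return matrix
```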

The point of all this is the following: it may not be necessary to deal with multiplication ordering when using sparse matrices, because sparse matrix multiplication is much faster as-is. If we want to save time with dense matrices, better chain ordering seems very useful. For sparse matrices, however, time may be better spent optimizing the other parts that actually use a significant amount of time.

Hopefully this could save us some time, as we might not need to

> contend with the presence of mixed numpy and scipy.sparse arrays.

kkloste commented 7 years ago

We discussed a while back that we should have all the code use a single sparse format. I'm not sure how we ended up returning a coo matrix -- is it because of something in matrix.normalize (which we call inside dwwc_step)? I don't see how dwwc_step itself could cause conversion to coo.

In any case, great find @zietzm. The ideal situation is that we have as few conversions as possible (those take time) -- so my recommendation is that we find the places where things get formatted as coo or csr, and change them so everything is csc.

kkloste commented 7 years ago

Regarding sparse matrices and the ordering of multiplication:

So with that in mind, I recommend we plan on applying the multiplication ordering to sparse matrices as well as dense. But you make a good point, @zietzm, that it might not be worthwhile in all sparse cases to apply the matrix-ordering subroutines.

zietzm commented 7 years ago

> find the places where things get formatted as coo

This line returns a matrix in the coo format. I am working out how to do this without having to convert matrix types. It appears that matrix * vector is the same as matrix @ vector when dealing with sparse matrices.

kkloste commented 7 years ago

I think that is consistent with what @dhimmel found a couple of months ago when looking at the subtle differences across the different ways of calling the algebraic operation of "multiply these two arrays together".

zietzm commented 7 years ago

In trying to find a workaround for matrix.normalize so that only sparse.csc_matrix objects are given as output, I have found a way that I think should work. However, when running all the tests, I noticed that it fails a test in test_diffusion. Specifically, it appears that my method makes a copy in numpy, which causes test_diffusion.TestDualNormalize.test_diffusion_step_copy to fail when the input dtype=numpy.float64.

My method for matrix-vector multiplication is essentially the following:

  1. If multiplying by a column vector, transpose the matrix
  2. Make the vector a sparse row vector
  3. Multiply the matrix and vector
  4. Transpose the product back.

This method keeps all the outputs as csc and produces the correct data. Additionally, it isn't nearly as slow as converting directly from coo_matrix or csr_matrix to csc. However, I cannot get the object ids to be the same.

In numpy and scipy, matrix.T.T is matrix returns False, even though the scipy documentation and source code suggest that the default argument is copy=False. Even when I pass copy=False explicitly, I still don't get the same object id.

    # matrix is a sparse.csc_matrix
    matrix.transpose(copy=False).transpose(copy=False) is matrix
    >>> False

In fact, digging through the source code to see how transpose works for sparse matrices, I see that the copy argument doesn't seem to be very useful for sparse matrices. For example,

    # matrix is a sparse.csc_matrix
    matrix is sparse.csc_matrix(matrix, copy=False)
    >>> False

    # and yet
    type(matrix) is type(sparse.csc_matrix(matrix, copy=False))
    >>> True

I don't know if this was intentional on SciPy's end, but it seems very unintuitive.

My issue now is that I have a method that corrects the matrix type in matrix.normalize, but fails to avoid copying the matrix when copy=False and dtype=numpy.float64. I have traded one problem for another.

Should I try to find a different way to correct the matrix type in multiplication so that we can maybe do it all in place, or should we use this new method despite the fact that it makes a copy of the matrix? Maybe it comes down to which is more important for performance, in-place but with conversion or with-copy but without conversion.

kkloste commented 7 years ago

It seems to me we can avoid the problem if we change the way we multiply matrix and vector in this code block. Instead of using vector as a dense vector, we ought to simply convert it to a sparse diagonal matrix via diagvector = scipy.spdiags(vector, [0]).

I believe matrix @ diagvector will then return a matrix of the desired format, regardless of whether matrix is sparse or dense?

kkloste commented 7 years ago

Looking at our previous code that uses this functionality, I had the usage incorrect -- it should be diagvector = scipy.sparse.diags(vector).

zietzm commented 7 years ago

That makes sense, and will probably be much faster than the way I just came up with! Great idea @kkloste! Just one thing: I think we will run into the same problem with test_diffusion_step_copy. I just implemented this method, and I don't see a clear way to keep from copying the matrices.

zietzm commented 7 years ago

matrix @ sparse.diags(vector) gives the correct data for column normalization in the correct format. sparse.diags(vector) @ matrix gives the correct data for row normalization, but in sparse.csr_matrix format.

If we do (matrix.T @ sparse.diags(vector)).T then we get the correct data for row normalization in the correct format.
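These format observations can be checked with a small self-contained sketch; the resulting formats follow scipy's convention of producing output in the left operand's compressed format:

```python
import numpy as np
from scipy import sparse

matrix = sparse.csc_matrix(np.arange(9.0).reshape(3, 3))
vector = np.array([1.0, 2.0, 3.0])

# Column normalization: scale column j by vector[j]; the result stays csc.
col_scaled = matrix @ sparse.diags(vector)

# Row normalization the direct way: scale row i by vector[i];
# the product comes back as csr.
row_csr = sparse.diags(vector) @ matrix

# The transpose trick keeps the row-normalized result in csc:
# matrix.T is csr, csr @ diag is csr, and transposing back yields csc.
row_csc = (matrix.T @ sparse.diags(vector)).T
```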

kkloste commented 7 years ago

Good catch with csr_matrix.

Can you point to the line number where a matrix is being copied? I'm not sure what you mean.

zietzm commented 7 years ago

So the error is being thrown here, which is to say that despite passing copy=False and dtype=numpy.float64, input_matrix is not matrix.

I think the break where a copy is made is here if we are implementing the new method of multiplication. Under the old method, the test passed so the two did actually point to the same object. I'm not sure how we could do this now, though.

The error arises from the fact that we no longer have matrix *= vector in the multiplication. matrix *= vector is not the same in Python as matrix = matrix * vector, because the former multiplies in place. So one thing to note is that even under the old system, the test actually worked only when dealing with dense matrices. Had we multiplied by a sparse matrix in a test, the test would have failed, as matrix.multiply(vector) does not work in place.
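The in-place distinction can be demonstrated with plain numpy and scipy objects (illustrative, not hetmech code):

```python
import numpy as np
from scipy import sparse

# Dense: *= mutates the array in place, so aliases observe the change
# and object identity is preserved.
dense = np.ones((2, 2))
alias = dense
dense *= 2.0

# Sparse: multiply() always builds a new object; there is no in-place
# elementwise multiply, so rebinding breaks object identity.
sp = sparse.csc_matrix(np.ones((2, 2)))
product = sp.multiply(2.0)
```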

dhimmel commented 7 years ago

@zietzm I wouldn't worry too much about the copy=False creating a new object. The argument is really only there to guarantee the input matrix doesn't get edited when copy=True. You can relax the tests.

dhimmel commented 7 years ago

Tagging @mmayers12. Have you considered matrix chain multiplication order? Our first priority is getting the most general matrix-DWPC implementation, but this would be a follow up optimization.

mmayers12 commented 7 years ago

I haven't really needed to look into this optimization yet. Most of the multiplications are really fast (200-300 ms per metapath). This includes all the fancy removal-of-diagonal steps in between. Whenever possible, I've used np.prod to multiply a list of sparse matrices, though I don't know if it has any of this kind of optimization built in. Finally, all matrices in our implementation are square, so I'm not sure this is an issue, either.

A couple of notes on speed optimizations I've done: all of the matrices (when working with Rephetio) are square matrices of size 47,031 and type scipy.sparse.csc_matrix. These are very fast to work with, and sparse really does a good job of keeping them small in memory. Holding the 24 adjacency matrices from Rephetio and 25 degree-weighted adjacency matrices uses 0.7 GB of RAM. Without parallel processing, the multiplication steps for all 1206 metapaths took 19 min 50 sec (so just under 1 sec per metapath), and the RAM requirements topped out at 7.6 GB. Upping to 16 cores reduces this time to around 3 min, but increases the RAM requirement to at least 35 GB.

During my tests with sparse matrices I found that most operations are fast:

Before doing these tests, I was generating matrices by adding 1s to a dense zeros matrix first, then converting to sparse. This took about 20 min for the 24 different edges. These tests led me to use diags(np.zeros((1, size))).tolil() to generate the initial empty adjacency matrix. 1s can quickly be added to a lil_matrix. This brought the time down to about 1 min 30 sec to generate the same 24 adjacency matrices.
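The construction pattern described above can be sketched as follows, with a small size standing in for the 47,031-node Rephetio matrices:

```python
import numpy as np
from scipy import sparse

size = 100  # stand-in for the 47031 x 47031 Rephetio matrices

# Slow pattern: fill a dense zeros matrix, then sparsify once at the end.
dense = np.zeros((size, size))
dense[0, 1] = 1
slow = sparse.csc_matrix(dense)

# Faster pattern: start from an empty lil_matrix, which supports cheap
# incremental assignment, then convert once to csc for arithmetic.
adjacency = sparse.lil_matrix((size, size))
adjacency[0, 1] = 1
fast = adjacency.tocsc()
```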