Closed haasad closed 7 years ago
The same issue applies when factorizing a matrix using any other solver. The value returned from the factorize call assumes that the A matrix doesn't change.
On some level, this is an inherent problem with always doing factorization. If factorization were optional, it would be the user's fault if they used a "stale" factorization. But factorization is so fast and convenient with Pardiso that it would be nice to keep it.
By the way, more or less any in-place modification can trigger this behaviour, so this could be a real issue:
In [1]: from scipy.sparse import *
In [2]: m = lil_matrix((2, 2))
In [3]: id(m)
Out[3]: 4375863928
In [4]: m[:, 0] = 1
In [5]: id(m)
Out[5]: 4375863928
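To make the hazard concrete, here is a minimal sketch of the same problem using SciPy's generic factorized (not Pardiso): after an in-place edit, the object identity is unchanged, but a stored factorization still reflects the old values.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import factorized

A = csc_matrix(np.array([[2.0, 0.0], [0.0, 4.0]]))
solve = factorized(A)            # LU factorization happens here, capturing A's current values

A[0, 0] = 10.0                   # in-place edit: same object, same id(A)
x = solve(np.array([1.0, 1.0]))

# x is still [0.5, 0.25], computed from the OLD A, not the modified one
```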
In my opinion, the ideal solution would be to not store the entire matrix, but some piece of information that covers all the matrix data. I haven't found a fast way to hash sparse matrices, but maybe there is something clever around.
For CSR matrices, you could just store the .data array, and 1) compare the shape; 2) sum the absolute value of the difference from old to new. This should handle the case where values are switched but the matrix sum stays the same.
See also bw2calc issue.
This behaviour, in my opinion, makes sense for the factorized method, and there the user should be responsible. The real problem is that this behaviour also applies to spsolve, where the user should expect that no "stale" factorization is applied after modifying matrix A.
For CSR matrices, you could just store the .data array, and 1) compare the shape; 2) sum the absolute value of the difference from old to new. This should handle the case where values are switched but the matrix sum stays the same.
I thought about just comparing A.data, but this doesn't catch stuff like A.transpose(), and therefore A.indices and A.indptr would also need to be checked. I'll try to find the fastest method to do this.
Edit: It would catch A.transpose() because the order changes, you're right. But it wouldn't catch the case where one value is moved from one row to the next, because A.data stays the same.
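A tiny example of that last case, to illustrate why .data alone is not enough:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Same single value, but moved from row 0 to row 1:
a = csr_matrix(np.array([[1.0, 0.0], [0.0, 0.0]]))
b = csr_matrix(np.array([[0.0, 0.0], [1.0, 0.0]]))

same_data = np.array_equal(a.data, b.data)        # True: .data can't tell them apart
same_indptr = np.array_equal(a.indptr, b.indptr)  # False: the row pointers catch the move
```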
Inspired by smart people:
How about this:
import hashlib
def hash_csr_matrix(matrix):
    return (
        str(matrix.shape) +
        hashlib.sha1(matrix.indices).hexdigest() +
        hashlib.sha1(matrix.indptr).hexdigest() +
        hashlib.sha1(matrix.data).hexdigest()
    )
Takes ~2 milliseconds for ecoinvent 3.2 (cutoff), which seems reasonable.
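As a quick sanity check (not from the original comment), the hash reacts to a single in-place change that an identity check would miss:

```python
import hashlib
import numpy as np
from scipy.sparse import csr_matrix

def hash_csr_matrix(matrix):
    # Hash all three CSR arrays plus the shape, as proposed above
    return (str(matrix.shape) +
            hashlib.sha1(matrix.indices).hexdigest() +
            hashlib.sha1(matrix.indptr).hexdigest() +
            hashlib.sha1(matrix.data).hexdigest())

m = csr_matrix(np.eye(3))
h_before = hash_csr_matrix(m)
m.data[0] = 5.0                 # modify a value in place; id(m) is unchanged
h_after = hash_csr_matrix(m)
# h_before != h_after, so a stale factorization would be detected
```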
I came up with this:
import numpy as np

def csr_matrix_equal(a1, a2):
    return (np.array_equal(a1.indptr, a2.indptr) and
            np.array_equal(a1.indices, a2.indices) and
            np.array_equal(a1.data, a2.data))
In my tests this is faster than hashing the matrices, even when the initial copying of one matrix is included. In case of a1 != a2, the comparison time is even lower, because only the first statement (with the shortest array) needs to be evaluated.
Sure, comparing the values is ~10 times faster than hashing them. But you have to keep a copy of the matrix, so you trade off memory usage for speed.
For example, ecoinvent 3.2 (cutoff) is ~1.4 MB of RAM:
indptr: 13,000 times 4 bytes
indices: 120,000 times 4 bytes
data: 120,000 times 8 bytes
So probably this tradeoff is worth it; solving a linear system with Pardiso when it is already factorized is on the order of a few milliseconds, so hashing would double the time. Plus you get guaranteed Py2/3 compatibility. Sigh.
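Spelling out that arithmetic (the counts are the rounded figures quoted above):

```python
# Rough memory footprint of keeping a CSR copy for a matrix with
# ~13,000 rows and ~120,000 nonzeros (the ecoinvent 3.2 cutoff scale above)
indptr_bytes = (13_000 + 1) * 4   # int32 row pointers, one per row plus one
indices_bytes = 120_000 * 4       # int32 column indices
data_bytes = 120_000 * 8          # float64 values
total_mb = (indptr_bytes + indices_bytes + data_bytes) / 1e6
# total_mb comes out around 1.49 MB, consistent with the ~1.4 MB quoted above
```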
True, I haven't really considered the memory usage. I have yet to test whether very large matrices run into RAM limitations or take prohibitively long to solve. The maximum that I tested so far is ~300 MB, which takes around 3 min.
By the way, this is slightly faster:
def csr_matrix_equal2(a1, a2):
    return all((np.array_equal(a1.indptr, a2.indptr),
                np.array_equal(a1.indices, a2.indices),
                np.array_equal(a1.data, a2.data)))
As I was trying to go to sleep last night, I was worried about the Monte Carlo classes, which do rebuild the matrices after doing an initial calculation using the normal solver. Luckily, these classes all rebuild the technosphere matrix, so the existing identity check should be enough to avoid the use of a stale cache in future calculations:
In [6]: mc = MonteCarloLCA({db.random(): 1})
In [7]: next(mc)
Out[7]:
array([ -3.84992617e-05, -4.49016549e-09, -4.14491504e-06, ...,
-2.54873438e-02, -2.40185083e-04, -1.32010176e+00])
In [8]: id(mc.technosphere_matrix)
Out[8]: 4502266096
In [9]: next(mc)
Out[9]:
array([ -3.64070020e-05, -3.45972601e-09, -3.77893232e-06, ...,
-2.23343439e-02, -6.50755958e-04, -1.19602921e+00])
In [10]: id(mc.technosphere_matrix)
Out[10]: 4503150776
Both the MonteCarloLCA and ParameterVectorLCA use rebuild_technosphere_matrix, which creates a new CSR matrix object each time.
Maybe a good balance between compute speed and memory usage would be to store a copy of the matrix up until a certain size, but then switch to checking the hash values, with an advanced option to skip this check entirely.
The rework of the wrapper in cfaa4c58ba276b8f5290ba2999d43537b9f7dde6 now includes the hashing and comparison functions discussed above. The generic threshold for keeping a copy in memory is currently 5e7 non-zero entries, which is roughly 600 MB. This threshold can be set to zero to always use hashing, or very high to always store a copy.
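A minimal sketch of that size-dependent strategy, assuming a CSR input; the function names here (remember, unchanged, _hash) are illustrative, not the wrapper's actual API:

```python
import hashlib
import numpy as np
from scipy.sparse import csr_matrix

NNZ_THRESHOLD = 5e7  # above this, store only a hash (a copy would be ~600 MB)

def _hash(matrix):
    # Hash shape plus all three CSR arrays, as in hash_csr_matrix above
    return (str(matrix.shape) +
            hashlib.sha1(matrix.indices).hexdigest() +
            hashlib.sha1(matrix.indptr).hexdigest() +
            hashlib.sha1(matrix.data).hexdigest())

def remember(matrix):
    """Store a full copy for small matrices, only a hash for large ones."""
    if matrix.nnz <= NNZ_THRESHOLD:
        return ('copy', matrix.copy())
    return ('hash', _hash(matrix))

def unchanged(matrix, stored):
    """True if the matrix still matches what remember() stored."""
    kind, value = stored
    if kind == 'copy':
        return (np.array_equal(matrix.indptr, value.indptr) and
                np.array_equal(matrix.indices, value.indices) and
                np.array_equal(matrix.data, value.data))
    return _hash(matrix) == value
```

Small matrices get the fast array comparison at the cost of a copy; large ones fall back to hashing, and setting the threshold to 0 or a huge value recovers the two pure strategies.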
I'm thinking about adding a method to the wrapper that skips all checks in order to get maximum efficiency, but I'm not sure yet. Manually setting the phase to 33 and using the private _call_pardiso method is already possible, but leads to segfaults when not called properly.
I'll close this issue as soon as I have written tests for this new functionality.
The hashing and the comparison functions are now part of v0.2.0. I'm closing this for now.
At the moment a simple identity comparison is applied to see if matrix A is already factorized:
if self.factorized_A is A:
    self.set_phase(33)
else:
    self.set_phase(13)
While this is convenient and really fast, it breaks when matrix A is modified (e.g. A[0,0] += 1), because self.factorized_A is A will return True and the factorization won't be updated. I can think of two ways to overcome this:
not((self.factorized_A != A).nnz)
I prefer the second one, but I have to check how much it affects the speed of the different methods.
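For reference, the second option boils down to counting differing entries with the sparse != operator; a small standalone example of the same pattern:

```python
import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix(np.array([[1.0, 2.0], [0.0, 3.0]]))
B = A.copy()

unchanged = not (A != B).nnz        # True: no entries differ, reuse factorization
B[0, 0] = 5.0                       # modify an existing entry
still_unchanged = not (A != B).nnz  # False: one entry now differs, refactorize
```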